CN117851860A - Method for automatically generating data classification grading template - Google Patents

Method for automatically generating data classification grading template

Info

Publication number
CN117851860A
CN117851860A (application CN202311712760.0A)
Authority
CN
China
Prior art keywords
classification
data
template
class
subclasses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311712760.0A
Other languages
Chinese (zh)
Inventor
全锦琪
李韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202311712760.0A priority Critical patent/CN117851860A/en
Publication of CN117851860A publication Critical patent/CN117851860A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for automatically generating a data classification and grading template, which comprises the following steps: S1, reading a preset data source; S2, performing constrained hierarchical clustering; S3, forming a classification catalog with a multi-layer tree structure; S4, training a machine learning model as the recognition rule for each leaf classification; and S5, applying the classification and grading template to a target data source. According to the invention, metadata information and content sampling data are extracted from the preset data source and considered together; hierarchical clustering is combined with natural language processing to generate a descriptive and accurate classification catalog and classification names; at the same time, machine learning models are trained to produce the recognition rules of the leaf classifications, so that a complete classification and grading template is formed.

Description

Method for automatically generating data classification grading template
Technical Field
The invention relates to the technical field of data classification and grading, in particular to a method for automatically generating a data classification and grading template.
Background
In the current information age, the volume and complexity of data grow exponentially, and data management and security face ever greater challenges; classifying and grading data so that it can be protected more effectively is therefore an important task. By classifying data, similar data is assigned to the same category and data resources can be better organized and managed. By grading data, appropriate access rights and security controls can be determined according to the category, sensitivity and importance of the data.
To implement data classification and grading, a classification and grading template generally needs to be constructed. Such a template typically contains a classification catalog organized as a hierarchical tree, in which each leaf classification carries specific recognition rules for judging whether a piece of data belongs to that class. Once constructed, the template can be applied to recognize, classify and grade the data in a data source.
At present, classification and grading templates are usually constructed manually: professionals must define the classification standards and recognition rules of the data according to domain knowledge and experience. However, as data scale and complexity grow, manually designing and building templates and recognition rules becomes difficult and time-consuming, and is subject to subjective bias and error.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method for automatically generating a data classification and grading template, which improves both the efficiency of constructing the template and the accuracy of recognition.
In order to achieve the above object, the invention is realized by the following technical scheme: a method for automatically generating a data classification and grading template, the method comprising the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
Further, the step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
Further, the step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
Further, the step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; and the step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
Further, a constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
Further, the step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
Further, the step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
Further, the step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, confusable or newly added classifications are trained iteratively.
Further, the step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
Further, the fields or files in the step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
The invention has the following beneficial effects:
Automatic generation of the data classification and grading template: metadata information and content sampling data are extracted from a preset data source and considered together; hierarchical clustering is combined with natural language processing to generate a descriptive and accurate classification catalog and classification names; at the same time, machine learning models are trained to produce the recognition rules of the leaf classifications, so that a complete classification and grading template is formed.
Improved recognition accuracy: while training the machine learning models that serve as the leaf classification recognition rules, after each round of training the model is used to predict the other classifications, and mispredicted samples are marked as negative samples for retraining. Continuously introducing mispredicted samples as negatives improves classification accuracy and makes the results more reliable.
Flexibility: a separate machine learning model is trained as the recognition rule of each leaf classification, so adding or removing classifications does not affect the existing ones. The template can therefore be adjusted and optimized to meet the classification needs of different fields and scenarios.
High efficiency: by reading a preset data source and combining hierarchical clustering with machine learning, the classification and grading template can be generated quickly and then applied to a target data source for efficient classification and grading.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of generating a classification and grading template according to the present invention;
FIG. 2 is a flow chart of constrained hierarchical clustering according to the present invention;
FIG. 3 is a flow chart of training recognition rules according to the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objects and effects of the invention are easy to understand.
Referring to FIGS. 1-3, a method for automatically generating a data classification and grading template comprises the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
The step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
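By way of non-limiting illustration, reading the metadata and content samples of step S1 could look like the Python sketch below. It assumes the preset data source is an SQLite database accessed through the standard sqlite3 module; SQLite stores no column comments, so only table names, field names, declared types and randomly sampled values are collected, and the database path and sample size are placeholders.

import sqlite3

def read_metadata_and_samples(db_path, sample_size=50):
    """Read field metadata and randomly sampled content from an SQLite source."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    fields = []
    # Enumerate user tables via SQLite's catalog table.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in cur.fetchall():
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        cur.execute(f"PRAGMA table_info({table})")
        for _, col, col_type, *_ in cur.fetchall():
            # Content sampling data: a random sample of the field's values.
            cur.execute(f"SELECT {col} FROM {table} ORDER BY RANDOM() LIMIT ?",
                        (sample_size,))
            samples = [str(v) for (v,) in cur.fetchall() if v is not None]
            fields.append({"table": table, "field": col,
                           "type": col_type, "samples": samples})
    conn.close()
    return fields

For other relational databases, the same information, including table and column comments, would typically be read from the engine's system catalogs or information_schema views.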
The step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
The step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; the step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
A constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
The step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses (fields or files) using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files, such as labels, descriptions and types; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
The step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
The step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, especially for confusable or newly added classifications, the model is trained iteratively, improving its accuracy and generalization ability.
The step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
The fields or files in step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
Working principle: the method comprises the following steps:
Step 1, assuming a well-structured relational database as the preset data source, the system connects to the database and reads the metadata information and content sampling data of the fields it contains. The metadata information includes database names and comments, table names and comments, and field names, comments and types; the content sampling data is obtained by randomly sampling the values of the fields.
Step 2, taking each field as an initial class cluster, the two most similar class clusters are merged iteratively until the total number of class clusters is at most 5 or the maximum similarity between class clusters is at most 30%. When computing similarity, text data such as database/table/field names and comments, together with the content sampling data, are first preprocessed (word segmentation, stop-word removal, etc.). A pre-trained Word2Vec model is then used to take a weighted average of the word vectors of the preprocessed words; the resulting vector serves as the class-cluster vector, and the cosine similarity between class-cluster vectors gives the similarity between class clusters. A constraint is also enforced during clustering: a class cluster containing more than 20 minimum subclasses may no longer participate in merging.
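As a non-limiting sketch, the constrained merging of step 2 could be implemented as follows; it assumes every field has already been reduced to a single vector (for example the weighted average of pre-trained Word2Vec vectors of its preprocessed tokens), and the values 5, 0.30 and 20 mirror the thresholds N, θ1 and M of this example.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def constrained_clustering(vectors, max_clusters=5, min_sim=0.30, max_leaves=20):
    """Greedy agglomerative clustering with the constraints of step 2.

    vectors: one 1-D numpy array per field (its averaged Word2Vec vector).
    Each cluster keeps its member leaf indices, an approximate centroid and
    the two children it was merged from, so the full cluster tree is kept.
    """
    clusters = [{"leaves": [i], "vec": v, "children": []}
                for i, v in enumerate(vectors)]
    while len(clusters) > max_clusters:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Constraint: a cluster with more than max_leaves minimum
                # subclasses may no longer take part in merging.
                if (len(clusters[i]["leaves"]) > max_leaves
                        or len(clusters[j]["leaves"]) > max_leaves):
                    continue
                s = cosine(clusters[i]["vec"], clusters[j]["vec"])
                if s > best:
                    best, pair = s, (i, j)
        # Stop when nothing can merge or the best similarity drops to 30% or below.
        if pair is None or best <= min_sim:
            break
        i, j = pair
        merged = {"leaves": clusters[i]["leaves"] + clusters[j]["leaves"],
                  "vec": (clusters[i]["vec"] + clusters[j]["vec"]) / 2.0,
                  "children": [clusters[i], clusters[j]]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters  # roots of the resulting cluster forest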
Step 3, the cluster tree formed during the clustering of step 2 is pruned: the class clusters in the tree are traversed from bottom to top, and if the similarity between the minimum subclasses within a class cluster is greater than 60%, those minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and all the minimum subclasses become direct subclasses of that cluster.
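The bottom-up pruning of step 3 could then be sketched as below, operating on the cluster dictionaries produced by the clustering sketch above; leaf_vectors maps a leaf index to its field vector and 0.60 stands in for θ2.

import numpy as np

def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def prune(cluster, leaf_vectors, theta2=0.60):
    """If every pair of minimum subclasses under a cluster is more than theta2
    similar, remove the intermediate merge nodes and attach the minimum
    subclasses as direct children ("tiling")."""
    if not cluster["children"]:
        return cluster
    # Recurse first so that pruning proceeds bottom-up.
    cluster["children"] = [prune(c, leaf_vectors, theta2)
                           for c in cluster["children"]]
    leaves = cluster["leaves"]
    if len(leaves) > 1:
        min_sim = min(_cos(leaf_vectors[a], leaf_vectors[b])
                      for x, a in enumerate(leaves) for b in leaves[x + 1:])
        if min_sim > theta2:
            cluster["children"] = [{"leaves": [i], "vec": leaf_vectors[i],
                                    "children": []} for i in leaves]
    return cluster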
Step 4, for each class cluster in the cluster tree pruned in step 3, named entities are extracted from the text content of the contained fields or files using named entity recognition (NER); their frequencies are counted, and a high-frequency named entity is selected as the name of the class cluster and as a category in the classification catalog. All class clusters are organized according to their hierarchical relationship to form a complete classification catalog.
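One possible realization of this naming step is sketched below using spaCy for named entity recognition; the model name "zh_core_web_sm" is only an assumed pre-trained Chinese pipeline, and any NER component could be substituted.

from collections import Counter

import spacy

# Assumed pre-trained NER pipeline; any spaCy model with an "ner" component works.
nlp = spacy.load("zh_core_web_sm")

def name_cluster(texts, fallback="unnamed"):
    """Pick the most frequent named entity in the cluster's text content
    (field names, comments and sampled values) as the cluster name."""
    counts = Counter()
    for text in texts:
        for ent in nlp(text).ents:
            counts[ent.text] += 1
    return counts.most_common(1)[0][0] if counts else fallback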
Step 5, for each minimum subclass in the classification catalog generated in step 4, a certain proportion of samples is randomly selected from that class as positive samples and a certain proportion of samples is randomly selected from the other classifications as negative samples, and a TextCNN model is trained; the trained model serves as the recognition rule of that classification. Furthermore, for confusable classifications, samples of each are added as negative samples of the other and the models are retrained to improve recognition accuracy.
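The training loop of step 5, with its iterative hard-negative mining, could be sketched as follows. Purely for brevity the TextCNN is replaced by a TF-IDF plus logistic-regression pipeline from scikit-learn, and the number of rounds, the negative-sampling ratio and the stopping threshold are assumed parameters.

import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_leaf_rule(pos_texts, other_texts, rounds=3, neg_ratio=1.0, target_rate=0.05):
    """Train a one-vs-rest recognizer for one leaf classification, repeatedly
    adding mispredicted samples from the other classifications as negatives."""
    k = min(len(other_texts), max(1, int(len(pos_texts) * neg_ratio)))
    negatives = random.sample(other_texts, k)
    model = None
    for _ in range(rounds):
        x = pos_texts + negatives
        y = [1] * len(pos_texts) + [0] * len(negatives)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(x, y)
        # Predict the other classifications; anything predicted as positive is
        # a misprediction and becomes a new negative sample.
        preds = model.predict(other_texts)
        hard = [t for t, p in zip(other_texts, preds) if p == 1 and t not in negatives]
        if len(hard) / max(len(other_texts), 1) <= target_rate:
            break
        negatives += hard
    return model

Mutually confusable classifications can be handled in the same spirit by feeding one class's positive samples into the other's negative pool before retraining.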
Step 6, the classification catalog and the recognition rules generated above are combined into a complete classification and grading template. The template is then applied to a new data source, which can thereby be classified and graded.
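Finally, applying the generated template to a new data source might reduce to running every leaf model over the sampled content of each new field and keeping the best-scoring leaf, as in this sketch (the 0.5 acceptance threshold is an assumption):

def classify_field(field_texts, leaf_models):
    """leaf_models maps a leaf-classification name to its trained recognizer;
    the field is assigned to the leaf whose model gives the highest average
    positive probability, provided it beats the acceptance threshold."""
    best_name, best_score = None, 0.5
    for name, model in leaf_models.items():
        # predict_proba columns follow model.classes_, so column 1 is "positive".
        score = model.predict_proba(field_texts)[:, 1].mean()
        if score > best_score:
            best_name, best_score = name, score
    return best_name  # None means the field matched no leaf classification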
While the fundamental and principal features and advantages of the invention have been shown and described, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing exemplary embodiments and may be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single technical solution; the description is written in this way merely for clarity, and the specification should be taken as a whole, the technical solutions of the individual embodiments being capable of suitable combination to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A method for automatically generating a data classification and grading template, characterized in that the method comprises the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
2. The method for automatically generating a data classification and grading template according to claim 1, wherein step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
3. The method for automatically generating a data classification and grading template according to claim 2, wherein step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
4. The method for automatically generating a data classification and grading template according to claim 3, wherein step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; and step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
5. The method for automatically generating a data classification and grading template according to claim 4, wherein a constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
6. The method for automatically generating a data classification and grading template according to claim 5, wherein step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
7. The method for automatically generating a data classification and grading template according to claim 6, wherein step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
8. The method for automatically generating a data classification and grading template according to claim 7, wherein step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, confusable or newly added classifications are trained iteratively.
9. The method for automatically generating a data classification and grading template according to claim 8, wherein step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
10. The method for automatically generating a data classification and grading template according to claim 3, wherein the fields or files in step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
CN202311712760.0A 2023-12-13 2023-12-13 Method for automatically generating data classification grading template Pending CN117851860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311712760.0A CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311712760.0A CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Publications (1)

Publication Number Publication Date
CN117851860A true CN117851860A (en) 2024-04-09

Family

ID=90537296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311712760.0A Pending CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Country Status (1)

Country Link
CN (1) CN117851860A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118276793A (en) * 2024-06-04 2024-07-02 江苏达海智能系统股份有限公司 Method and system for collecting facility heterogeneous data for building intellectualization
CN118276793B (en) * 2024-06-04 2024-08-13 江苏达海智能系统股份有限公司 Method and system for collecting facility heterogeneous data for building intellectualization

Similar Documents

Publication Publication Date Title
CN110413780B (en) Text emotion analysis method and electronic equipment
CN105677873B (en) Text Intelligence association cluster based on model of the domain knowledge collects processing method
CN110162591B (en) Entity alignment method and system for digital education resources
CN112182148B (en) Standard aided writing method based on full text retrieval
WO2022081812A1 (en) Artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
CN117851860A (en) Method for automatically generating data classification grading template
CN115794803B (en) Engineering audit problem monitoring method and system based on big data AI technology
CN113254507A (en) Intelligent construction and inventory method for data asset directory
CN113742396A (en) Mining method and device for object learning behavior pattern
CN117473431A (en) Airport data classification and classification method and system based on knowledge graph
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN118035440A (en) Enterprise associated archive management target knowledge feature recommendation method
CN112286799B (en) Software defect positioning method combining sentence embedding and particle swarm optimization algorithm
CN117573876A (en) Service data classification and classification method and device
CN117111890A (en) Software requirement document analysis method, device and medium
CN113254583B (en) Document marking method, device and medium based on semantic vector
CN111274404B (en) Small sample entity multi-field classification method based on man-machine cooperation
CN114741512A (en) Automatic text classification method and system
Jingliang et al. A data-driven approach based on LDA for identifying duplicate bug report
CN117648635B (en) Sensitive information classification and classification method and system and electronic equipment
CN117544831B (en) Automatic decomposing method and system for classroom teaching links
CN115858738B (en) Enterprise public opinion information similarity identification method
CN117251605B (en) Multi-source data query method and system based on deep learning
CN118377771B (en) Data modeling method and system based on graph data structure
Timuçin et al. Initial seed value effectiveness on performances of data mining algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination