CN117851860A - Method for automatically generating data classification grading template - Google Patents

Method for automatically generating data classification grading template

Info

Publication number
CN117851860A
CN117851860A (application CN202311712760.0A)
Authority
CN
China
Prior art keywords
classification
data
template
class
subclasses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311712760.0A
Other languages
Chinese (zh)
Inventor
全锦琪
李韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202311712760.0A priority Critical patent/CN117851860A/en
Publication of CN117851860A publication Critical patent/CN117851860A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for automatically generating a data classification and grading template, which comprises the following steps: S1, reading a preset data source; S2, performing constrained hierarchical clustering; S3, forming a classification catalog with a multi-layer tree structure; S4, training a machine learning model as the recognition rule for each leaf classification; and S5, applying the classification and grading template to a target data source. According to the invention, metadata information and content sampling data are extracted from the preset data source and considered together; hierarchical clustering is combined with natural language processing to generate a descriptive and accurate classification catalog and classification names; at the same time, machine learning models are trained to produce the recognition rules of the leaf classifications, so that a complete classification and grading template is formed.

Description

Method for automatically generating data classification grading template
Technical Field
The invention relates to the technical field of data classification and grading, in particular to a method for automatically generating a data classification and grading template.
Background
In the current information age, the volume and complexity of data grow exponentially, and data management and security face ever greater challenges; classifying and grading data so that it can be protected more effectively is therefore an important task. By classifying data, similar data is assigned to the same category and data resources can be better organized and managed. By grading data, appropriate access rights and security controls can be determined according to the category, sensitivity and importance of the data.
To implement data classification and grading, a classification and grading template generally needs to be constructed. Such a template typically contains a classification catalog organized as a hierarchical tree, in which each leaf classification carries specific recognition rules for judging whether a piece of data belongs to that class. Once constructed, the template can be applied to recognize, classify and grade the data in a data source.
At present, classification and grading templates are usually constructed manually: professionals must define the classification standards and recognition rules of the data according to domain knowledge and experience. However, as data scale and complexity grow, manually designing and building templates and recognition rules becomes difficult and time-consuming, and is subject to subjective bias and error.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method for automatically generating a data classification and grading template, which improves both the efficiency of constructing the template and the accuracy of recognition.
In order to achieve the above object, the invention is realized by the following technical scheme: a method for automatically generating a data classification and grading template, the method comprising the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
Further, the step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
Further, the step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
Further, the step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; and the step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
Further, a constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
Further, the step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
Further, the step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
Further, the step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, confusable or newly added classifications are trained iteratively.
Further, the step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
Further, the fields or files in the step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
The invention has the following beneficial effects:
Automatic generation of the data classification and grading template: metadata information and content sampling data are extracted from a preset data source and considered together; hierarchical clustering is combined with natural language processing to generate a descriptive and accurate classification catalog and classification names; at the same time, machine learning models are trained to produce the recognition rules of the leaf classifications, so that a complete classification and grading template is formed.
Improved recognition accuracy: while training the machine learning models that serve as the leaf classification recognition rules, after each round of training the model is used to predict the other classifications, and mispredicted samples are marked as negative samples for retraining. Continuously introducing mispredicted samples as negatives improves classification accuracy and makes the results more reliable.
Flexibility: a separate machine learning model is trained as the recognition rule of each leaf classification, so adding or removing classifications does not affect the existing ones. The template can therefore be adjusted and optimized to meet the classification needs of different fields and scenarios.
High efficiency: by reading a preset data source and combining hierarchical clustering with machine learning, the classification and grading template can be generated quickly and then applied to a target data source for efficient classification and grading.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of generating a classification and grading template according to the present invention;
FIG. 2 is a flow chart of constrained hierarchical clustering according to the present invention;
FIG. 3 is a flow chart of training recognition rules according to the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objects and effects of the invention are easy to understand.
Referring to FIGS. 1-3, a method for automatically generating a data classification and grading template comprises the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
The step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
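By way of non-limiting illustration, reading the metadata and content samples of step S1 could look like the Python sketch below. It assumes the preset data source is an SQLite database accessed through the standard sqlite3 module; SQLite stores no column comments, so only table names, field names, declared types and randomly sampled values are collected, and the database path and sample size are placeholders.

import sqlite3

def read_metadata_and_samples(db_path, sample_size=50):
    """Read field metadata and randomly sampled content from an SQLite source."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    fields = []
    # Enumerate user tables via SQLite's catalog table.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in cur.fetchall():
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        cur.execute(f"PRAGMA table_info({table})")
        for _, col, col_type, *_ in cur.fetchall():
            # Content sampling data: a random sample of the field's values.
            cur.execute(f"SELECT {col} FROM {table} ORDER BY RANDOM() LIMIT ?",
                        (sample_size,))
            samples = [str(v) for (v,) in cur.fetchall() if v is not None]
            fields.append({"table": table, "field": col,
                           "type": col_type, "samples": samples})
    conn.close()
    return fields

For other relational databases, the same information, including table and column comments, would typically be read from the engine's system catalogs or information_schema views.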
The step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
The step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; the step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
A constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
The step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses (fields or files) using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files, such as labels, descriptions and types; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
The step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
The step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, especially for confusable or newly added classifications, the model is trained iteratively, improving its accuracy and generalization ability.
The step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
The fields or files in step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
Working principle: the method comprises the following steps:
Step 1, assuming a well-structured relational database as the preset data source, the system connects to the database and reads the metadata information and content sampling data of the fields it contains. The metadata information includes database names and comments, table names and comments, and field names, comments and types; the content sampling data is obtained by randomly sampling the values of the fields.
Step 2, taking each field as an initial class cluster, the two most similar class clusters are merged iteratively until the total number of class clusters is at most 5 or the maximum similarity between class clusters is at most 30%. When computing similarity, text data such as database/table/field names and comments, together with the content sampling data, are first preprocessed (word segmentation, stop-word removal, etc.). A pre-trained Word2Vec model is then used to take a weighted average of the word vectors of the preprocessed words; the resulting vector serves as the class-cluster vector, and the cosine similarity between class-cluster vectors gives the similarity between class clusters. A constraint is also enforced during clustering: a class cluster containing more than 20 minimum subclasses may no longer participate in merging.
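As a non-limiting sketch, the constrained merging of step 2 could be implemented as follows; it assumes every field has already been reduced to a single vector (for example the weighted average of pre-trained Word2Vec vectors of its preprocessed tokens), and the values 5, 0.30 and 20 mirror the thresholds N, θ1 and M of this example.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def constrained_clustering(vectors, max_clusters=5, min_sim=0.30, max_leaves=20):
    """Greedy agglomerative clustering with the constraints of step 2.

    vectors: one 1-D numpy array per field (its averaged Word2Vec vector).
    Each cluster keeps its member leaf indices, an approximate centroid and
    the two children it was merged from, so the full cluster tree is kept.
    """
    clusters = [{"leaves": [i], "vec": v, "children": []}
                for i, v in enumerate(vectors)]
    while len(clusters) > max_clusters:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Constraint: a cluster with more than max_leaves minimum
                # subclasses may no longer take part in merging.
                if (len(clusters[i]["leaves"]) > max_leaves
                        or len(clusters[j]["leaves"]) > max_leaves):
                    continue
                s = cosine(clusters[i]["vec"], clusters[j]["vec"])
                if s > best:
                    best, pair = s, (i, j)
        # Stop when nothing can merge or the best similarity drops to 30% or below.
        if pair is None or best <= min_sim:
            break
        i, j = pair
        merged = {"leaves": clusters[i]["leaves"] + clusters[j]["leaves"],
                  "vec": (clusters[i]["vec"] + clusters[j]["vec"]) / 2.0,
                  "children": [clusters[i], clusters[j]]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters  # roots of the resulting cluster forest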
Step 3, the cluster tree formed during the clustering of step 2 is pruned: the class clusters in the tree are traversed from bottom to top, and if the similarity between the minimum subclasses within a class cluster is greater than 60%, those minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and all the minimum subclasses become direct subclasses of that cluster.
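The bottom-up pruning of step 3 could then be sketched as below, operating on the cluster dictionaries produced by the clustering sketch above; leaf_vectors maps a leaf index to its field vector and 0.60 stands in for θ2.

import numpy as np

def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def prune(cluster, leaf_vectors, theta2=0.60):
    """If every pair of minimum subclasses under a cluster is more than theta2
    similar, remove the intermediate merge nodes and attach the minimum
    subclasses as direct children ("tiling")."""
    if not cluster["children"]:
        return cluster
    # Recurse first so that pruning proceeds bottom-up.
    cluster["children"] = [prune(c, leaf_vectors, theta2)
                           for c in cluster["children"]]
    leaves = cluster["leaves"]
    if len(leaves) > 1:
        min_sim = min(_cos(leaf_vectors[a], leaf_vectors[b])
                      for x, a in enumerate(leaves) for b in leaves[x + 1:])
        if min_sim > theta2:
            cluster["children"] = [{"leaves": [i], "vec": leaf_vectors[i],
                                    "children": []} for i in leaves]
    return cluster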
Step 4, for each class cluster in the cluster tree pruned in step 3, named entities are extracted from the text content of the contained fields or files using named entity recognition (NER); their frequencies are counted, and a high-frequency named entity is selected as the name of the class cluster and as a category in the classification catalog. All class clusters are organized according to their hierarchical relationship to form a complete classification catalog.
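One possible realization of this naming step is sketched below using spaCy for named entity recognition; the model name "zh_core_web_sm" is only an assumed pre-trained Chinese pipeline, and any NER component could be substituted.

from collections import Counter

import spacy

# Assumed pre-trained NER pipeline; any spaCy model with an "ner" component works.
nlp = spacy.load("zh_core_web_sm")

def name_cluster(texts, fallback="unnamed"):
    """Pick the most frequent named entity in the cluster's text content
    (field names, comments and sampled values) as the cluster name."""
    counts = Counter()
    for text in texts:
        for ent in nlp(text).ents:
            counts[ent.text] += 1
    return counts.most_common(1)[0][0] if counts else fallback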
Step 5, for each minimum subclass in the classification catalog generated in step 4, a certain proportion of samples is randomly selected from that class as positive samples and a certain proportion of samples is randomly selected from the other classifications as negative samples, and a TextCNN model is trained; the trained model serves as the recognition rule of that classification. Furthermore, for confusable classifications, samples of each are added as negative samples of the other and the models are retrained to improve recognition accuracy.
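The training loop of step 5, with its iterative hard-negative mining, could be sketched as follows. Purely for brevity the TextCNN is replaced by a TF-IDF plus logistic-regression pipeline from scikit-learn, and the number of rounds, the negative-sampling ratio and the stopping threshold are assumed parameters.

import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_leaf_rule(pos_texts, other_texts, rounds=3, neg_ratio=1.0, target_rate=0.05):
    """Train a one-vs-rest recognizer for one leaf classification, repeatedly
    adding mispredicted samples from the other classifications as negatives."""
    k = min(len(other_texts), max(1, int(len(pos_texts) * neg_ratio)))
    negatives = random.sample(other_texts, k)
    model = None
    for _ in range(rounds):
        x = pos_texts + negatives
        y = [1] * len(pos_texts) + [0] * len(negatives)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(x, y)
        # Predict the other classifications; anything predicted as positive is
        # a misprediction and becomes a new negative sample.
        preds = model.predict(other_texts)
        hard = [t for t, p in zip(other_texts, preds) if p == 1 and t not in negatives]
        if len(hard) / max(len(other_texts), 1) <= target_rate:
            break
        negatives += hard
    return model

Mutually confusable classifications can be handled in the same spirit by feeding one class's positive samples into the other's negative pool before retraining.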
Step 6, the classification catalog and the recognition rules generated above are combined into a complete classification and grading template. The template is then applied to a new data source, which can thereby be classified and graded.
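Finally, applying the generated template to a new data source might reduce to running every leaf model over the sampled content of each new field and keeping the best-scoring leaf, as in this sketch (the 0.5 acceptance threshold is an assumption):

def classify_field(field_texts, leaf_models):
    """leaf_models maps a leaf-classification name to its trained recognizer;
    the field is assigned to the leaf whose model gives the highest average
    positive probability, provided it beats the acceptance threshold."""
    best_name, best_score = None, 0.5
    for name, model in leaf_models.items():
        # predict_proba columns follow model.classes_, so column 1 is "positive".
        score = model.predict_proba(field_texts)[:, 1].mean()
        if score > best_score:
            best_name, best_score = name, score
    return best_name  # None means the field matched no leaf classification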
While the fundamental and principal features and advantages of the invention have been shown and described, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing exemplary embodiments and may be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single technical solution; the description is written in this way merely for clarity, and the specification should be taken as a whole, the technical solutions of the individual embodiments being capable of suitable combination to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A method for automatically generating a data classification and grading template, characterized in that the method comprises the following steps:
S1, reading a preset data source;
S2, performing constrained hierarchical clustering;
S3, forming a classification catalog with a multi-layer tree structure;
S4, training a machine learning model as the recognition rule for each leaf classification;
S5, applying the classification and grading template to a target data source.
2. The method for automatically generating a data classification and grading template according to claim 1, wherein step S1 includes: the data source comprises fields or files; when the data source is read, metadata information and content sampling data are acquired from it, providing the data basis for the subsequent generation of the classification and grading template.
3. The method for automatically generating a data classification and grading template according to claim 2, wherein step S2 comprises the following steps:
Step S21, taking each field or file in the preset data source as an initial class cluster, and iteratively merging the two most similar class clusters to form a hierarchical clustering structure, iterating until the total number of class clusters is less than or equal to a preset value N;
Step S22, calculating the similarity between class clusters by combining metadata information and content sampling data, where the content sampling data is a set of concrete content samples extracted from the fields or files; the combined similarity is used to determine the two most similar class clusters to merge;
Step S23, balancing the cluster tree formed by the clustering.
4. The method for automatically generating a data classification and grading template according to claim 3, wherein step S21 further includes: the iteration also stops when the maximum similarity between class clusters is less than or equal to a preset value θ1; and step S22 further includes: the metadata information comprises the attributes, types and association relations of the fields or files.
5. The method for automatically generating a data classification and grading template according to claim 4, wherein a constraint principle must be followed during the clustering: once a class cluster contains more than M minimum subclasses, it may no longer participate in merging.
6. The method for automatically generating a data classification and grading template according to claim 5, wherein step S3 comprises the following steps:
Step S31, pruning the cluster tree formed during the clustering to obtain a classification catalog with a multi-layer tree structure;
Step S32, for each class cluster in the pruned cluster tree, performing semantic analysis and feature extraction on the text content of its subclasses using natural language processing to obtain keyword, topic or concept information, while also considering the metadata of the fields or files; the common information shared by the subclasses is screened out, and a descriptive and accurate parent class name is generated from it to identify and organize the class cluster, so that a classification catalog with a clear and easily understood hierarchy is formed.
7. The method for automatically generating a data classification and grading template according to claim 6, wherein step S31 further includes a pruning process: the class clusters in the cluster tree are traversed from bottom to top; if the similarity between the minimum subclasses in a class cluster is greater than θ2, the minimum subclasses are tiled, that is, the intermediate class clusters formed by merging them are removed and the minimum subclasses become direct subclasses of that cluster.
8. The method for automatically generating a data classification and grading template according to claim 7, wherein step S4 includes:
Step S41, for each leaf classification in the classification catalog, training a machine learning model using the metadata information and content data it contains as the recognition rule of that subclass, so that the model can judge from input data which subclass new data belongs to;
Step S42, when training the model for a leaf classification, randomly selecting a certain proportion of samples from that class as positive samples and a certain proportion of samples from the other leaf classifications as negative samples, so that the model learns the characteristics and rules of the leaf classification and correctly distinguishes it from data of the other leaf classifications;
Step S43, after training, using the trained model to predict samples of the other classifications, marking the mispredicted samples as negative samples and retraining, and repeating this until the misprediction rate on the other classifications reaches a satisfactory level or a preset stop condition is met; by repeatedly introducing mispredicted samples, confusable or newly added classifications are trained iteratively.
9. The method for automatically generating a data classification and grading template according to claim 8, wherein step S5 includes: combining the classification catalog and the recognition rules generated by the above process into a complete classification and grading template, and applying the template to a new data source to classify and grade it.
10. The method for automatically generating a data classification and grading template according to claim 3, wherein the fields or files in step S21 are, for example: fields in database tables, and text documents and spreadsheets in a file system.
CN202311712760.0A 2023-12-13 2023-12-13 Method for automatically generating data classification grading template Pending CN117851860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311712760.0A CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311712760.0A CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Publications (1)

Publication Number Publication Date
CN117851860A true CN117851860A (en) 2024-04-09

Family

ID=90537296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311712760.0A Pending CN117851860A (en) 2023-12-13 2023-12-13 Method for automatically generating data classification grading template

Country Status (1)

Country Link
CN (1) CN117851860A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118276793A (en) * 2024-06-04 2024-07-02 江苏达海智能系统股份有限公司 Method and system for collecting facility heterogeneous data for building intellectualization
CN118276793B (en) * 2024-06-04 2024-08-13 江苏达海智能系统股份有限公司 Method and system for collecting facility heterogeneous data for building intellectualization

Similar Documents

Publication Publication Date Title
CN110413780B (en) Text emotion analysis method and electronic equipment
CN105677873B (en) Text Intelligence association cluster based on model of the domain knowledge collects processing method
CN110162591B (en) Entity alignment method and system for digital education resources
CN112182148B (en) Standard aided writing method based on full text retrieval
WO2022081812A1 (en) Artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
CN117851860A (en) Method for automatically generating data classification grading template
CN115794803B (en) Engineering audit problem monitoring method and system based on big data AI technology
CN113254507A (en) Intelligent construction and inventory method for data asset directory
CN113742396A (en) Mining method and device for object learning behavior pattern
CN117473431A (en) Airport data classification and classification method and system based on knowledge graph
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN118035440A (en) Enterprise associated archive management target knowledge feature recommendation method
CN112286799B (en) Software defect positioning method combining sentence embedding and particle swarm optimization algorithm
CN117573876A (en) Service data classification and classification method and device
CN117111890A (en) Software requirement document analysis method, device and medium
CN113254583B (en) Document marking method, device and medium based on semantic vector
CN111274404B (en) Small sample entity multi-field classification method based on man-machine cooperation
CN114741512A (en) Automatic text classification method and system
Jingliang et al. A data-driven approach based on LDA for identifying duplicate bug report
CN117648635B (en) Sensitive information classification and classification method and system and electronic equipment
CN117544831B (en) Automatic decomposing method and system for classroom teaching links
CN115858738B (en) Enterprise public opinion information similarity identification method
CN117251605B (en) Multi-source data query method and system based on deep learning
CN118377771B (en) Data modeling method and system based on graph data structure
Timuçin et al. Initial seed value effectiveness on performances of data mining algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination