CN118349879A - Data classification grading method based on similarity algorithm - Google Patents
- Publication number
- CN118349879A
- Authority
- CN
- China
- Prior art keywords
- classification
- feature
- similarity
- corpus
- feature item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a data classification and grading method based on a similarity algorithm. The method classifies and grades data assets using a similarity algorithm and introduces a feature item library and a similarity algorithm module. A data acquisition module reads massive metadata; customized data cleaning and standardized preprocessing are then applied to the field names and field description contents of the metadata to obtain a standard feature item library. The similarity algorithm module then computes the similarity between each feature item and the rule corpus to obtain the classification and grading result of the feature item, and from it the classification and grading result of the data corresponding to that feature item. Users can then formulate security policies of different levels for data of different levels. By classifying and grading metadata sets quickly and automatically through the similarity calculation module, the technical scheme reduces labor cost and improves the efficiency and accuracy of data classification and grading.
Description
Technical Field
The invention relates to a data classification and grading method based on a similarity algorithm, specifically a method for classifying and grading metadata of a known industry based on a similarity algorithm, and belongs to the technical field of data security.
Background
In recent years, with the rapid development of the digital economy, databases have come to be widely used in many fields, and data has become one of the important assets of institutions and enterprises. The collection, processing, storage, analysis and management of massive data face many security threats, so building data security is particularly important. Automatically classifying and grading data assets makes it possible to identify core, important and general data assets, understand the current state of the data assets, and build a targeted security protection system, thereby ensuring the accuracy and integrity of the data.
Traditional classification and grading work relies mainly on manual effort and suffers from low efficiency and strong subjectivity. How to classify and grade the metadata in a database automatically, accurately and quickly is a challenge that remains to be solved. A new solution to these technical problems is therefore urgently needed.
Disclosure of Invention
To address the problems in the prior art, the invention provides a data classification and grading method based on a similarity algorithm. After preprocessing massive metadata of a known industry, the technical scheme classifies and grades the metadata set quickly and automatically through a similarity calculation module, which reduces labor cost and improves the efficiency and accuracy of data classification and grading.
In order to achieve the above object, the present invention provides a data classification and grading method based on a similarity algorithm, the method comprising:
A data acquisition module, which acquires the feature content of massive metadata, including field names and field description contents:
When massive metadata is connected to the system, the field names and field description contents of the metadata are automatically extracted.
A feature item module, which performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
First, customized data cleaning is applied to the field names and field description contents, including unifying letter case, unifying punctuation format, deleting special symbols and contents, and deleting enumeration values, to obtain a cleaned pre-feature item library and alarm name information. Second, the pre-feature item library is standardized, including deleting pre-feature items that are single characters, numbers, or blank characters, to obtain the standard feature item library.
Then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item. The user can then choose to match metadata either by feature item or by matching content, thereby identifying the type of the metadata.
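As an illustration only, the cleaning and standardization described above might be sketched as follows. The punctuation map, the enumeration pattern (a number, a separator, then a word, as in "1-Male") and the function names are assumptions made for this sketch; the patent leaves the concrete cleaning rules to user customization.

```python
import re

# Hypothetical map from full-width punctuation to ASCII equivalents.
PUNCT_MAP = str.maketrans({",": ",", ";": ";", ":": ":", "(": "(", ")": ")"})

def clean_description(text: str) -> str:
    """Customized cleaning: unify case and punctuation, drop enumerations and symbols."""
    text = text.strip().lower().translate(PUNCT_MAP)
    # Assumed enumeration format: digits, a separator, then a word (e.g. "1-male").
    text = re.sub(r"\b\d+\s*[-:=]\s*\w+", " ", text)
    # Delete special symbols: anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", text).strip()

def standardize(pre_feature_items):
    """Standardization: drop blank, single-character, and purely numeric items."""
    return [item for item in pre_feature_items
            if len(item) > 1 and not item.isdigit()]

if __name__ == "__main__":
    raw = ["Gender(1-Male,2-Female)", "a", "123", "  ", "ID Card No."]
    pre_items = [clean_description(r) for r in raw]
    print(standardize(pre_items))
```

Keeping cleaning and standardization as two separate functions mirrors the two-stage description (pre-feature item library first, standard feature item library second) and lets each rule set be customized independently.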
A similarity algorithm module, which computes the similarity between each feature item output by the feature item module and each entry of the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and exceeds a preset threshold, and takes the classification and grading result of that corpus entry as the result of the feature item:
The classification and grading rules of the known industry are read dynamically and the corpus for similarity calculation is extracted; the corpus is segmented into words and converted into bag-of-words vectors; a TF-IDF model is created and the bag-of-words vectors are converted into TF-IDF vectors; finally, a similarity matrix of the corpus is created.
The feature item text is then looked up and converted into a TF-IDF vector, and its similarity to each text of the rule corpus is calculated. If the highest similarity value is greater than the preset threshold, the classification and grading result of the corresponding corpus entry is read as the classification and grading result of the feature item.
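The TF-IDF matching described above can be illustrated with a small standard-library sketch. It is not the patent's implementation: whitespace tokenization stands in for word segmentation (Chinese text would need a real segmenter), the smoothed IDF formula is one common choice among several, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: whitespace splitting stands in for word segmentation.
    return text.lower().split()

def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus_tokens)
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF vector as a sparse dict; unknown terms are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are already normalized, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def best_match(feature, corpus, threshold):
    """Return (corpus entry or None, highest similarity) for a non-empty corpus."""
    docs = [tokenize(entry) for entry in corpus]
    idf = build_idf(docs)
    doc_vecs = [tfidf_vector(d, idf) for d in docs]
    query = tfidf_vector(tokenize(feature), idf)
    scores = [cosine(query, d) for d in doc_vecs]
    best = max(range(len(corpus)), key=scores.__getitem__)
    if scores[best] > threshold:
        return corpus[best], scores[best]
    return None, scores[best]

if __name__ == "__main__":
    rule_corpus = ["customer name", "customer phone number", "account balance"]
    print(best_match("phone number", rule_corpus, threshold=0.5))
```

A feature item that shares no terms with the corpus produces an empty vector and a score of 0, so it falls below any positive threshold and is left for manual handling.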
The method comprises the following steps:
Step 101, connecting massive metadata to the system, and extracting field names and field description values through the data acquisition module;
Step 102, performing customized data cleaning on the field names and field descriptions to obtain a pre-feature item library;
Step 103, standardizing the pre-feature item library to obtain a standard feature item library;
Step 104, counting the occurrences of each feature item in the standard feature item library, merging the field content of each feature item, and de-duplicating the feature items;
Step 105, calculating the similarity between each feature item and the rule corpus, and outputting the highest similarity value and the corresponding corpus entry;
Step 106, judging whether the similarity value is greater than the preset threshold; if so, executing step 107, otherwise executing step 108;
Step 107, reading the classification and grading result of the corpus entry corresponding to the highest similarity value as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the feature item's matching content (namely the data);
Step 108, through manual intervention, classifying and grading the feature item by hand, thereby obtaining the classification and grading result of the feature item's matching content (namely the data).
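The branch in steps 106 to 108 can be sketched as a simple routing function. The tuple layout, the `rule_results` dictionary and the category and level labels are hypothetical; the similarity scores are assumed to have been computed already in step 105.

```python
def route(matches, rule_results, threshold):
    """Split feature items into automatically graded ones and a manual review queue.

    matches: iterable of (feature_item, best_corpus_entry, highest_similarity).
    rule_results: maps a corpus entry to its classification and grading result.
    """
    auto_graded, manual_queue = {}, []
    for feature_item, corpus_entry, score in matches:
        if score > threshold:
            # Step 107: inherit the result of the best-matching corpus entry.
            auto_graded[feature_item] = rule_results[corpus_entry]
        else:
            # Step 108: leave the feature item for manual classification and grading.
            manual_queue.append(feature_item)
    return auto_graded, manual_queue

if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    rules = {
        "customer phone number": ("personal information", "level 3"),
        "account balance": ("financial information", "level 4"),
    }
    matches = [
        ("phone number", "customer phone number", 0.88),
        ("remark", "account balance", 0.12),
    ]
    print(route(matches, rules, threshold=0.5))
```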
Compared with the prior art, the method has the following advantages. 1. The technical scheme introduces the concept of a feature item, which identifies the type of data a piece of metadata belongs to and is derived mainly from the field description content of the metadata. After massive metadata is read into the system through the data acquisition module, the system performs customized data cleaning and preprocessing on the field description content and extracts a standard feature item library, which is the basis for classifying and grading the metadata. The data cleaning can be flexibly defined for different requirements, which enhances user interactivity and improves the user experience. 2. The scheme introduces a similarity algorithm to calculate the similarity between each feature item and the classification and grading rule corpus of the known industry. The similarity between the feature item and each entry of the rule corpus is calculated one by one to obtain the highest similarity value; when this value exceeds the preset threshold, the classification and grading result of the feature item is obtained automatically. The user can customize the similarity threshold as required: the larger the threshold, the higher the accuracy of the classification and grading results, which greatly reduces the cost of manual review; the smaller the threshold, the higher the coverage of the classification and grading results, but the lower their accuracy.
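The coverage side of this threshold trade-off can be made concrete with a short sketch. The scores below are invented for illustration; measuring accuracy would additionally require manually verified labels, which this sketch does not assume.

```python
def coverage(top_scores, threshold):
    """Fraction of feature items whose highest similarity exceeds the threshold."""
    if not top_scores:
        return 0.0
    return sum(score > threshold for score in top_scores) / len(top_scores)

if __name__ == "__main__":
    # Hypothetical highest-similarity values for five feature items.
    scores = [0.95, 0.82, 0.64, 0.41, 0.10]
    for t in (0.3, 0.6, 0.9):
        print(f"threshold={t:.1f} coverage={coverage(scores, t):.0%}")
```

With these example scores, raising the threshold from 0.3 to 0.9 lowers coverage from 80% to 20%, so more feature items are left for manual handling.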
Drawings
Fig. 1 is a process flow diagram of the data classification and grading method based on a similarity algorithm of the present invention.
Detailed Description
In order to enhance the understanding of the present invention, the present embodiment will be described in detail with reference to the accompanying drawings.
Example 1: referring to Fig. 1, a data classification and grading method based on a similarity algorithm comprises:
A data acquisition module, which is mainly responsible for acquiring the feature content of massive metadata, including field names, field descriptions, library names, table descriptions and sampled field content values. When massive metadata is connected to the system, all relevant information of the metadata, including the field names and field description contents, is automatically extracted.
A feature item module, which performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
First, customized data cleaning is applied to the field names and field description contents. Field name preprocessing mainly comprises case conversion, deletion of purely numeric names, deletion of single-letter names, and deletion of enumeration numbers. Field description preprocessing mainly comprises unifying punctuation marks, deleting the content that follows designated punctuation marks, deleting enumeration numbers, deleting designated symbols, and deleting non-Chinese values. This yields a cleaned pre-feature item library and alarm name information. Second, the pre-feature item library is standardized, including deleting single characters, numbers, and blank characters, to obtain the standard feature item library.
A mapping between source words and target words is defined in a synonym library, where the target words are unified with the feature item terms used in the classification and grading of the known industry; synonym replacement is then applied to the feature items of the standard feature item library.
Then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item. The user can then choose to match metadata either by feature item or by matching content, thereby identifying the type of the metadata.
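A minimal sketch of the synonym replacement, counting and merging just described is given below. The synonym entries, the record layout (field name paired with feature item) and the output shape are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical synonym library: source word -> target word used by the industry rules.
SYNONYMS = {"mobile": "phone number", "tel": "phone number"}

def merge_features(records, synonyms):
    """Replace synonyms, count each feature item, and merge its field names.

    records: iterable of (field_name, feature_item) pairs.
    Returns {feature_item: {"count": occurrences, "fields": distinct field names}}.
    """
    counts = Counter()
    fields = defaultdict(list)
    for field_name, feature_item in records:
        target = synonyms.get(feature_item, feature_item)  # synonym replacement
        counts[target] += 1                                # occurrence statistics
        if field_name not in fields[target]:               # merge without duplicates
            fields[target].append(field_name)
    return {item: {"count": counts[item], "fields": fields[item]} for item in counts}

if __name__ == "__main__":
    records = [
        ("mobile_no", "mobile"),
        ("tel", "tel"),
        ("cust_name", "name"),
        ("mobile_no", "mobile"),
    ]
    print(merge_features(records, SYNONYMS))
```

The merged field names play the role of the matching content, so metadata can later be matched either by the feature item or by one of its field names.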
A similarity algorithm module, which computes the similarity between each feature item output by the feature item module and each entry of the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and exceeds a preset threshold, and takes the classification and grading result of that corpus entry as the result of the feature item:
The classification and grading rules of the known industry are read dynamically, and the feature item names in the rules are extracted as the text set for similarity calculation;
The text set is preprocessed: each feature item is segmented into words and stop words are deleted, yielding a two-dimensional list for the text set;
Bag-of-words vectors are constructed, and a TF-IDF model is created and trained on the corpus;
The bag-of-words vectors are converted into a new vector corpus by the trained TF-IDF model, and a cosine similarity index is constructed;
The feature item text is looked up and converted into a TF-IDF vector, and its cosine similarity to each rule corpus text is calculated through the cosine similarity index. If the highest similarity value is greater than the preset threshold, the corresponding corpus feature item is read as the result.
A classification and grading result mapping module, which maps the corpus feature item with the highest similarity one-to-one to the classification and grading result in the database, thereby obtaining the result for the standard feature item.
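The one-to-one mapping from a matched corpus feature item to its stored result could look like the sketch below, which uses an in-memory SQLite database from the Python standard library. The table name, column names and the example category and level are hypothetical; the patent does not specify a schema or a database product.

```python
import sqlite3

def build_rule_db(rows):
    """Create an in-memory table of (feature_item, category, level) rule results."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rule_result ("
        "feature_item TEXT PRIMARY KEY, category TEXT, level TEXT)"
    )
    conn.executemany("INSERT INTO rule_result VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def map_result(conn, corpus_feature_item):
    """Return (category, level) for a corpus feature item, or None if absent."""
    return conn.execute(
        "SELECT category, level FROM rule_result WHERE feature_item = ?",
        (corpus_feature_item,),
    ).fetchone()

if __name__ == "__main__":
    # Hypothetical rule entry, for illustration only.
    conn = build_rule_db([("phone number", "personal information", "level 3")])
    print(map_result(conn, "phone number"))
    conn.close()
```

The primary key on `feature_item` enforces the one-to-one relationship between a corpus feature item and its classification and grading result.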
The method comprises the following steps:
Step 101, connecting massive metadata to the system, and extracting field names and field description values through the data acquisition module;
Step 102, performing customized data cleaning on the field names and field descriptions to obtain a pre-feature item library;
Step 103, standardizing the pre-feature item library to obtain a standard feature item library;
Step 104, counting the occurrences of each feature item in the standard feature item library, merging the field content of each feature item, and de-duplicating the feature items;
Step 105, calculating the similarity between each feature item and the rule corpus, and outputting the highest similarity value and the corresponding corpus entry;
Step 106, judging whether the similarity value is greater than the preset threshold; if so, executing step 107, otherwise executing step 108;
Step 107, reading the classification and grading result of the corpus entry corresponding to the highest similarity value as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the feature item's matching content (namely the data);
Step 108, through manual intervention, classifying and grading the feature item by hand, thereby obtaining the classification and grading result of the feature item's matching content (namely the data).
It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.
Claims (4)
1. A data classification and grading method based on a similarity algorithm, characterized by comprising:
a data acquisition module, which acquires the feature content of massive metadata, including field names and field description contents, and automatically extracts the field names and field description contents of the metadata after massive metadata is connected to the system;
a feature item module, which performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library;
a similarity algorithm module, which calculates, one by one, the similarity between the standard feature items output by the feature item module and the rule corpus of a known industry, obtains the corpus feature item whose similarity is the highest and exceeds a preset threshold, and takes that corpus feature item as the matching result of the standard feature item; and
a classification and grading result mapping module, which maps the corpus feature item with the highest similarity one-to-one to the classification and grading result in the database, thereby obtaining the result for the standard feature item.
2. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
first, customized data cleaning is applied to the field names and field description contents, including unifying letter case, unifying punctuation format, deleting symbols and contents, and deleting enumeration values, to obtain a cleaned pre-feature item library and alarm name information; second, the pre-feature item library is standardized, including deleting pre-feature items that are single characters, numbers, or blank characters, to obtain the standard feature item library;
then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item, so that the user chooses to match metadata either by feature item or by matching content, thereby identifying the type of the metadata.
3. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and exceeds a preset threshold, and takes the classification and grading result of that corpus entry as the result of the feature item, specifically as follows:
the classification and grading rules of the known industry are read dynamically and the corpus for similarity calculation is extracted; the corpus is segmented into words and converted into bag-of-words vectors; a TF-IDF model is created and the bag-of-words vectors are converted into TF-IDF vectors; finally, a similarity matrix of the corpus is created;
the feature item text is looked up and converted into a TF-IDF vector, and its similarity to each text of the rule corpus is calculated; if the highest similarity value is greater than the preset threshold, the classification and grading result of the corresponding corpus entry is read as the classification and grading result of the feature item.
4. The data classification and grading method based on a similarity algorithm according to claim 2, characterized in that the method comprises the steps of:
Step 101, connecting massive metadata to the system, and extracting field names and field description values through the data acquisition module;
Step 102, performing customized data cleaning on the field names and field descriptions to obtain a pre-feature item library;
Step 103, standardizing the pre-feature item library to obtain a standard feature item library;
Step 104, counting the occurrences of each feature item in the standard feature item library, merging the field content of each feature item, and de-duplicating the feature items;
Step 105, calculating the similarity between each feature item and the rule corpus, and outputting the highest similarity value and the corresponding corpus entry;
Step 106, judging whether the similarity value is greater than the preset threshold; if so, executing step 107, otherwise executing step 108;
Step 107, reading the classification and grading result of the corpus entry corresponding to the highest similarity value as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the feature item's matching content, namely the data;
Step 108, manually classifying and grading the feature item, thereby obtaining the classification and grading result of the feature item's matching content, namely the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410778344.9A CN118349879A (en) | 2024-06-17 | 2024-06-17 | Data classification grading method based on similarity algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410778344.9A CN118349879A (en) | 2024-06-17 | 2024-06-17 | Data classification grading method based on similarity algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118349879A true CN118349879A (en) | 2024-07-16 |
Family
ID=91819792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410778344.9A Pending CN118349879A (en) | 2024-06-17 | 2024-06-17 | Data classification grading method based on similarity algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118349879A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116628211A (en) * | 2023-07-25 | 2023-08-22 | 中国电信股份有限公司 | Data classification method and device, storage medium and electronic equipment |
CN117236334A (en) * | 2023-10-18 | 2023-12-15 | 贵州电网有限责任公司 | Hierarchical processing method for project data security information |
CN117454220A (en) * | 2023-10-24 | 2024-01-26 | 中国联合网络通信集团有限公司 | Data hierarchical classification method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109582861B (en) | Data privacy information detection system | |
CN110826320B (en) | Sensitive data discovery method and system based on text recognition | |
CN110741376B (en) | Automatic document analysis for different natural languages | |
CN112000773B (en) | Search engine technology-based data association relation mining method and application | |
CN111967761A (en) | Monitoring and early warning method and device based on knowledge graph and electronic equipment | |
WO2018160551A1 (en) | Automatic human-emulative document analysis enhancements | |
CN111899090A (en) | Enterprise associated risk early warning method and system | |
CN103034656B (en) | Chapters and sections content layered approach and device, article content layered approach and device | |
CN114722137A (en) | Security policy configuration method and device based on sensitive data identification and electronic equipment | |
CN111680506A (en) | External key mapping method and device of database table, electronic equipment and storage medium | |
CN118116611B (en) | Database construction method based on multi-source medical and nutritional big data fusion integration | |
CN114969467A (en) | Data analysis and classification method and device, computer equipment and storage medium | |
CN114707003B (en) | Method, equipment and storage medium for disambiguating names of paper authors | |
CN116469500A (en) | Data quality control method and system based on post-structuring of medical document | |
CN113505117A (en) | Data quality evaluation method, device, equipment and medium based on data indexes | |
CN113591476A (en) | Data label recommendation method based on machine learning | |
Petrus | Soft and hard clustering for abstract scientific paper in Indonesian | |
CN103034657B (en) | Documentation summary generates method and apparatus | |
CN116680422A (en) | Multi-mode question bank resource duplicate checking method, system, device and storage medium | |
CN109918638B (en) | Network data monitoring method | |
CN111104422A (en) | Training method, device, equipment and storage medium of data recommendation model | |
CN118349879A (en) | Data classification grading method based on similarity algorithm | |
CN114495138A (en) | Intelligent document identification and feature extraction method, device platform and storage medium | |
CN117827991B (en) | Method and system for identifying personal identification information in semi-structured data | |
CN118227684B (en) | Data behavior tracing method based on multidimensional discrete data fingerprint |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||