CN118349879A - Data classification grading method based on similarity algorithm - Google Patents

Info

Publication number
CN118349879A
Authority
CN
China
Prior art keywords
classification
feature
similarity
corpus
feature item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410778344.9A
Other languages
Chinese (zh)
Inventor
糜靖峰
潘兵
于姝
顾欢欢
时昀
卢光青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SINOVATIO TECHNOLOGY CO LTD
Original Assignee
NANJING SINOVATIO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-06-17
Filing date
2024-06-17
Publication date
2024-07-16
Application filed by NANJING SINOVATIO TECHNOLOGY CO LTD
Priority to CN202410778344.9A
Publication of CN118349879A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data classification and grading method based on a similarity algorithm, in which data assets are classified and graded by means of the similarity algorithm. A feature item library and a similarity algorithm module are introduced. Massive metadata is read through a data acquisition module, and customized data cleaning and standardized preprocessing are applied to the field names and field description contents of the metadata to obtain a standard feature item library. The similarity algorithm module then calculates the similarity between each feature item and the rule corpus to obtain the classification and grading result of the feature item, and thereby the classification and grading result of the data corresponding to that feature item. Users can then formulate security policies of different levels for data of different levels. By classifying and grading metadata sets rapidly and automatically through the similarity calculation module, the technical scheme reduces labor cost and improves the efficiency and accuracy of data classification and grading.

Description

Data classification grading method based on similarity algorithm
Technical Field
The invention relates to a data classification and grading method based on a similarity algorithm, namely a method for classifying and grading the metadata of a known industry based on a similarity algorithm, and belongs to the technical field of data security.
Background
In recent years, with the rapid development of the digital economy, databases have come to be widely used in various fields, and data has become one of the important assets of institutions and enterprises. The collection, processing, storage, analysis and management of massive data face many security threats, so building data security is particularly important. Automatically classifying and grading data assets makes it possible to identify core data assets, important data assets and general data assets, to grasp the current state of the data assets, and to build a targeted security protection system, thereby ensuring the accuracy and integrity of the data.
Traditional classification and grading work relies mainly on manual effort and suffers from low efficiency, strong subjectivity and similar drawbacks. How to classify and grade the metadata in a database automatically, accurately and quickly is a challenge that remains to be solved. A new solution to the above technical problems is therefore urgently needed.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a data classification and grading method based on a similarity algorithm. After data preprocessing is performed on the massive metadata of a known industry, the metadata set is classified and graded rapidly and automatically through a similarity calculation module, which reduces labor cost and improves the efficiency and accuracy of data classification and grading.
In order to achieve the above object, the invention provides a data classification and grading method based on a similarity algorithm, the method comprising the following modules.
The data acquisition module acquires the feature content of massive metadata, including field names and field description contents:
When massive metadata is connected to the system, the field names and field description contents of the metadata are automatically extracted.
The feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
First, customized data cleaning is performed on the field names and field description contents, including unifying letter case, unifying the punctuation format, deleting special symbols and contents, deleting enumeration values and the like, to obtain a cleaned pre-feature item library and alarm name information; second, the pre-feature item library is standardized, including deleting pre-feature items that are single characters, numbers, blank characters and the like, to obtain the standard feature item library.
Then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item, so that a user can choose to match metadata either through the feature item or through the matching content, thereby identifying the type of the metadata.
The similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and is larger than a preset threshold value, and takes the classification and grading result of that corpus entry as the result of the feature item:
The classification and grading rules of the known industry are read dynamically and the corpus for similarity calculation is extracted; the corpus is segmented into words and converted into bag-of-words vectors; a TF-IDF model is created and the bag-of-words vectors are converted into TF-IDF vectors; finally, a similarity matrix of the corpus is created.
The feature item text is then looked up and converted into a TF-IDF vector, the similarity between the feature item text and the texts of the rule corpus is calculated, and it is judged whether the highest similarity value is larger than the preset threshold value; if so, the classification and grading result of the corresponding corpus entry is read as the classification and grading result of the feature item.
The method comprises the following steps:
Step 101, massive metadata is connected to the system, and the field names and field description values are extracted through the data acquisition module;
Step 102, customized data cleaning is performed on the field names and field descriptions to obtain a pre-feature item library;
Step 103, the pre-feature item library is standardized to obtain a standard feature item library;
Step 104, the number of occurrences of each feature item in the standard feature item library is counted, the field content of each feature item is merged, and the feature items are de-duplicated;
Step 105, the similarity between each feature item and the rule corpus is calculated, and the highest similarity value and the corresponding corpus entry are output;
Step 106, it is judged whether the similarity value is larger than a preset threshold value; if so, step 107 is executed, otherwise step 108 is executed;
Step 107, the classification and grading result of the corpus entry corresponding to the highest similarity value is read as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the matching content of the feature item (namely the data);
Step 108, manual intervention is performed to classify and grade the feature item by hand, thereby obtaining the classification and grading result of the matching content of the feature item (namely the data).
Compared with the prior art, the invention has the following advantages:
1. The technical scheme introduces the concept of a feature item, derived mainly from the field description content of the metadata, to identify which type of data the metadata belongs to. After massive metadata is read into the system through the data acquisition module, the system performs customized data cleaning and preprocessing on the field description contents and extracts a standard feature item library, which is the basis for classifying and grading the metadata. The data cleaning work can be flexibly defined according to different requirements, which enhances user interactivity and improves the user experience.
2. The scheme introduces a similarity algorithm to calculate the similarity between each feature item and the classification and grading rule corpus of the known industry. The similarity between the feature item and each entry of the rule corpus is calculated one by one to obtain the highest similarity value, and when the highest similarity value exceeds a preset threshold value, the classification and grading result of the feature item is obtained automatically. The user can customize the similarity threshold as required: the larger the threshold, the higher the accuracy of the classification and grading results, which greatly reduces the manual auditing cost; the smaller the threshold, the higher the coverage of the classification and grading results, but the lower their accuracy.
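The coverage side of the threshold trade-off described above can be illustrated with a minimal Python sketch. The function name `coverage` and the scores are hypothetical and purely illustrative; they are not taken from the patent, and measuring accuracy would additionally require manually verified labels.

```python
def coverage(best_scores, threshold):
    """Fraction of feature items whose best similarity exceeds the threshold,
    i.e. the share that would be graded automatically."""
    if not best_scores:
        return 0.0
    return sum(score > threshold for score in best_scores) / len(best_scores)


# Made-up best-match similarity values for four feature items (illustrative only).
scores = [0.95, 0.85, 0.60, 0.40]

for threshold in (0.5, 0.7, 0.9):
    # A higher threshold sends more items to manual review (lower coverage).
    print(threshold, coverage(scores, threshold))
```

Raising the threshold in this sketch lowers the automatically graded share, which matches the direction of the trade-off stated in the text.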
Drawings
Fig. 1 is a process flow diagram of the data classification and grading method based on a similarity algorithm of the present invention.
Detailed Description
In order to enhance the understanding of the present invention, the present embodiment will be described in detail with reference to the accompanying drawings.
Example 1: referring to Fig. 1, a data classification and grading method based on a similarity algorithm comprises the following modules.
The data acquisition module is mainly responsible for acquiring the feature content of massive metadata, including field names, field descriptions, library names, table descriptions and sampled field content values: when massive metadata is connected to the system, all relevant information of the metadata, including the field names and field description contents, is automatically extracted.
The feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
First, customized data cleaning is performed on the field names and field description contents. Field name preprocessing mainly comprises letter-case conversion, deleting purely numeric names, deleting single-letter names and deleting enumeration numbers. Field description preprocessing mainly comprises unifying punctuation marks, deleting the content that follows designated punctuation marks, deleting enumeration numbers, deleting designated symbols, deleting non-Chinese values and the like. This yields a cleaned pre-feature item library and alarm name information. Second, the pre-feature item library is standardized, including deleting single characters, numbers, blank characters and the like, to obtain the standard feature item library.
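The cleaning rules listed above could be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the source says the cleaning is customizable but gives no concrete rules, so the function names, the punctuation map, the choice of the colon as the "designated punctuation mark" and the trailing-digit convention for enumeration numbers are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical punctuation map: full-width marks unified to half-width ones.
PUNCT_MAP = str.maketrans({",": ",", ";": ";", ":": ":", "(": "(", ")": ")"})


def clean_field_name(name: str) -> Optional[str]:
    """Lowercase a field name; drop purely numeric and single-letter names."""
    name = name.strip().lower()
    # Assumed convention: a trailing enumeration number, e.g. "phone_1" -> "phone".
    name = re.sub(r"[_\-]?\d+$", "", name)
    if not name or name.isdigit() or len(name) == 1:
        return None
    return name


def clean_field_description(desc: str, cut_after: str = ":") -> Optional[str]:
    """Clean a field description into a pre-feature item, or None if nothing usable remains."""
    desc = desc.strip().translate(PUNCT_MAP)   # unify punctuation marks
    desc = desc.split(cut_after, 1)[0]         # delete content after the designated mark
    # Keep only CJK characters; this also removes enumeration numbers,
    # designated symbols and non-Chinese values in one pass.
    desc = "".join(re.findall(r"[\u4e00-\u9fff]+", desc))
    # Standardization: single characters and blanks are not kept as feature items.
    return desc if len(desc) > 1 else None


print(clean_field_name(" User_Name "))
print(clean_field_description("客户姓名:1-男,2-女"))
```

Returning `None` for rejected values keeps the deletion rules explicit, so a caller can simply filter them out when building the pre-feature item library.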
A mapping between source words and target words is defined in a synonym library, where the target words are unified with the feature item words used in the classification and grading rules of the known industry; synonym replacement is then performed on the feature items of the standard feature item library.
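A minimal Python sketch of the synonym replacement step, assuming whole-item replacement through a dictionary lookup. The synonym entries and the function name `replace_synonyms` are hypothetical examples, not taken from the patent.

```python
# Hypothetical synonym library: source word -> target word used by the rule corpus.
SYNONYMS = {
    "手机号": "手机号码",
    "移动电话": "手机号码",
    "名字": "姓名",
}


def replace_synonyms(feature_items, synonyms):
    """Replace each feature item that appears as a source word by its target word."""
    return [synonyms.get(item, item) for item in feature_items]


print(replace_synonyms(["手机号", "姓名", "名字"], SYNONYMS))
```

Items without an entry in the synonym library pass through unchanged, so the step is safe to apply to the whole standard feature item library.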
Then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item, so that a user can choose to match metadata either through the feature item or through the matching content, thereby identifying the type of the metadata.
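The counting and merging step might look like the following Python sketch. The input shape (pairs of field name and feature item) and the output layout are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_feature_items(records):
    """Count each feature item and merge the field names that share it.

    `records` is an iterable of (field_name, feature_item) pairs (assumed shape).
    Returns {feature_item: {"count": int, "field_names": [unique names in first-seen order]}}.
    """
    merged = defaultdict(lambda: {"count": 0, "field_names": []})
    for field_name, item in records:
        entry = merged[item]
        entry["count"] += 1
        if field_name not in entry["field_names"]:
            entry["field_names"].append(field_name)
    return dict(merged)


records = [
    ("mobile", "手机号码"),
    ("phone", "手机号码"),
    ("cust_name", "姓名"),
    ("mobile", "手机号码"),
]
print(merge_feature_items(records))
```

Each key of the result is a de-duplicated feature item, and its `field_names` list plays the role of the matching content described in the text.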
The similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and is larger than a preset threshold value, and takes the classification and grading result of that corpus entry as the result of the feature item:
The classification and grading rules of the known industry are read dynamically, and the feature item names in the rules are extracted as the text set for similarity calculation;
The text set is preprocessed: each feature item is segmented into words, stop words are deleted and so on, yielding a two-dimensional list (a list of token lists) for the text set;
Bag-of-words vectors are constructed, a TF-IDF model is created, and the corpus is passed into the model for training;
The bag-of-words vectors are converted into a new vector corpus through the trained TF-IDF model, and a cosine similarity index is constructed;
The feature item text is then looked up and converted into a TF-IDF vector, the cosine similarity between the feature item text and the rule corpus texts is calculated through the cosine similarity index, and it is judged whether the highest similarity value is larger than the preset threshold value; if so, the corresponding corpus feature item is read as the result.
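The TF-IDF and cosine similarity steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the patent's implementation: the class name `TfidfIndex` is hypothetical, character bigrams stand in for a real Chinese word segmenter, stop-word removal is omitted, and the smoothed IDF formula is one common variant chosen here for the sketch.

```python
import math
from collections import Counter


def tokenize(text):
    """Stand-in for a real word segmenter: overlapping character bigrams."""
    text = text.strip()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class TfidfIndex:
    """TF-IDF vectors of a rule corpus plus cosine-similarity lookup."""

    def __init__(self, corpus):
        self.corpus = list(corpus)
        bows = [Counter(tokenize(t)) for t in self.corpus]   # bag-of-words vectors
        n = len(bows)
        df = Counter(tok for bow in bows for tok in bow)     # document frequency
        # Smoothed IDF so that a term present in every entry keeps a non-zero weight.
        self.idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
        self.vectors = [self._vectorize(bow) for bow in bows]

    def _vectorize(self, bow):
        # Tokens unseen in the corpus are ignored; the vector is L2-normalized.
        vec = {t: tf * self.idf[t] for t, tf in bow.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def query(self, text):
        """Return (best corpus entry, cosine similarity) for a feature item text."""
        if not self.corpus:
            return None, 0.0
        q = self._vectorize(Counter(tokenize(text)))
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in self.vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        return self.corpus[best], sims[best]


index = TfidfIndex(["手机号码", "身份证号码", "家庭住址"])
print(index.query("客户手机号"))
```

Because all vectors are normalized when the index is built, each query reduces to sparse dot products, which is the role the cosine similarity index plays in the description.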
The classification and grading result mapping module maps the corpus feature items with the highest similarity, one by one, to the classification and grading results in the database, thereby obtaining the results of the standard feature items.
The method comprises the following steps:
Step 101, massive metadata is connected to the system, and the field names and field description values are extracted through the data acquisition module;
Step 102, customized data cleaning is performed on the field names and field descriptions to obtain a pre-feature item library;
Step 103, the pre-feature item library is standardized to obtain a standard feature item library;
Step 104, the number of occurrences of each feature item in the standard feature item library is counted, the field content of each feature item is merged, and the feature items are de-duplicated;
Step 105, the similarity between each feature item and the rule corpus is calculated, and the highest similarity value and the corresponding corpus entry are output;
Step 106, it is judged whether the similarity value is larger than a preset threshold value; if so, step 107 is executed, otherwise step 108 is executed;
Step 107, the classification and grading result of the corpus entry corresponding to the highest similarity value is read as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the matching content of the feature item (namely the data);
Step 108, manual intervention is performed to classify and grade the feature item by hand, thereby obtaining the classification and grading result of the matching content of the feature item (namely the data).
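Steps 105 to 108 amount to a threshold-gated lookup with a manual fallback, which can be sketched in Python as follows. The function name `grade_feature_item`, the `best_match` callback, the rule table layout and the default threshold of 0.8 are all hypothetical; the patent does not specify concrete values or interfaces.

```python
from typing import Callable, Dict, Optional, Tuple


def grade_feature_item(
    item: str,
    best_match: Callable[[str], Tuple[Optional[str], float]],
    rules: Dict[str, dict],
    threshold: float = 0.8,
) -> dict:
    """Route one feature item to automatic grading or to manual review."""
    corpus_item, score = best_match(item)                          # step 105
    grading = rules.get(corpus_item) if score > threshold else None  # step 106
    if grading is not None:                                        # step 107
        return {"item": item, "result": grading, "source": "auto", "score": score}
    # Step 108: no sufficiently similar rule entry, so a person decides.
    return {"item": item, "result": None, "source": "manual_review", "score": score}


# Hypothetical rule table mapping a corpus feature item to its category and level.
rules = {"手机号码": {"category": "personal contact information", "level": 3}}

# A stub matcher stands in for the similarity algorithm module.
print(grade_feature_item("手机号", lambda _: ("手机号码", 0.9), rules))
print(grade_feature_item("备注", lambda _: ("手机号码", 0.3), rules))
```

The comparison is strict, following the "larger than a preset threshold value" wording, so a score exactly equal to the threshold goes to manual review in this sketch.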
It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims (4)

1. A data classification and grading method based on a similarity algorithm, characterized by comprising:
a data acquisition module, which acquires the feature content of massive metadata, including field names and field description contents, and automatically extracts the field names and field description contents of the metadata after the massive metadata is connected to the system;
a feature item module, which performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library;
a similarity algorithm module, which calculates, one by one, the similarity between the standard feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus feature item whose similarity is the highest and is larger than a preset threshold value, and takes that corpus feature item as the matching result of the standard feature item; and
a classification and grading result mapping module, which maps the corpus feature items with the highest similarity, one by one, to the classification and grading results in the database, thereby obtaining the results of the standard feature items.
2. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:
first, customized data cleaning is performed on the field names and field description contents, including unifying letter case, unifying the punctuation format, deleting symbols and contents, and deleting enumeration values, to obtain a cleaned pre-feature item library and alarm name information; second, the pre-feature item library is standardized, including deleting pre-feature items that are single characters, numbers, blank characters and the like, to obtain the standard feature item library;
then, the number of occurrences of each feature item name in the standard feature item library is counted, and the field names that share the same feature item are merged and used as the matching content corresponding to that feature item, so that the user chooses to match metadata either through the feature item or through the matching content, thereby identifying the type of the metadata.
3. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus entry whose similarity is the highest and is larger than a preset threshold value, and takes its classification and grading result as the result of the feature item, specifically as follows:
the classification and grading rules of the known industry are read dynamically and the corpus for similarity calculation is extracted; the corpus is segmented into words and converted into bag-of-words vectors; a TF-IDF model is created and the bag-of-words vectors are converted into TF-IDF vectors; finally, a similarity matrix of the corpus is created;
the feature item text is then looked up and converted into a TF-IDF vector, the similarity between the feature item text and the texts of the rule corpus is calculated, and it is judged whether the highest similarity value is larger than the preset threshold value; if so, the classification and grading result of the corresponding corpus entry is read as the classification and grading result of the feature item.
4. The data classification and grading method based on a similarity algorithm according to claim 2, characterized in that the method comprises the following steps:
Step 101, massive metadata is connected to the system, and the field names and field description values are extracted through the data acquisition module;
Step 102, customized data cleaning is performed on the field names and field descriptions to obtain a pre-feature item library;
Step 103, the pre-feature item library is standardized to obtain a standard feature item library;
Step 104, the number of occurrences of each feature item in the standard feature item library is counted, the field content of each feature item is merged, and the feature items are de-duplicated;
Step 105, the similarity between each feature item and the rule corpus is calculated, and the highest similarity value and the corresponding corpus entry are output;
Step 106, it is judged whether the similarity value is larger than a preset threshold value; if so, step 107 is executed, otherwise step 108 is executed;
Step 107, the classification and grading result of the corpus entry corresponding to the highest similarity value is read as the classification and grading result of the feature item, thereby obtaining the classification and grading result of the matching content of the feature item, namely the data;
Step 108, the feature item is classified and graded manually, thereby obtaining the classification and grading result of the matching content of the feature item, namely the data.
CN202410778344.9A 2024-06-17 2024-06-17 Data classification grading method based on similarity algorithm Pending CN118349879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410778344.9A CN118349879A (en) 2024-06-17 2024-06-17 Data classification grading method based on similarity algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410778344.9A CN118349879A (en) 2024-06-17 2024-06-17 Data classification grading method based on similarity algorithm

Publications (1)

Publication Number Publication Date
CN118349879A (en) 2024-07-16

Family

ID=91819792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410778344.9A Pending CN118349879A (en) 2024-06-17 2024-06-17 Data classification grading method based on similarity algorithm

Country Status (1)

Country Link
CN (1) CN118349879A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628211A (en) * 2023-07-25 2023-08-22 中国电信股份有限公司 Data classification method and device, storage medium and electronic equipment
CN117236334A (en) * 2023-10-18 2023-12-15 贵州电网有限责任公司 Hierarchical processing method for project data security information
CN117454220A (en) * 2023-10-24 2024-01-26 中国联合网络通信集团有限公司 Data hierarchical classification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109582861B (en) Data privacy information detection system
CN110826320B (en) Sensitive data discovery method and system based on text recognition
CN110741376B (en) Automatic document analysis for different natural languages
CN112000773B (en) Search engine technology-based data association relation mining method and application
CN111967761A (en) Monitoring and early warning method and device based on knowledge graph and electronic equipment
WO2018160551A1 (en) Automatic human-emulative document analysis enhancements
CN111899090A (en) Enterprise associated risk early warning method and system
CN103034656B (en) Chapters and sections content layered approach and device, article content layered approach and device
CN114722137A (en) Security policy configuration method and device based on sensitive data identification and electronic equipment
CN111680506A (en) External key mapping method and device of database table, electronic equipment and storage medium
CN118116611B (en) Database construction method based on multi-source medical and nutritional big data fusion integration
CN114969467A (en) Data analysis and classification method and device, computer equipment and storage medium
CN114707003B (en) Method, equipment and storage medium for disambiguating names of paper authors
CN116469500A (en) Data quality control method and system based on post-structuring of medical document
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN113591476A (en) Data label recommendation method based on machine learning
Petrus Soft and hard clustering for abstract scientific paper in Indonesian
CN103034657B (en) Documentation summary generates method and apparatus
CN116680422A (en) Multi-mode question bank resource duplicate checking method, system, device and storage medium
CN109918638B (en) Network data monitoring method
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN118349879A (en) Data classification grading method based on similarity algorithm
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium
CN117827991B (en) Method and system for identifying personal identification information in semi-structured data
CN118227684B (en) Data behavior tracing method based on multidimensional discrete data fingerprint

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination