CN118349879A - Data classification grading method based on similarity algorithm

Info

Publication number: CN118349879A
Application number: CN202410778344.9A
Authority: CN
Inventors: 糜靖峰; 潘兵; 于姝; 顾欢欢; 时昀; 卢光青
Original assignee: NANJING SINOVATIO TECHNOLOGY CO LTD
Current assignee: NANJING SINOVATIO TECHNOLOGY CO LTD
Priority date: 2024-06-17
Filing date: 2024-06-17
Publication date: 2024-07-16

Abstract

The invention relates to a data classification and grading method based on a similarity algorithm, in which data assets are classified and graded by means of a similarity algorithm. A feature item library and a similarity algorithm module are introduced. Massive metadata are read through a data acquisition module, and customized data cleaning and standardized preprocessing are applied to the field names and field description contents of the metadata to obtain a standard feature item library. The similarity algorithm module then calculates the similarity between each feature item and the rule corpus to obtain the classification and grading result of the feature item, and from it the classification and grading result of the data corresponding to that feature item. Users can formulate different security policies for data of different levels. By classifying and grading metadata sets rapidly and automatically through the similarity calculation module, the technical scheme reduces labor cost and improves the efficiency and accuracy of data classification and grading.

Description

Data classification grading method based on similarity algorithm

Technical Field

The invention relates to a data classification and grading method based on a similarity algorithm, that is, a method for classifying and grading the metadata of a known industry by means of a similarity algorithm, and belongs to the technical field of data security.

Background

In recent years, with the rapid development of the digital economy, databases have come into wide use in many fields, and data has become one of the important assets of institutions and enterprises. The collection, processing, storage, analysis and management of massive data face many security threats, so building data security is particularly important. Automatically classifying and grading data assets makes it possible to identify core, important and general data assets, to grasp the current state of the data assets, and to build a targeted security protection system, thereby ensuring the accuracy and integrity of the data.

Traditional classification and grading work relies mainly on manual effort and suffers from low efficiency and strong subjectivity. How to classify and grade the metadata in a database automatically, accurately and quickly is a challenge that remains to be solved. A new solution is therefore urgently needed.

Disclosure of Invention

To address the problems in the prior art, the invention provides a data classification and grading method based on a similarity algorithm. After massive metadata of a known industry have been preprocessed, the technical scheme classifies and grades the metadata set rapidly and automatically through a similarity calculation module, which reduces labor cost and improves the efficiency and accuracy of data classification and grading.

In order to achieve the above object, the invention provides a data classification and grading method based on a similarity algorithm, the method comprising:

The data acquisition module acquires the feature content of massive metadata, including field names and field description contents:

When massive metadata are accessed by the system, the field names and field description contents of the metadata are automatically extracted.

The feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and then performs statistics and merging on that library, specifically as follows:

Cleaning and standardization: first, customized data cleaning is applied to the field names and field description contents, including unifying letter case, unifying the punctuation format, deleting special symbols and contents, deleting enumeration values and the like, to obtain a cleaned pre-feature item library and alarm name information; second, the pre-feature item library is standardized by deleting pre-feature items that are single characters, numbers, blank characters and the like, to obtain the standard feature item library.

Statistics and merging: first, the number of occurrences of each feature item name in the standard feature item library is counted; second, the field names that share the same feature item are merged and used as the matching content of that feature item, so that a user can choose to match metadata either through the feature item or through the matching content and thereby identify the type of the metadata.
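The counting and merging step can be sketched as follows. This is a minimal illustration, assuming the input is a list of (field name, feature item) pairs; the function name `merge_feature_items`, the output layout and the sample records are hypothetical and do not come from the source.

```python
from collections import defaultdict

def merge_feature_items(records):
    """Group (field_name, feature_item) pairs by feature item.

    Returns {feature_item: {"count": int, "field_names": [str, ...]}},
    where field_names is de-duplicated and keeps first-seen order.
    """
    merged = defaultdict(lambda: {"count": 0, "field_names": []})
    for field_name, feature_item in records:
        entry = merged[feature_item]
        entry["count"] += 1  # occurrences of this feature item name
        if field_name not in entry["field_names"]:
            entry["field_names"].append(field_name)  # matching content
    return dict(merged)

# Hypothetical sample records
records = [
    ("cust_name", "customer name"),
    ("customer_nm", "customer name"),
    ("cust_name", "customer name"),
    ("id_no", "id card number"),
]
library = merge_feature_items(records)
```

With this layout, metadata can be looked up either by the feature item (the dictionary key) or by any field name in its matching content.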

The similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and the rule corpus of the known industry, obtains the corpus entry with the highest similarity that also exceeds a preset threshold, and takes the classification and grading result of that corpus entry as the result of the feature item:

The classification and grading rules of the known industry are read dynamically and the corpus for similarity calculation is extracted; the corpus is segmented into words and converted into bag-of-words vectors; a TF-IDF model is created and the bag-of-words vectors are converted into TF-IDF vectors; finally, a similarity matrix of the corpus is created;

The feature item text is then retrieved and converted into a TF-IDF vector; the similarity between the feature item text and each text of the rule corpus is calculated; and it is judged whether the highest similarity value is greater than the preset threshold. If so, the classification and grading result of the corresponding corpus entry is read as the classification and grading result of the feature item.
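A minimal, dependency-free sketch of this bag-of-words, TF-IDF and similarity lookup is given below. It is an illustration under stated assumptions, not the patented implementation: the source does not specify the TF-IDF weighting, so a smoothed IDF is assumed; cosine similarity is used as the measure; the class `TfidfIndex`, the stop-word list and the English sample corpus are hypothetical, and Chinese text would need a proper word segmenter in place of the regular expression.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an"}  # hypothetical stop-word list

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

class TfidfIndex:
    """Holds L2-normalized TF-IDF vectors for a rule corpus."""

    def __init__(self, corpus):
        self.corpus = list(corpus)
        docs = [tokenize(d) for d in self.corpus]
        n = len(docs)
        df = Counter(tok for doc in docs for tok in set(doc))
        # Assumed smoothed IDF: terms found in every entry keep a nonzero weight.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens):
        # Bag of words restricted to the corpus vocabulary, then TF-IDF weights.
        tf = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def best_match(self, text):
        """Return (corpus entry, cosine similarity) for the closest entry."""
        query = self._vectorize(tokenize(text))
        scores = [sum(w * vec.get(t, 0.0) for t, w in query.items())
                  for vec in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.corpus[best], scores[best]

index = TfidfIndex(["customer name", "id card number", "mobile phone number"])
match, score = index.best_match("name of the customer")
```

Because the vectors are normalized when the index is built, each lookup reduces to a sparse dot product per corpus entry. The returned score is then compared with the preset threshold.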

The method comprises the following steps:

Step 101, massive metadata are accessed by the system, and field names and field description values are extracted through the data acquisition module;

Step 102, customized data cleaning is applied to the field names and field descriptions to obtain a pre-feature item library;

Step 103, the pre-feature item library is standardized to obtain a standard feature item library;

Step 104, the number of occurrences of each feature item in the standard feature item library is counted, the field contents of each feature item are merged, and the feature items are de-duplicated;

Step 105, the similarity between each feature item and the rule corpus is calculated, and the highest similarity value and the corresponding corpus entry are output;

Step 106, it is judged whether the similarity value is greater than a preset threshold; if so, step 107 is executed, otherwise step 108 is executed;

Step 107, the classification and grading result of the corpus entry corresponding to the highest similarity value is read as the classification and grading result of the feature item, which in turn yields the classification and grading result of the matching content (namely the data) of the feature item;

Step 108, manual intervention is required: the feature item is classified and graded manually, which yields the classification and grading result of the matching content (namely the data) of the feature item.
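The decision logic of steps 105 to 108 can be sketched as follows. To keep the example short, the standard-library `difflib.SequenceMatcher` ratio stands in for the TF-IDF similarity that the method actually uses. The rule table `RULES`, the function `grade`, the category and level labels and the threshold of 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical rule corpus: corpus entry -> (category, level)
RULES = {
    "customer name": ("personal information", "level 3"),
    "mobile phone number": ("contact information", "level 3"),
    "branch opening hours": ("public information", "level 1"),
}

def similarity(a, b):
    """Stand-in scorer in [0, 1]; the method itself uses TF-IDF similarity."""
    return SequenceMatcher(None, a, b).ratio()

def grade(feature_item, rules=RULES, threshold=0.8):
    # Step 105: score against every corpus entry and keep the best one.
    best_entry = max(rules, key=lambda entry: similarity(feature_item, entry))
    best_score = similarity(feature_item, best_entry)
    # Step 106: compare the highest similarity with the preset threshold.
    if best_score > threshold:
        # Step 107: the feature item inherits the corpus entry's result.
        category, level = rules[best_entry]
        return {"item": feature_item, "category": category, "level": level,
                "source": "auto", "score": best_score}
    # Step 108: no confident match, so the item goes to manual grading.
    return {"item": feature_item, "category": None, "level": None,
            "source": "manual", "score": best_score}

auto = grade("customer names")
manual = grade("warehouse temperature")
```

Returning the score in both branches lets a reviewer see how close a manually routed item came to the threshold.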

Compared with the prior art, the invention has the following advantages:

1. The technical scheme introduces the concept of a feature item, which identifies the type of data that a piece of metadata belongs to and is derived mainly from the field description content of the metadata. After massive metadata are read into the system through the data acquisition module, the system performs customized data cleaning and preprocessing on the field description contents and extracts a standard feature item library, which is the basis for classifying and grading the metadata. The data cleaning rules can be defined flexibly according to different requirements, which enhances user interactivity and improves the user experience.

2. The scheme introduces a similarity algorithm to calculate the similarity between each feature item and the classification and grading rule corpus of the known industry. The similarity between the feature item and each entry of the rule corpus is calculated one by one to obtain the highest similarity value, and when that value exceeds a preset threshold the classification and grading result of the feature item is obtained automatically. The user can customize the similarity threshold as required: the larger the threshold, the higher the accuracy of the classification and grading result, which greatly reduces the manual auditing cost; the smaller the threshold, the higher the coverage of the classification and grading result, but at the cost of lower accuracy.

Drawings

Fig. 1 is a process flow diagram of the data classification and grading method based on a similarity algorithm of the present invention.

Detailed Description

In order to enhance the understanding of the present invention, the present embodiment will be described in detail with reference to the accompanying drawings.

Example 1: referring to Fig. 1, a data classification and grading method based on a similarity algorithm includes:

The data acquisition module is mainly responsible for acquiring the feature content of massive metadata, including field names, field descriptions, library names, table descriptions and sampled field content values: when massive metadata are accessed by the system, all relevant information of the metadata, including the field names and field description contents, is automatically extracted.

First, customized data cleaning is applied to the field names and field description contents. Field name preprocessing mainly comprises case conversion, deletion of purely numeric names, deletion of single-letter names and deletion of enumeration digits. Field description preprocessing mainly comprises unifying punctuation marks, deleting the content after designated punctuation marks, deleting enumeration numbers, deleting designated symbols, deleting non-Chinese values and the like. This yields a cleaned pre-feature item library and alarm name information. Second, the pre-feature item library is standardized by deleting single characters, numbers, blank characters and the like, to obtain a standard feature item library.
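The cleaning rules above can be sketched as a handful of small functions. This is a minimal sketch under stated assumptions: the punctuation map, the cut-off mark, the symbol list and all function names are hypothetical rule settings, since the method leaves them user-configurable. The deletion of non-Chinese values is omitted because the sample strings here are in English.

```python
import re

# Hypothetical, user-configurable rule settings
PUNCT_MAP = str.maketrans({"，": ",", "：": ":", "；": ";", "（": "(", "）": ")"})
CUT_AFTER = ":"        # designated mark: everything after it is dropped
DROP_SYMBOLS = "()#*"  # designated symbols removed outright

def clean_field_name(name):
    """Lowercase the name, strip trailing enumeration digits, and reject
    names that end up empty (purely numeric) or one character long."""
    name = name.strip().lower()
    name = re.sub(r"[_\-]?\d+$", "", name)  # e.g. "addr_2" -> "addr"
    return name if len(name) > 1 else None

def clean_field_description(desc):
    """Unify punctuation, cut at the designated mark, remove digits and
    designated symbols."""
    desc = desc.translate(PUNCT_MAP)
    desc = desc.split(CUT_AFTER, 1)[0]
    desc = re.sub(r"\d+", "", desc)
    desc = desc.translate(str.maketrans("", "", DROP_SYMBOLS))
    return desc.strip()

def standardize(items):
    """Drop empty, single-character and purely numeric pre-feature items."""
    return [i for i in items if i and len(i) > 1 and not i.isdigit()]

names = [clean_field_name(n) for n in ["ADDR_2", "123", "A", "Cust_Name"]]
described = clean_field_description("Gender：1 male, 2 female")
standard = standardize(["gender", "", "x", "42", "customer name"])
```

Keeping each rule in its own function makes it straightforward to switch individual rules on or off per data source, in line with the customizable cleaning the embodiment describes.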

A mapping between source words and target words is defined in the synonym library, the target words being unified with the feature item words used in the classification and grading rules of the known industry; synonym replacement is then applied to the feature items of the standard feature item library.
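A synonym library of this kind can be sketched as a plain mapping. The entries in `SYNONYMS` and the function name `replace_synonyms` are hypothetical, and whole-item replacement (rather than word-by-word replacement) is an assumption, since the source does not say at which granularity the replacement is applied.

```python
# Hypothetical synonym library: source word -> target word
SYNONYMS = {
    "cellphone": "mobile phone number",
    "mobile no": "mobile phone number",
    "client name": "customer name",
}

def replace_synonyms(feature_items, synonyms=SYNONYMS):
    """Map each feature item to its target word; items without a
    mapping pass through unchanged."""
    return [synonyms.get(item, item) for item in feature_items]

normalized = replace_synonyms(["cellphone", "client name", "gender"])
```

Normalizing to the vocabulary of the rule corpus before the similarity step raises the chance that equivalent feature items share terms with a corpus entry.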

The classification and grading rules of the known industry are read dynamically, and the feature item names in the rules are extracted as the text set for similarity calculation;

The text set is preprocessed: each feature item is segmented into words, stop words and the like are deleted, and a two-dimensional list of the text set is obtained;

Bag-of-words vectors are constructed, a TF-IDF model is created, and the corpus is passed into the model for training;

The bag-of-words vectors are converted into a new vector corpus through the trained TF-IDF model, and a cosine similarity index is constructed;

The feature item text is retrieved and converted into a TF-IDF vector, the cosine similarity between the feature item text and the rule corpus texts is calculated through the cosine similarity index, and it is judged whether the highest similarity value is greater than a preset threshold; if so, the corresponding corpus feature item is read as the result.
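The source does not state the formulas behind these steps. One common formulation, given here as an assumption rather than as the patented definition, weights each term $t$ of a text $d$ by TF-IDF with a smoothed IDF and compares texts by cosine similarity:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\left(\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1\right),
\qquad
\cos(q, d) = \frac{\sum_{t} q_t\, d_t}{\sqrt{\sum_{t} q_t^{2}}\,\sqrt{\sum_{t} d_t^{2}}}
```

Here $\mathrm{tf}(t, d)$ is the number of times $t$ occurs in $d$, $N$ is the number of texts in the rule corpus, $\mathrm{df}(t)$ is the number of corpus texts containing $t$, and $q_t$ and $d_t$ are the TF-IDF weights of term $t$ in the feature item text $q$ and the corpus text $d$. With a threshold $\theta$ (a symbol introduced here for illustration), the feature item is matched to $d^{*} = \arg\max_{d} \cos(q, d)$ only if $\cos(q, d^{*}) > \theta$.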

The classification and grading result mapping module maps the corpus feature items with the highest similarity one by one to the classification and grading results in the database, which yields the result for the standard feature items.

The method comprises the following steps:

It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims

1. A data classification and grading method based on a similarity algorithm, characterized by comprising the following:

The data acquisition module acquires the feature content of massive metadata, including field names and field description contents, and automatically extracts the field names and field description contents of the metadata once the massive metadata are accessed by the system;

The feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library;

The similarity algorithm module calculates, one by one, the similarity between the standard feature items output by the feature item module and a rule corpus of the known industry, obtains the corpus feature item with the highest similarity that also exceeds a preset threshold, and uses that corpus feature item as the matching result of the standard feature item.

2. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the feature item module performs data cleaning and preprocessing on the feature content to obtain a standard feature item library, and performs statistics and merging on the standard feature item library, specifically as follows:

First, customized data cleaning is applied to the field names and field description contents, including unifying letter case, unifying the punctuation format, deleting symbols and contents, and deleting enumeration values, to obtain a cleaned pre-feature item library and alarm name information; second, the pre-feature item library is standardized by deleting pre-feature items that are single characters, numbers, blank characters and the like, to obtain a standard feature item library;

First, the number of occurrences of each feature item name in the standard feature item library is counted; second, the field names that share the same feature item are merged and used as the matching content of that feature item, and the user chooses to match metadata either through the feature item or through the matching content, so that the type of the metadata is identified.

3. The data classification and grading method based on a similarity algorithm according to claim 1, wherein the similarity algorithm module calculates, one by one, the similarity between the feature items output by the feature item module and a rule corpus of the known industry, obtains the corpus entry with the highest similarity that also exceeds a preset threshold, and takes the classification and grading result of that corpus entry as the result of the feature item, the method specifically comprising the following steps:

4. The data classification and grading method based on a similarity algorithm according to claim 2, characterized in that the method comprises the following steps:

Step 107, the classification and grading result of the corpus entry corresponding to the highest similarity value is read as the classification and grading result of the feature item, which in turn yields the classification and grading result of the matching content of the feature item, namely the data;

Step 108, the feature item is classified and graded manually, which in turn yields the classification and grading result of the matching content of the feature item, namely the data.