CN115687725A - Data classification and classification method and device - Google Patents

Data classification and classification method and device Download PDF

Info

Publication number
CN115687725A
CN115687725A CN202211274350.8A CN202211274350A CN115687725A CN 115687725 A CN115687725 A CN 115687725A CN 202211274350 A CN202211274350 A CN 202211274350A CN 115687725 A CN115687725 A CN 115687725A
Authority
CN
China
Prior art keywords
data
classification
class
category
belongs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211274350.8A
Other languages
Chinese (zh)
Inventor
郭瑾仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202211274350.8A priority Critical patent/CN115687725A/en
Publication of CN115687725A publication Critical patent/CN115687725A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for classifying and grading data, and relates to the technical field of computers. One embodiment of the method comprises: carrying out first classification on data to be classified and classified by using a first data classification rule to obtain a first class of the data; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs; acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade; and taking the second category and the level corresponding to the second category as a classification and grading result of the data. The implementation mode determines the level of the data according to the classification and classification rules and the category of the data, can be used in different industries, and has strong portability and flexibility.

Description

Data classification and classification method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for classifying and grading data.
Background
With the advent of network security laws such as the personal information protection law and the data security law, enterprises need to classify and grade the stored data, thereby realizing more refined information security management. Most of the existing data classification and classification are data of specific industries, and classification are carried out through fixed classification and classification rules or artificial intelligence data models.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the method for classifying and grading data through the fixed classification and grading rules can only be used in a single industry, has poor transportability, cannot be used in different industries, and is difficult to be rapidly transplanted to other industries. The artificial intelligence data model is used for classifying and grading data, when a new data type is added or a data classification method is changed, a large amount of time and calculation power are consumed to recalculate the model, and flexible expansion and configuration are difficult.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for classifying and grading data, which can separately classify and grade data, can set classification and grading rules corresponding to different industries, determine a grade to which data belongs according to the classification and grading rules and a category of the data, can be used in different industries, and have strong portability and flexibility. Meanwhile, when the data are classified, different classification rules can be used for classifying different data, and the classification rules can be flexibly configured, so that when a new data type is added or a data classification method is changed, the corresponding data classification rules can be flexibly selected, and the classification is more flexible and convenient. In addition, for the structured data, the data can be classified only by using the metadata, so that invasion to the privacy of the user is avoided, the privacy and the data security of the user are better protected, and the method is suitable for scenes in which the data are encrypted.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data classification and classification method, including:
carrying out first classification on data to be classified and classified by using a first data classification rule to obtain a first category to which the data belong;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs;
acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade;
and taking the second category and the level corresponding to the second category as a classification and grading result of the data.
Optionally, the performing, by using a first data classification rule, a first classification on data to be classified and classified to obtain a first class to which the data belongs includes: and performing first classification on data to be classified and classified according to the storage format and the data structure of the data to obtain a first class to which the data belongs, wherein the first class comprises structured data, semi-structured data and unstructured data.
Optionally, in a case that the first category is structured data, the second data classification rule corresponding to the first category is data classification using a metadata-based data dictionary; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises: and performing second classification on the data by using a data dictionary based on metadata to obtain a second class to which the data belongs.
Optionally, the second classifying the data by using a data dictionary based on metadata to obtain a second category to which the data belongs includes: acquiring metadata of the data, and segmenting the metadata; performing word matching on at least one word obtained by word segmentation and the metadata-based data dictionary respectively to obtain hit words and the number of each hit word, and calculating hit rates corresponding to different types of hit words according to the number of the hit words; and determining a second category to which the data belongs according to the hit rate corresponding to the hit words of different categories.
Optionally, in a case that the first category is semi-structured data, the second data classification rule corresponding to the first category includes: at least one of data classification using a content-based data dictionary and data classification based on regular expressions; before the second classification is performed on the data by using the second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further includes: dividing the data into regular data and irregular data consisting of specific rules according to a third data classification rule; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises: performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belongs; performing second classification on regular data in the data based on the regular expression to obtain a fourth class to which the data belongs; and generating a second category to which the data belongs according to the third category and the fourth category to which the data belongs.
Optionally, the second data classification rule corresponding to the first category further includes: performing data classification using a metadata-based data dictionary; before the data is subjected to second classification by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further comprises the following steps: acquiring metadata of the data; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises: performing second classification on the metadata of the data by using a data dictionary based on the metadata to obtain a fifth category to which the data belongs; performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belongs; performing second classification on the regular data in the data based on the regular expression to obtain a fourth class to which the data belong; and generating a second category to which the data belongs according to the fifth category, the third category and the fourth category to which the data belongs.
Optionally, in a case that the first category is unstructured data, the second data classification rule corresponding to the first category includes: classifying the data based on an artificial intelligence data model; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises: and carrying out second classification on the data based on an artificial intelligence data model to obtain a second class to which the data belongs.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for classifying and grading data, including:
the first classification module is used for performing first classification on data to be classified and classified by using a first data classification rule to obtain a first class to which the data belongs;
the second classification module is used for performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs;
the level acquisition module is used for acquiring the level corresponding to the second category according to a preset classification rule, and the classification rule is used for marking the mapping relation between the category to which the data belongs and the level;
and the result determining module is used for taking the second category and the level corresponding to the second category as the classification and grading result of the data.
According to another aspect of the embodiments of the present invention, there is provided an electronic device for classifying and grading data, including: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the method for classifying and grading the data provided by the embodiment of the invention.
According to a further aspect of the embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, the program, when executed by a processor, implementing the method for classifying and ranking data provided by the embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: performing first classification on data to be classified and classified by using a first data classification rule to obtain a first class to which the data belongs; performing second classification on the data by using a second data classification rule corresponding to the first classification to obtain a second classification to which the data belongs; acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade; the technical scheme that the second category and the level corresponding to the second category are used as the classification and grading results of the data is that the data classification and the data grading are carried out separately, classification and grading rules corresponding to different industries can be set, the level of the data is determined according to the classification and grading rules and the category of the data, the data can be used in different industries, and the portability and the flexibility are strong. Meanwhile, when the data are classified, different classification rules can be used for classifying different data, and the classification rules can be flexibly configured, so that when a new data type is added or a data classification method is changed, the corresponding data classification rules can be flexibly selected, and the classification is more flexible and convenient. In addition, for the structured data, the data can be classified only by using the metadata, so that invasion to the privacy of the user is avoided, the privacy of the user and the data security are better protected, and the method is suitable for a scene that the data is encrypted.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for data classification and classification according to an embodiment of the present invention;
FIG. 2 is an overall architecture diagram of a data classification and classification system of an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of an apparatus for classifying and grading data according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to solve the technical problems in the prior art, the invention provides a method and a device for classifying and grading data, which are used for classifying and grading data separately, can set classification and grading rules corresponding to different industries, determine the grade of the data according to the classification and grading rules and the class of the data, can be used in different industries, and have strong portability and flexibility. Meanwhile, when the data are classified, different classification rules can be used for classifying different data, and the classification rules can be flexibly configured, so that when a new data type is added or a data classification method is changed, the corresponding data classification rules can be flexibly selected, and the classification is more flexible and convenient. In addition, for the structured data, the data can be classified only by using the metadata, so that invasion to the privacy of the user is avoided, the privacy of the user and the data security are better protected, and the method is suitable for a scene that the data is encrypted.
Fig. 1 is a schematic diagram of the main steps of a data classification and classification method according to an embodiment of the present invention. As shown in fig. 1, the method for classifying and grading data according to the embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: and carrying out first classification on the data to be classified and classified by using a first data classification rule to obtain a first class of the data. In an embodiment of the present invention, in order to facilitate data classification for data in various formats, storage forms, and the like, the data may be first classified.
Specifically, according to an embodiment of the present invention, the performing, by using a first data classification rule, a first classification on the data to be classified and classified to obtain a first category to which the data belongs may specifically include: and performing first classification on the data to be classified and classified according to the storage format and the data structure of the data to obtain a first class to which the data belongs, wherein the first class comprises structured data, semi-structured data and unstructured data. The structured data is also called row data, is logically expressed and realized by a two-dimensional table structure, strictly follows the data format and length specification, and is mainly stored and managed by a relational database. Such as Enterprise ERP (Enterprise Resource Planning), financial System, medical HIS (Hospital Information System) database, educational all-purpose card, government administration approval, data stored in other core databases, and the like. Unstructured data is data which is irregular or incomplete in data structure, has no predefined data model and is inconvenient to represent by a database two-dimensional logic table, such as WORD, PDF, PPT and EXL documents, and pictures and videos in various formats. Semi-structured data: the non-relational model stores data in a substantially fixed structure schema, such as a log file, an XML (extensible markup language) document, a JSON (JSON Object Notation) document, email (mail data), and the like.
Step S102: and carrying out second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs. By using the first data classification rule to perform the first classification on the data, a second data classification rule which is more detailed and accurate can be selected based on the first classification result, so that the data can be further accurately classified.
According to one embodiment of the invention, in the case that the first category is structured data, the second data classification rule corresponding to the first category is data classification using a metadata-based data dictionary. Performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, which may specifically include: and performing second classification on the data by using a data dictionary based on the metadata to obtain a second category to which the data belongs. For the structured data, management is generally performed in the form of a relational database table, the storage structure and the specification of the structured data are strict, and therefore, in the embodiment of the invention, the structured data is classified by using a data dictionary based on metadata. The metadata-based data dictionary is formulated after statistical analysis according to metadata names and occurrence frequencies of descriptions, and metadata information which possibly appears in the classification basis is stored.
In an embodiment of the present invention, performing a second classification on the data by using a metadata-based data dictionary to obtain a second category to which the data belongs may specifically include: acquiring metadata of the data, and segmenting the metadata; performing word matching on at least one word obtained by word segmentation and the metadata-based data dictionary respectively to obtain hit words and the number of each hit word, and calculating hit rates corresponding to different types of hit words according to the number of the hit words; and determining a second category to which the data belongs according to the hit rate corresponding to the hit words of different categories. Specifically, when the data is classified secondly by using the metadata-based data dictionary, the data is firstly segmented, and then the segmentation result is matched with the metadata-based data dictionary one by one, and the word identical to the word obtained by segmentation is the hit word. And then, calculating the hit rate of each hit word, wherein the hit rate is the proportion of the number of the hit words to the total number of words obtained by data word segmentation, and calculating according to a specific formula, and the formula can be adjusted according to the situation. And finally, taking the category corresponding to the word with the highest hit rate as a classification result of the data, namely the second category to which the data belongs. For example, when structured data stored in the Mysql database is classified, the classification result of the data table and the field can be obtained only by performing data classification according to metadata such as the database name, the table name and the table description, the field name and the field description, and the like.
In another embodiment of the present invention, in a case that the first category is semi-structured data, the second data classification rule corresponding to the first category includes: at least one of data classification using a content-based data dictionary and data classification based on regular expressions. Before the data is subjected to second classification by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further comprises the following steps: the data is divided into regular data and irregular data composed of specific rules according to a third data classification rule. Performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, which may specifically include: performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belong; performing second classification on the rule data in the data based on the regular expression to obtain a fourth class to which the data belongs; and generating a second category to which the data belongs according to the third category and the fourth category to which the data belongs.
For semi-structured data, it is data with a basic fixed structure mode, so when classifying it, it is necessary to first classify the data into regular data and irregular data composed of specific rules according to a third data classification rule, where the regular data includes, for example: the mobile phone number, the identification number and the like are data composed of specific rules, and the irregular data comprises place names, names and the like which are not data generated according to the specific rules. In the embodiment of the present invention, data can be classified more accurately by further dividing the data into regular data and irregular data and classifying the data using different classification methods, respectively.
In an embodiment of the present invention, for irregular data, it may be classified secondly using a content-based data dictionary, wherein the content-based data dictionary is constructed after statistical analysis based on a large number of data samples, and data that can be used for identifying data categories and association relations between data categories are saved. The content-based data dictionary may be used, for example, to data sort data types that include a determined number of cities, provinces, nationalities, and the like. For the rule data, it may be classified secondly based on a regular expression, for example, generated after feature extraction based on a specific rule followed by the rule data. The regular expression can be used for carrying out data classification on data types consisting of specific rules, such as mobile phone numbers, identification card numbers, license plate numbers, dates and the like.
Since the semi-structured data may include both regular data and irregular data, classifying the regular data and the irregular data respectively can obtain a fourth category to which the regular data belongs and a third category to which the irregular data belongs, and then integrating the regular data and the corresponding fourth category as well as the irregular data and the corresponding third category to obtain a second category corresponding to the semi-structured data. Specifically, for example, the semi-structured data stored in the Mongo database is classified, data classification needs to be performed according to all key value pair (document) data therein, all key value pair data are classified by using a content-based data dictionary and a regular expression, and finally, a classification result of each key value pair in the data table is obtained.
In another embodiment of the present invention, when the semi-structured data is classified, the metadata corresponding to the semi-structured data can be used to perform a preliminary classification, and then perform a data classification to perform a data classification more accurately. Specifically, the second data classification rule corresponding to the first class further includes: data classification is performed using a metadata-based data dictionary. Before performing a second classification on the data using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further includes: metadata of the data is obtained. Performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises the following steps: performing second classification on the metadata of the data by using a metadata-based data dictionary to obtain a fifth category to which the data belongs; performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belongs; performing second classification on regular data in the data based on the regular expression to obtain a fourth class to which the data belongs; and generating a second category to which the data belongs according to a fifth category, a third category and a fourth category to which the data belongs.
In this embodiment, for the semi-structured data, the metadata of the data is first classified by using the metadata-based data dictionary to obtain a fifth category to which the data belongs, for example, a classification result corresponding to a data table in a Mongo database. Since the semi-structured data has less metadata information, the content of the data needs to be further classified, so that the classification result is more accurate. When data content is classified, namely a data dictionary based on content is used for carrying out second classification on irregular data in the data to obtain a third class to which the data belongs; and performing second classification on the regular data in the data based on the regular expression to obtain a fourth class to which the data belongs. Then, the metadata of the data and the corresponding fifth category thereof, the regular data and the corresponding fourth category thereof, and the irregular data and the corresponding third category thereof are integrated to obtain the second category corresponding to the semi-structured data. Specifically, for example, the semi-structured data stored in the Mongo database is classified, data classification needs to be performed according to all key value pair (document) data and metadata of the data table, a data dictionary based on the metadata is used for performing preliminary classification on the metadata of the data, then a data dictionary based on the content and a regular expression are used for further classifying all key value pair data, and finally a classification result of each key value pair in the data table and the data table is obtained.
In a further embodiment of the present invention, in a case where the first category is unstructured data, the second data classification rule corresponding to the first category includes: and carrying out data classification based on the artificial intelligence data model. Performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, which may specifically include: and carrying out second classification on the data based on an artificial intelligence data model to obtain a second class to which the data belongs. In embodiments of the present invention, an artificial intelligence data model may be used to classify unstructured data, such as: and classifying data such as strategic planning of companies, project management documents and the like by using an artificial intelligence data model. The artificial intelligence data model can be trained in advance, and existing data in industries or enterprises can be trained by using a deep learning or machine learning method and adopting a supervised learning mode when the data model is trained. The algorithm and the specific form adopted by the artificial intelligence data model during training depend on the existing data available by the industry or enterprises. The present invention is not particularly limited in this regard.
In the embodiment of the present invention, for the second data classification rule, the user and the administrator may flexibly set, for example, a category of the data dictionary may be added, or dictionary content of an existing category may be extended and modified, a type identified by a regular expression may be added, or an existing regular expression may be modified, and the like.
Step S103: and acquiring the grade corresponding to the second category according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the category to which the data belongs and the grade. The classification and classification rules of the embodiment of the invention can be flexibly set, and users and administrators can add new classification and classification rules of industries or modify, add and delete the classification and classification rules in the existing rule templates. After the categories corresponding to the data are obtained according to the foregoing steps S101 and S102, the corresponding levels can be obtained according to the set classification and classification rules of different industries. For the mapping relationship between the category and the level marked in the classification and classification rule, the user and the administrator can add, delete or modify the mapping relationship between the classification result and the rule according to the requirement. In the embodiment of the invention, the mapping relation between the categories and the levels and the classification and grading rules can be freely configured by a user or a system administrator, so that different industrial scenes can be quickly and flexibly dealt with, and customized data classification and grading are carried out.
Step S104: and taking the second category and the level corresponding to the second category as a classification and grading result of the data.
FIG. 2 is an overall architecture diagram of a data classification and ranking system of an embodiment of the present invention. As shown in fig. 2, the data classification and classification system according to the embodiment of the present invention mainly includes three parts, namely, classification rules of data, classification and classification rules, and correspondence relationships between the classification rules and the classification and classification rules. Embodiments of the present invention subject data to a first sort of bit structured data, semi-structured data, and unstructured data. Classifying the structured data by using a data dictionary based on metadata; for semi-structured data, a data dictionary based on metadata, a data dictionary based on content and a regular expression can be used for classifying the semi-structured data, specifically, the metadata of the semi-structured data can be obtained firstly, the semi-structured data is divided into regular data and irregular data, then the metadata of the data is classified by using the data dictionary based on the metadata, the irregular data in the data is classified by using the data dictionary based on the content, the regular data in the data is classified by using the regular expression, and finally, a data classification result is obtained by comprehensive summarization; for unstructured data, the data is classified using an artificial intelligence data model. And then, according to the preset corresponding relation, the system corresponds the identified data category with the classification and classification rule, and finally obtains the classification and classification result of the data. The classification and classification rules are, for example, general rules of various industries, financial industry rules, medical industry rules, and the like.
Fig. 3 is a schematic diagram of main modules of a data classification and classification device according to an embodiment of the present invention. As shown in fig. 3, the apparatus 300 for classifying and grading data according to the embodiment of the present invention mainly includes: a first classification module 301, a second classification module 302, a level acquisition module 303 and a result determination module 304.
A first classification module 301, configured to perform first classification on data to be classified and classified by using a first data classification rule to obtain a first category to which the data belongs;
a second classification module 302, configured to perform second classification on the data using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs;
a level obtaining module 303, configured to obtain a level corresponding to the second category according to a preset classification rule, where the classification rule is used to mark a mapping relationship between a category to which the data belongs and the level;
a result determining module 304, configured to use the second category and a level corresponding to the second category as a classification result of the data.
According to an embodiment of the present invention, the first classification module 301 may be further configured to: and performing first classification on data to be classified and classified according to the storage format and the data structure of the data to obtain a first class to which the data belongs, wherein the first class comprises structured data, semi-structured data and unstructured data.
According to another embodiment of the present invention, in the case that the first category is structured data, the second data classification rule corresponding to the first category is data classification using a metadata-based data dictionary; the second classification module 302 may also be configured to: and performing second classification on the data by using a data dictionary based on metadata to obtain a second class to which the data belongs.
According to another embodiment of the present invention, the second classification module 302 may specifically be configured to: acquiring metadata of the data, and segmenting the metadata; performing word matching on at least one word obtained by word segmentation and the metadata-based data dictionary respectively to obtain hit words and the number of each hit word, and calculating hit rates corresponding to different types of hit words according to the number of the hit words; and determining a second category to which the data belongs according to the hit rate corresponding to the hit words of different categories.
According to another embodiment of the present invention, in the case that the first category is semi-structured data, the second data classification rule corresponding to the first category includes: at least one of data classification using a content-based data dictionary and data classification based on regular expressions. The apparatus 300 for classifying and grading data according to the embodiment of the present invention may further include a third classification module (not shown in the drawings) for: and before the data is subjected to second classification by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the data is divided into regular data and irregular data which are composed of specific rules according to a third data classification rule. Also, the second classification module 302 may be further configured to: performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belongs; performing second classification on regular data in the data based on the regular expression to obtain a fourth class to which the data belongs; and generating a second category to which the data belongs according to the third category and the fourth category to which the data belongs.
According to another embodiment of the present invention, the second data classification rule corresponding to the first category further includes: performing data classification using a metadata-based data dictionary; the apparatus 300 for classifying and grading data of the embodiment of the present invention may further include a metadata obtaining module (not shown in the figure) configured to: and acquiring metadata of the data before performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs. Also, the second classification module 302 may be further configured to: performing second classification on the metadata of the data by using a data dictionary based on the metadata to obtain a fifth category to which the data belongs; performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belongs; performing second classification on regular data in the data based on the regular expression to obtain a fourth class to which the data belongs; and generating a second category to which the data belongs according to the fifth category, the third category and the fourth category to which the data belongs.
According to another embodiment of the present invention, in the case that the first category is unstructured data, the second data classification rule corresponding to the first category comprises: and classifying the data based on the artificial intelligence data model. Also, the second classification module 302 may be further configured to: and carrying out second classification on the data based on an artificial intelligence data model to obtain a second class to which the data belongs.
According to the technical scheme of the embodiment of the invention, the data to be classified and graded is subjected to first classification by using a first data classification rule to obtain a first category to which the data belongs; performing second classification on the data by using a second data classification rule corresponding to the first classification to obtain a second classification to which the data belongs; acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade; the technical scheme that the second category and the level corresponding to the second category are used as the classification and grading results of the data is that the data classification and the data grading are carried out separately, classification and grading rules corresponding to different industries can be set, the level of the data is determined according to the classification and grading rules and the category of the data, the data can be used in different industries, and the portability and the flexibility are strong. Meanwhile, when the data are classified, different classification rules can be used for classifying different data, and the classification rules can be flexibly configured, so that when a new data type is added or a data classification method is changed, the corresponding data classification rules can be flexibly selected, and the classification is more flexible and convenient. In addition, for the structured data, the data can be classified only by using the metadata, so that invasion to the privacy of the user is avoided, the privacy of the user and the data security are better protected, and the method is suitable for a scene that the data is encrypted.
Fig. 4 illustrates an exemplary system architecture 400 of a method of data classification ranking or an apparatus of data classification ranking to which embodiments of the present invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 via a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as data management-like applications, data analysis applications, search-like applications, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for a data classification hierarchical website browsed by a user using the terminal devices 401, 402, 403. The background management server can perform first classification on data to be classified and classified by using a first data classification rule on the received data classification and classification requests and the like to obtain a first category of the data; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs; acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade; and processing the second category and the level corresponding to the second category as a classification and grading result of the data, and feeding back the processing result (such as the classification and grading result-just an example) to the terminal equipment.
It should be noted that the method for classifying and grading data provided by the embodiment of the present invention is generally performed by the server 405, and accordingly, the device for classifying and grading data is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware. The described units or modules may also be provided in a processor, and may be described as: a processor includes a first classification module, a second classification module, a level acquisition module, and a result determination module. Where the names of these units or modules do not in some cases constitute a limitation of the unit or module itself, for example, the first classification module may also be described as "a module for performing a first classification of data to be classified and ranked using a first data classification rule to obtain a first category to which the data belongs".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: carrying out first classification on data to be classified and classified by using a first data classification rule to obtain a first class to which the data belongs; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs; acquiring the grade corresponding to the second category according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the category to which the data belongs and the grade; and taking the second category and the level corresponding to the second category as a classification and grading result of the data.
According to the technical scheme of the embodiment of the invention, data to be classified and classified are subjected to first classification by using a first data classification rule to obtain a first class to which the data belong; performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs; acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade; the technical scheme that the second category and the level corresponding to the second category are used as the classification and grading results of the data is that the data classification and the data grading are carried out separately, classification and grading rules corresponding to different industries can be set, the level of the data is determined according to the classification and grading rules and the category of the data, the data can be used in different industries, and the portability and the flexibility are strong. Meanwhile, when the data are classified, different classification rules can be used for classifying different data, and the classification rules can be flexibly configured, so that when a new data type is added or a data classification method is changed, the corresponding data classification rules can be flexibly selected, and the classification is more flexible and convenient. In addition, for the structured data, the data can be classified only by using the metadata, so that invasion to the privacy of the user is avoided, the privacy of the user and the data security are better protected, and the method is suitable for a scene that the data is encrypted.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of data classification and ranking, comprising:
carrying out first classification on data to be classified and classified by using a first data classification rule to obtain a first class to which the data belongs;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs;
acquiring the grade corresponding to the second class according to a preset classification and grading rule, wherein the classification and grading rule is used for marking the mapping relation between the class to which the data belongs and the grade;
and taking the second category and the level corresponding to the second category as a classification and grading result of the data.
2. The method of claim 1, wherein performing a first classification on data to be classified and ranked using a first data classification rule to obtain a first category to which the data belongs comprises:
the method comprises the steps of carrying out first classification on data to be classified and classified according to the storage format and the data structure of the data to obtain a first class to which the data belong, wherein the first class comprises structured data, semi-structured data and unstructured data.
3. The method according to claim 2, wherein in the case that the first category is structured data, the second data classification rule corresponding to the first category is data classification using a metadata-based data dictionary;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises:
and performing second classification on the data by using a data dictionary based on metadata to obtain a second category to which the data belongs.
4. The method of claim 3, wherein the second classification of the data using a metadata-based data dictionary into a second category to which the data belongs comprises:
acquiring metadata of the data, and segmenting the metadata;
performing word matching on at least one word obtained by word segmentation and the data dictionary based on the metadata to obtain hit words and the number of each hit word, and calculating the hit rates corresponding to different types of hit words according to the number of the hit words;
and determining a second category to which the data belongs according to the hit rate corresponding to the hit words of different categories.
5. The method according to claim 2, wherein in the case that the first category is semi-structured data, the second data classification rule corresponding to the first category comprises: at least one of data classification using a content-based data dictionary and data classification based on regular expressions;
before the second classification is performed on the data by using the second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further includes:
dividing the data into regular data and irregular data consisting of specific rules according to a third data classification rule;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises:
performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belong;
performing second classification on the regular data in the data based on the regular expression to obtain a fourth class to which the data belong;
and generating a second category to which the data belongs according to the third category and the fourth category to which the data belongs.
6. The method of claim 5, wherein the second data classification rule corresponding to the first category further comprises: performing data classification using a metadata-based data dictionary;
before the data is subjected to second classification by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, the method further comprises the following steps:
acquiring metadata of the data;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises:
performing second classification on the metadata of the data by using a metadata-based data dictionary to obtain a fifth category to which the data belongs;
performing second classification on irregular data in the data by using a content-based data dictionary to obtain a third class to which the data belong;
performing second classification on regular data in the data based on the regular expression to obtain a fourth class to which the data belongs;
and generating a second category to which the data belongs according to a fifth category, a third category and a fourth category to which the data belongs.
7. The method according to claim 2, wherein in the case that the first category is unstructured data, the second data classification rule corresponding to the first category comprises: classifying data based on an artificial intelligence data model;
performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs, wherein the second class comprises:
and carrying out second classification on the data based on an artificial intelligence data model to obtain a second class to which the data belongs.
8. An apparatus for classifying and grading data, comprising:
the first classification module is used for performing first classification on data to be classified and classified by using a first data classification rule to obtain a first class to which the data belongs;
the second classification module is used for performing second classification on the data by using a second data classification rule corresponding to the first class to obtain a second class to which the data belongs;
the level acquisition module is used for acquiring the level corresponding to the second category according to a preset classification rule, and the classification rule is used for marking the mapping relation between the category to which the data belongs and the level;
and the result determining module is used for taking the second category and the level corresponding to the second category as the classification and grading result of the data.
9. An electronic device for data classification and classification, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202211274350.8A 2022-10-18 2022-10-18 Data classification and classification method and device Pending CN115687725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211274350.8A CN115687725A (en) 2022-10-18 2022-10-18 Data classification and classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211274350.8A CN115687725A (en) 2022-10-18 2022-10-18 Data classification and classification method and device

Publications (1)

Publication Number Publication Date
CN115687725A true CN115687725A (en) 2023-02-03

Family

ID=85066892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211274350.8A Pending CN115687725A (en) 2022-10-18 2022-10-18 Data classification and classification method and device

Country Status (1)

Country Link
CN (1) CN115687725A (en)

Similar Documents

Publication Publication Date Title
US20190163742A1 (en) Method and apparatus for generating information
US20170286489A1 (en) Data processing
US20210174277A1 (en) Compliance management for emerging risks
US11681817B2 (en) System and method for implementing attribute classification for PII data
CN110689268B (en) Method and device for extracting indexes
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
US10146881B2 (en) Scalable processing of heterogeneous user-generated content
CN111553556A (en) Business data analysis method and device, computer equipment and storage medium
CN111143505B (en) Document processing method, device, medium and electronic equipment
CN111651552A (en) Structured information determination method and device and electronic equipment
US11163761B2 (en) Vector embedding models for relational tables with null or equivalent values
CN114036921A (en) Policy information matching method and device
CN116450723A (en) Data extraction method, device, computer equipment and storage medium
US20200019647A1 (en) Detection of missing entities in a graph schema
CN112256566B (en) Fresh-keeping method and device for test cases
CN115687725A (en) Data classification and classification method and device
CN111368036B (en) Method and device for searching information
CN113095078A (en) Associated asset determination method and device and electronic equipment
US9251125B2 (en) Managing text in documents based on a log of research corresponding to the text
CN113742321A (en) Data updating method and device
CN110908663A (en) Service problem positioning method and positioning device
CN115658901A (en) Method, device, equipment and computer readable medium for data classification
CN113362097B (en) User determination method and device
WO2021155711A1 (en) Method and apparatus for identifying attribute word of article, and device and storage medium
CN115878705A (en) Index query method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination