CN117556050A - Data classification and classification method and device, electronic equipment and storage medium - Google Patents

Publication number
CN117556050A
CN117556050A CN202410044807.9A CN202410044807A CN117556050A CN 117556050 A CN117556050 A CN 117556050A CN 202410044807 A CN202410044807 A CN 202410044807A CN 117556050 A CN117556050 A CN 117556050A
Authority
CN
China
Prior art keywords
field
rule
category
description
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410044807.9A
Other languages
Chinese (zh)
Other versions
CN117556050B (en)
Inventor
张子辰
张超
李健
何小朝
陈林生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun Jida Zhengyuan Information Technology Co ltd
Original Assignee
Changchun Jida Zhengyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun Jida Zhengyuan Information Technology Co ltd
Priority to CN202410044807.9A
Publication of CN117556050A
Application granted
Publication of CN117556050B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a data classification and grading method and device, electronic equipment and a storage medium, and relates to the technical field of data processing. The method comprises the following steps: acquiring the field name and corresponding field values of at least one field in a data set to be classified and graded; when the data set to be classified and graded is a content data set, acquiring a field description prompt word for a single field and, according to the field description prompt word, using a large language model to semantically understand the field values to obtain a field description of the field values; acquiring the rule category, rule level and field category of the rules in a rule table; and, for a single field, calculating the similarity between the field and each rule according to the field name, field description and corresponding field values of the field and the rule category and field category of each rule, obtaining at least one rule with higher similarity for the field, and determining the data category and data level of the data set to be classified and graded. The technical scheme of the embodiments of the invention improves the generalization, efficiency and accuracy of data classification and grading.

Description

Data classification and classification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data classification and grading method and device, electronic equipment, and a storage medium.
Background
Data classification and grading determines the data category and data level of massive service data according to classification and grading rules, so that the service system can determine how to process the service data.
Existing classification and grading approaches generally use techniques such as keywords, regular expressions and data dictionaries, combined with manual verification, to classify and grade data.
However, existing approaches can only classify and grade data sets with obvious format rules, which severely limits their applicability. They also depend on manual work, so the efficiency and accuracy of data classification and grading are difficult to guarantee, and improvement is needed.
Disclosure of Invention
The invention provides a data classification and grading method and device, electronic equipment and a storage medium, which improve the generalization, efficiency and accuracy of data classification and grading.
According to an aspect of the present invention, there is provided a data classification and grading method, the method comprising:
acquiring a field name and corresponding field values of at least one field in the data set to be classified and graded;
when the data set to be classified and graded is a content data set, acquiring a field description prompt word for a single field and, according to the field description prompt word, using a large language model to semantically understand the field values to obtain a field description of the field values;
acquiring the rule category, rule level and field category of at least one rule in the rule table, wherein the rule is used for determining the processing mode of the data set to be classified and graded;
for a single field, calculating the similarity between the field and each rule according to the field name, field description and corresponding field values of the field and the rule category and field category of each rule, obtaining at least one rule with higher similarity for the field, and determining the data category and data level of the data set to be classified and graded.
According to another aspect of the present invention, there is provided a data classification and grading device, the device comprising:
a data set acquisition module, used for acquiring a field name and corresponding field values of at least one field in the data set to be classified and graded;
a field description determining module, used for acquiring a field description prompt word for a single field when the data set to be classified and graded is a content data set and, according to the field description prompt word, using a large language model to semantically understand the field values to obtain a field description of the field values;
a rule category acquisition module, used for acquiring the rule category, rule level and field category of at least one rule in the rule table, wherein the rule is used for determining the processing mode of the data set to be classified and graded;
a data classification and grading module, used for calculating, for a single field, the similarity between the field and each rule according to the field name, field description and corresponding field values of the field and the rule category and field category of each rule, obtaining at least one rule with higher similarity for the field, and determining the data category and data level of the data set to be classified and graded.
According to another aspect of the present invention, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data classification and grading method according to any one of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions which, when executed, cause a processor to implement the data classification and grading method according to any embodiment of the present invention.
According to the technical scheme of the embodiments of the invention, when the data set to be classified and graded is a content data set, a field description prompt word is acquired for a single field, and a large language model semantically understands the field values according to the prompt word to obtain a field description of the field values. The rule category, rule level and field category of at least one rule in the rule table are acquired. For a single field, the similarity between the field and each rule is calculated according to the field name, field description and corresponding field values of the field and the rule category and field category of each rule, at least one rule with higher similarity is obtained for the field, and the data category and data level of the data set to be classified and graded are determined. This solves the problem that existing approaches can only classify and grade data sets with obvious format rules, and so improves the generalization of data classification and grading. It also solves the problem that efficiency and accuracy are difficult to guarantee when classification and grading depend on manual work, and so improves the efficiency and accuracy of data classification and grading.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data classification and grading method according to a first embodiment of the invention;
FIG. 2 is a flow chart of a data classification and grading method according to a second embodiment of the invention;
FIG. 3 is a flow chart of a data classification and grading method according to a second embodiment of the invention;
FIG. 4 is a flow chart of a data classification and grading method according to a third embodiment of the invention;
FIG. 5 is a flow chart of a data classification and grading method according to a third embodiment of the invention;
FIG. 6 is a schematic structural diagram of a data classification and grading device according to a fourth embodiment of the invention;
FIG. 7 is a schematic structural diagram of an electronic device implementing a data classification and grading method according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
FIG. 1 is a flowchart of a data classification and grading method according to the first embodiment of the present invention. This embodiment is applicable to classifying and grading the service data of a service system. The method can be executed by a data classification and grading device, which can be implemented in hardware and/or software and can be configured in electronic equipment that carries the data classification and grading function.
Referring to FIG. 1, the data classification and grading method includes:
S101, acquiring a field name and corresponding field values of at least one field in the data set to be classified and graded.
A service system holds massive service data, and different service data must be processed in different ways. The service data therefore needs to be classified and graded to determine its processing mode. The data set to be classified and graded may be any data set in the service system that needs classification and grading. By way of example, it may take the form of a database table, an Excel table, text, or the like, without limitation. Optionally, the data set to be classified and graded may include at least one field. A field may contain a field name and at least one corresponding field value. The field name identifies field values that belong to the same attribute. Because service systems are configured differently, the field names for the same attribute may differ between service systems. Illustratively, the attribute of the field values may be industry information: in service system 1 the corresponding field name may be "hangye"; in service system 2 it may be "HY"; in service system 3 it may be "Industry"; and in service system 4 it may be "02". A field value is the specific content of the service data for that field. Continuing the example, the field values may be "energy", "dining", "service", "mining", and so on.
Specifically, the service system can be queried to obtain the field names and corresponding field values of the data set to be classified and graded in the service system.
S102, when the data set to be classified and graded is a content data set, acquiring a field description prompt word for a single field and, according to the field description prompt word, using a large language model to semantically understand the field values to obtain the field description of the field values.
In terms of actual content, the data set to be classified and graded may be a content data set or a metadata set. The two differ in the actual content they contain. A content data set records the service data of the service system and includes field names and corresponding field values. A metadata set describes the service data of the service system; it contains field names, but the corresponding field values are null. The field values corresponding to a single field are the field values that share the same field name. Optionally, in the content data set, the field values corresponding to a single field may be the data in the same row or the same column. The field description is the attribute of the field values that share a field name. Continuing the earlier example, the field description may be "industry information".
The large language model can be used to semantically understand the content of the field values and infer the attribute they correspond to. After training on massive natural-language text of varying form, a large language model can capture the internal regularities of natural language, distinguish the meanings of words, and generate logical, human-readable dialogue. It can convert natural language tasks into dialogue tasks and can handle many task types with a single model structure combined with different fine-tuning parameters; it also shows remarkable generalization and emergent capabilities. The field description prompt word prompts the large language model to infer the field description from the content of the field values. Optionally, the field description prompt word may include the text to be described (that is, the field values corresponding to a single field) and the description requirements. By way of example, a field description prompt word may be "please infer the attributes, types, or uses of the following text: {Text}", where Text is the text to be summarized, namely the field values corresponding to a single field. Optionally, the field description prompt word may be generated and stored in advance by a technician and retrieved when data classification and grading is performed.
Specifically, the actual content of the data set to be classified and graded may be inspected, and when the data set is found to be a content data set, the pre-generated field description prompt word may be acquired for a single field. Guided by the field description prompt word, the large language model semantically understands the field values corresponding to the single field and infers their field description.
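The prompt-filling step can be sketched as follows. This is a minimal illustration, not the patented implementation: the template text is taken from the example above, while `build_field_description_prompt`, `describe_field` and the `fake_llm` stand-in are hypothetical names, and a real deployment would replace `fake_llm` with a call to an actual large language model.

```python
from typing import Callable, Sequence

# Template wording follows the example prompt in the description;
# {text} plays the role of the {Text} placeholder.
FIELD_DESCRIPTION_PROMPT = (
    "Please infer the attributes, types, or uses of the following text: {text}"
)


def build_field_description_prompt(field_values: Sequence[str]) -> str:
    """Fill the template with the field values of one field."""
    text = ", ".join(str(value) for value in field_values)
    return FIELD_DESCRIPTION_PROMPT.format(text=text)


def describe_field(field_values: Sequence[str], llm: Callable[[str], str]) -> str:
    """Ask an LLM callable (assumed interface: prompt in, text out) for a description."""
    return llm(build_field_description_prompt(field_values)).strip()


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; always gives the same answer.
    return " industry information "


values = ["energy", "dining", "service", "mining"]
prompt = build_field_description_prompt(values)
description = describe_field(values, fake_llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular model API.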
Optionally, the large language model may be an autoregressive generative language model based on the Transformer network. It infers the probability distribution of the next word from a piece of text and determines the next word from that distribution.
Illustratively, the input text passes through the tokenizer of the large language model to obtain its word ID sequence. The word ID sequence passes through the embedding layer of the large language model to obtain the word embedding sequence. The word embedding sequence passes through all Transformer blocks of the model in turn to obtain the hidden vector of the next word. The hidden vector of the next word passes through the output layer to obtain the probability distribution of the next word. The index with the highest probability is selected as the ID of the next word.
For example, the calculation flow of the large language model may be characterized by the following formulas:
$$
\begin{aligned}
I_{in} &= \mathrm{Tokenizer}(T_{in}) \\
E_{in} &= \mathrm{EmbLayer}(I_{in}) \\
H_{out} &= \mathrm{TFBlock}^{\times N_{Layers}}(E_{in}) \\
P_{out} &= \mathrm{OutLayer}(H_{out}) \\
I_{out} &= \mathrm{argmax}(P_{out})
\end{aligned}
$$
wherein $T_{in}$ is the input text; Tokenizer is the tokenizer; $I_{in}$ is the word ID sequence; EmbLayer is the embedding layer function of the large language model; $E_{in}$ is the word embedding sequence; $\mathrm{TFBlock}^{\times N_{Layers}}$ denotes the $N_{Layers}$ stacked Transformer blocks of the large language model; $H_{out}$ is the hidden vector of the next word; OutLayer is the output layer of the large language model; $P_{out}$ is the probability distribution of the next word; argmax is the function that selects the index with the largest probability; and $I_{out}$ is the ID of the next word.
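The flow from input text to next-word ID (tokenize, embed, pass through the stacked blocks, apply the output layer, take the argmax) can be traced with a toy sketch. Everything here is an assumption made for illustration: the four-word vocabulary, the one-hot embedding, the running-mean `tf_block` (which is not real attention) and the hand-picked `W_OUT` weights are hypothetical and only mimic the shape of the computation.

```python
import math
from typing import List

VOCAB = ["<unk>", "industry", "information", "energy"]  # hypothetical toy vocabulary
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def tokenizer(text: str) -> List[int]:
    """T_in -> I_in: whitespace tokenizer, unknown words map to ID 0."""
    return [TOKEN_TO_ID.get(token, 0) for token in text.lower().split()]


def emb_layer(ids: List[int]) -> List[List[float]]:
    """I_in -> E_in: one-hot embedding, one vector per word ID."""
    return [[1.0 if k == i else 0.0 for k in range(len(VOCAB))] for i in ids]


def tf_block(seq: List[List[float]]) -> List[List[float]]:
    """Toy stand-in for one block: each position becomes the mean of all
    positions up to and including itself (causal mixing, not real attention)."""
    out, running = [], [0.0] * len(VOCAB)
    for n, vec in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / n for r in running])
    return out


# Hypothetical output weights: row = hidden dimension, column = vocabulary ID.
W_OUT = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],  # "industry" favours "information"
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],  # "energy" favours "industry"
]


def out_layer(h: List[float]) -> List[float]:
    """H_out -> P_out: linear projection followed by softmax."""
    logits = [sum(h[d] * W_OUT[d][v] for d in range(len(h))) for v in range(len(VOCAB))]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_id(text: str, n_layers: int = 2) -> int:
    """Full flow: Tokenizer -> EmbLayer -> TFBlock x N -> OutLayer -> argmax."""
    ids = tokenizer(text)
    if not ids:
        raise ValueError("input text must contain at least one word")
    hidden = emb_layer(ids)
    for _ in range(n_layers):
        hidden = tf_block(hidden)
    p_out = out_layer(hidden[-1])  # hidden vector at the last position
    return max(range(len(p_out)), key=p_out.__getitem__)


next_word = VOCAB[next_token_id("energy industry")]
```

Repeating `next_token_id` on the text extended with each chosen word would give greedy autoregressive generation.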
In an alternative embodiment of the present invention, for a single field, a field description hint is obtained, and according to the field description hint, a large language model is used to perform semantic understanding on a field value to obtain a field description of the field value, including: sampling field values contained in the content data set aiming at a single field to obtain a field value sampling result corresponding to the field; and aiming at a single field, acquiring a field description prompt word, and carrying out semantic understanding on a field value sampling result by adopting a large language model according to the field description prompt word to obtain the field description of the field value.
The field value sampling result may be a sampling result of a field value corresponding to a single field. The business system contains a huge amount of business data. The number of field values corresponding to a single field in the content data set is very large. Sampling the field values corresponding to the individual fields can reduce the amount of input data for a large language model.
Specifically, for a single field, the field values contained in the content data set may be sampled, and a preset number of field values selected as the field value sampling result for the field. The pre-generated field description prompt word may be acquired for the single field. Guided by the field description prompt word, the large language model semantically understands the field value sampling result and infers the field description of the field values. The preset sampling number is the number of field values to sample. Optionally, it may be set and adjusted empirically by a technician; for example, it may be 5 to 10.
Sampling the field values in advance and having the large language model semantically understand only the sampling result reduces the number of field values fed to the large language model and improves its data processing efficiency. It also avoids supplying more field values than the large language model can process, which improves the fault tolerance of the data classification and grading method.
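A minimal sketch of the sampling step follows. The function name `sample_field_values`, the default of 10 and the choice to drop empty values and duplicates before sampling are assumptions for illustration; the description only specifies that a preset number of field values (for example 5 to 10) is selected.

```python
import random
from typing import List, Optional, Sequence


def sample_field_values(
    values: Sequence[Optional[str]],
    sample_size: int = 10,
    seed: Optional[int] = None,
) -> List[str]:
    """Return at most `sample_size` field values for one field.

    Assumption: empty values are skipped and duplicates are removed first,
    so the sample shows the model as much variety as possible.
    """
    distinct = list(dict.fromkeys(v for v in values if v not in (None, "")))
    if len(distinct) <= sample_size:
        return distinct
    # A seeded generator makes the sample reproducible when a seed is given.
    return random.Random(seed).sample(distinct, sample_size)


column = ["energy"] * 1000 + ["dining", "service", "mining", ""]
sampled = sample_field_values(column, sample_size=5)
```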
In an optional embodiment of the present invention, before acquiring the rule category, rule level and field category of at least one rule in the rule table, the method further includes: when the data set to be classified and graded is a metadata set, acquiring a field name and a corresponding field description for a single field; and, for the single field, splicing the field name, the field description and the field value to obtain a field splicing result of the field, wherein the field value is null.
The metadata set may be used to describe the business data of the business system. The metadata set contains a field name and a corresponding field description, but the corresponding field value is null. The metadata set has field descriptions corresponding to the fields, and can be directly used without reasoning the field descriptions.
Specifically, when the data set to be classified and graded is found to be a metadata set, the field name and corresponding field description may be acquired for a single field. For the single field, the field name, field description and field value can be spliced with a content separator to obtain the field splicing result of the field.
In this scheme, when the data set to be classified and graded is a metadata set, the field names and corresponding field descriptions are acquired directly, and the field name, field description and field value of each single field are spliced to obtain the field splicing results of the fields in the metadata set. This enables classification and grading of metadata sets and improves its efficiency and accuracy. In addition, the field splicing result of a metadata set is generated more efficiently than that of a content data set, which improves the flexibility of data classification and grading.
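The splicing step can be sketched as follows. The separator `" | "` and the function name `splice_field` are hypothetical; the description only says that a content separator joins the field name, field description and field value, and that the field value is null for a metadata set.

```python
SEPARATOR = " | "  # hypothetical content separator; the description does not fix one


def splice_field(name: str, description: str, value: str = "") -> str:
    """Join field name, field description and field value into one string.

    For a metadata set the field value is null, so it is passed as an empty string.
    """
    return SEPARATOR.join([name, description, value])


# Metadata set: the description is already available and the value is empty.
metadata_splice = splice_field("HY", "industry information")

# Content data set: the same helper also accepts sampled field values.
content_splice = splice_field("HY", "industry information", "energy, dining")
```

Keeping the empty value slot in the metadata case gives both kinds of data set the same three-part layout.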
S103, acquiring the rule category, rule level and field category of at least one rule in the rule table, wherein the rule is used for determining the processing mode of the data set to be classified and graded.
The rule table serves as the basis for data classification and grading. The rule table may come from a national standard file or an industry standard file. Optionally, different industries may have their own rule tables. Optionally, the rule table may be pre-stored in a database. The rule table may include at least one rule. A rule may be used to determine the processing mode of the data set to be classified and graded. One rule may include a rule category, a rule level, and a field category. The rule category characterizes a category into which the service data of a specific industry is divided; in essence, rule categories are divided according to national standard files or industry standard files. The rule level characterizes the importance of the rule, its security and privacy level, or its degree of impact on the service system. One rule category may correspond to one rule level. Service data with different rule levels is processed differently. Illustratively, the rule level may be "1", "2", or "3". The smaller the value of the rule level, the higher its security and privacy requirement. For example, service data with rule level 3 needs no additional processing; service data with rule level 2 needs to be desensitized; and service data with rule level 1 needs to be desensitized and encrypted. The field category is a further, example-based description of the rule category. Illustratively, the rule category may be "authentication assistance information"; the rule level may be 3; and the field categories may be "dynamic password", "verification code", "password prompt question and answer", "voiceprint password", and the like.
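One way to hold a rule and its processing mode in memory is sketched below. The `Rule` dataclass, the `HANDLING_BY_LEVEL` mapping and the step names are illustrative assumptions; only the example rule and the level-to-processing correspondence (level 3: nothing extra, level 2: desensitize, level 1: desensitize and encrypt) come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rule:
    """One row of the rule table (hypothetical in-memory layout)."""
    rule_category: str
    rule_level: int
    field_categories: Tuple[str, ...]


# Processing steps per rule level, following the example in the description.
HANDLING_BY_LEVEL: Dict[int, Tuple[str, ...]] = {
    1: ("desensitize", "encrypt"),
    2: ("desensitize",),
    3: (),
}


def handling_for(rule: Rule) -> Tuple[str, ...]:
    """Look up the processing steps implied by the rule level."""
    return HANDLING_BY_LEVEL[rule.rule_level]


example_rule = Rule(
    rule_category="authentication assistance information",
    rule_level=3,
    field_categories=(
        "dynamic password",
        "verification code",
        "password prompt question and answer",
        "voiceprint password",
    ),
)
steps = handling_for(example_rule)
```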
Specifically, the database may be queried to obtain a rule category, a rule level, and a field category of at least one rule in the rule table.
S104, for a single field, calculating the similarity between the field and each rule according to the field name, field description and corresponding field values of the field and the rule category and field category of each rule, obtaining at least one rule with higher similarity for the field, and determining the data category and data level of the data set to be classified and graded.
The data category is the category to which the data set to be classified and graded belongs. The data level is the level to which it belongs, and characterizes the importance of the data set, its security and privacy level, or its degree of impact on the service system. Service data is processed differently at different data levels.
Specifically, for a single field, the field name, field description and corresponding field values can be vectorized to obtain the field vector of the field. The rule category and field category of each rule can be vectorized to obtain the rule vector of each rule. The similarity between the field vector and each rule vector can then be calculated, the similarities sorted, and at least one top-ranked rule selected as the at least one rule with higher similarity for the field. Optionally, the similarity between the field and each rule can be calculated using Euclidean distance, Manhattan distance, cosine distance, or the like.
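The three distance measures named above can be written out as follows, assuming the field and rule vectors are plain lists of floats of equal length. The function names and the choice to return a cosine distance of 1.0 for a zero vector are assumptions for this sketch. For all three, a smaller distance means a higher similarity.

```python
import math
from typing import Sequence


def _check(u: Sequence[float], v: Sequence[float]) -> None:
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    _check(u, v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    _check(u, v)
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine of the angle between the vectors."""
    _check(u, v)
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm
```

Cosine distance ignores vector length, which may suit text vectors whose magnitude varies with text length; Euclidean and Manhattan distance are sensitive to magnitude.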
By way of example, the following formula may be used to calculate the field vector:
$$V_{field} = \mathrm{Vectorization}(F[name] + F[desc])$$
wherein $V_{field}$ is the field vector; Vectorization is the function that vectorizes the field name, field description and corresponding field values of a field; $F[name] + F[desc]$ is the combined content of the field name and field description; $F$ is a single field, which contains a field name and a field description and may contain field values; $F[name]$ is the field name; and $F[desc]$ is the field description.
The rule vector may be calculated using the following formula:
$$V_{spec} = \mathrm{Vectorization}(T[field] + T[cate])$$
wherein $V_{spec}$ is the rule vector; Vectorization is the function that vectorizes the rule category and field category of a rule; $T[field] + T[cate]$ is the combined content of the rule category and field category; $T$ is a single rule, which contains a rule category and a field category; $T[field]$ is the rule category; and $T[cate]$ is the field category.
The similarity between the field vector and the rule vector may be calculated using the following formula:
$$S_{i,j} = \mathrm{CosSim}(V_{field,i}, V_{spec,j})$$
wherein $S_{i,j}$ is the cosine similarity between the $i$-th field vector and the $j$-th rule vector; CosSim is the cosine similarity function; $V_{field,i}$ is the $i$-th field vector; and $V_{spec,j}$ is the $j$-th rule vector.
The following formula may be used to obtain the K rules with higher similarity for a field:
$$C_i = \mathrm{TOPK}_j(S_{i,j})$$
wherein $C_i$ is the set of K rules with higher similarity for the $i$-th field; $\mathrm{TOPK}_j$ is the function that selects the K rules with higher similarity; and $S_{i,j}$ is the cosine similarity between the $i$-th field vector and the $j$-th rule vector.
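The chain from field and rule text to the top-K rules can be sketched end to end. The bag-of-words `vectorization` below is only a stand-in chosen to keep the example self-contained; the description does not say which vectorization function is used, and an embedding model would be a natural substitute. The dictionary keys mirror the notation F[name], F[desc], T[field] and T[cate]; the second rule in the example table is invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def vectorization(text: str) -> Counter:
    """Stand-in vectorizer (assumption): bag of lowercase words."""
    return Counter(text.lower().split())


def cos_sim(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def top_k_rules(
    field: Dict[str, str], rules: List[Dict[str, str]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (rule index j, similarity S_ij) for the K most similar rules."""
    v_field = vectorization(field["name"] + " " + field["desc"])
    scored = []
    for j, rule in enumerate(rules):
        v_spec = vectorization(rule["field"] + " " + rule["cate"])
        scored.append((j, cos_sim(v_field, v_spec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


field = {"name": "HY", "desc": "industry information"}
rules = [
    {"field": "authentication assistance information",
     "cate": "dynamic password verification code"},
    # Hypothetical second rule, invented for this example.
    {"field": "organization basic information", "cate": "industry category"},
]
best = top_k_rules(field, rules, k=1)
```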
Specifically, rule category and rule level of at least one rule with higher similarity corresponding to each field can be obtained. Each rule category may be determined directly as a data category of the hierarchical data set to be classified. The rule level corresponding to each rule category can be directly determined as the data level of the hierarchical data set to be classified. Therefore, the field values corresponding to the fields in the classified data set can be processed according to the data levels corresponding to the fields in the classified data set.
Optionally, the proportions of the rule categories and rule levels matched across the classified data set may be further compared, and the rule category with the largest proportion, together with its corresponding rule level, is determined as the data category and data level of the classified data set. The classified data set can then be processed according to the processing mode corresponding to its data level.
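As a rough illustration of selecting the rule category and rule level with the largest proportion, the following sketch counts the (category, level) pairs matched by the fields of one data set. The function name `dominant_category_and_level` and the level labels are hypothetical, and how ties should be resolved is not stated in the text.

```python
from collections import Counter

def dominant_category_and_level(matched_rules):
    """matched_rules: list of (rule_category, rule_level) pairs, one per matched field.

    Returns the most frequent pair and its share of all matches.
    """
    if not matched_rules:
        raise ValueError("no matched rules")
    counts = Counter(matched_rules)
    (category, level), n = counts.most_common(1)[0]
    return category, level, n / len(matched_rules)

# Hypothetical matches for a three-field data set.
matches = [
    ("authentication auxiliary information", "level 3"),
    ("authentication auxiliary information", "level 3"),
    ("industry information", "level 1"),
]
category, level, share = dominant_category_and_level(matches)
print(category, level, round(share, 2))
```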
In an optional embodiment of the present invention, for a single field, calculating, according to a field name, a field description, a corresponding field value, a rule category and a field category of each rule of the field, a similarity between the field and each rule, to obtain at least one rule with a higher similarity corresponding to the field, includes: aiming at a single field, splicing the field name, the field description and the field value to obtain a field splicing result of the field; aiming at a single field, acquiring a field classification prompt word, and carrying out semantic understanding on a field splicing result by adopting a large language model according to the field classification prompt word to obtain an initial category of the single field; aiming at a single rule, splicing the rule category and the field category to obtain a rule splicing result corresponding to the rule; and calculating the similarity between the initial category and each rule splicing result aiming at the initial category of the single field to obtain at least one rule with higher similarity corresponding to the field.
The field splice result may be used to represent field content of a single field of the content data set. Wherein the field name may be used to identify a single field and a field value corresponding to the single field. The field description may be used to characterize the properties of the field values corresponding to the individual fields. The field value may be used to characterize the specific content to which a single field corresponds.
The rule splice result may be used to represent the specific content of an individual rule in the rule table. Wherein rule categories may be used to identify categories to which individual rules correspond. The field class may be a further exemplary description of a rule class.
The initial category may be the result of the large language model performing an initial classification on the hierarchical dataset to be classified. The field classification prompt word can be used for prompting the large language model to infer the initial category of a single field corresponding to the field splicing result according to the content of the field splicing result. Optionally, the field classification hint may include a field splice result and a field classification requirement. For example, a field classification hint may be "to which category the following fields belong: { field } "; where { field } is the field concatenation result. Alternatively, the field classification prompt may be pre-generated and stored by a technician and may be invoked when the data classification is performed.
Specifically, for a single field, a content separator is adopted to splice a field name, a field description and a field value, so as to obtain a field splicing result of the field. Alternatively, a single field may correspond to at least one field value, with a field value separator separating the field values further.
For example, the content separator may be ""; the field value separator may be "|"; the field name may be "HY"; the field description may be "industry information"; and the field values may include "energy", "catering", "service" and "mining". The field splicing result then consists of the three parts "field name{HY}", "field description{industry information}" and "field value{energy|catering|service|mining}", joined by the content separator.
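The splicing step can be sketched as follows. The function name `splice_field` is hypothetical, and the content separator is passed in explicitly; the line feed used in the usage line is only an assumption for illustration.

```python
def splice_field(name, desc, values, content_sep, value_sep="|"):
    """Join field name, field description and field values into one string."""
    parts = [
        "field name{" + name + "}",
        "field description{" + desc + "}",
        "field value{" + value_sep.join(values) + "}",
    ]
    return content_sep.join(parts)

# Usage with an assumed line-feed content separator.
result = splice_field(
    "HY",
    "industry information",
    ["energy", "catering", "service", "mining"],
    content_sep="\n",
)
print(result)
```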
Specifically, for a single field, a pre-generated field classification prompt may be obtained. The large language model can be guided according to the field classification prompt words, the field splicing results are classified, the category of the field splicing results is obtained, and the category is determined to be the initial category of the single field.
Specifically, for a single rule, a rule separator can be used to splice the rule category and the field category to obtain the rule splicing result corresponding to the rule. Optionally, a single rule may correspond to at least one field category, and a field category separator may be used to further separate the field categories. Optionally, the rule separator may be the same as or different from the content separator, and the field value separator and the field category separator may be the same or different. The specific choice can be set and adjusted based on the experience of the technician.
For example, the rule separator may be ""; the field category separator may be "|"; the rule category may be "authentication auxiliary information"; and the field categories may be "dynamic password", "verification code", "password hint question and answer", and "voiceprint password". The rule splicing result then consists of the two parts "rule category{authentication auxiliary information}" and "field category{dynamic password|verification code|password hint question and answer|voiceprint password}", joined by the rule separator.
Specifically, for the initial category of a single field, vectorization can be performed on the initial category to obtain a field vector corresponding to the initial category. And vectorizing each rule splicing result to obtain a rule vector corresponding to the rule splicing result. The similarity between the field vector and each rule vector can be calculated, the similarity between the single field vector and each rule vector is ordered, at least one rule with the highest order is selected, and at least one rule with higher similarity corresponding to the field is determined. Alternatively, the similarity between the field and each rule splicing result can be calculated by using the manners of euclidean distance, manhattan distance or cosine distance and the like.
By way of example, the following formula may be used to calculate the field vector:
$$V_{field} = \mathrm{Vectorization}(\mathrm{LLM}(F[name] + F[desc]))$$

where $V_{field}$ is the field vector; Vectorization is the vectorization function, applied here to the initial category; LLM is the function by which the large language model performs semantic understanding on the field splicing result of the field name $F[name]$ and the field description $F[desc]$; $\mathrm{LLM}(F[name]+F[desc])$ is the initial category of the single field; $F[name]+F[desc]$ is the field splicing result; $F$ is a single field, which contains a field name and a field description and may contain field values; $F[name]$ is the field name; and $F[desc]$ is the field description.
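The composition of the two functions can be illustrated with stand-ins. In the sketch below, `fake_llm` is a stub that returns a fixed category in place of a real large language model, and `vectorize` is a toy bag-of-words counter over a fixed vocabulary in place of a real semantic vectorization model; both are assumptions made only so that the example runs on its own.

```python
from collections import Counter

PROMPT_TEMPLATE = "to which category the following fields belong: {field}"

def vectorize(text, vocabulary):
    """Toy stand-in for a semantic vectorizer: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def field_vector(field, llm, vocabulary):
    """Compute Vectorization(LLM(F[name] + F[desc])) for one field."""
    spliced = field["name"] + " " + field["desc"]       # F[name] + F[desc]
    initial_category = llm(PROMPT_TEMPLATE.format(field=spliced))
    return vectorize(initial_category, vocabulary)

def fake_llm(prompt):
    """Stub: a real system would call a large language model here."""
    return "industry information"

vocab = ["industry", "information", "password"]
print(field_vector({"name": "HY", "desc": "industry information"}, fake_llm, vocab))
```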
According to the scheme, the field splicing result and the rule splicing result are generated, and the data classification and classification are carried out by utilizing the splicing result, so that the data classification and classification efficiency is further improved; meanwhile, the field splicing results are initially classified through the large language model, and similarity matching is carried out between the initial category and the rule splicing results, so that data classification and classification are realized, and the accuracy of data classification and classification is further improved.
In the prior art, data classification and grading generally relies on techniques such as keywords, regular expressions and data dictionaries, combined with manual verification. However, these existing approaches can only classify and grade data sets with obvious format rules, so their applicability is limited; they also depend on manual work, which makes the efficiency and accuracy of data classification and grading difficult to guarantee.
According to the technical scheme of this embodiment of the invention, the field name and corresponding field values of at least one field in the hierarchical data set to be classified are obtained. When the data set is a content data set, a field description prompt word is obtained for each single field, and a large language model performs semantic understanding on the field values according to that prompt word to obtain the field description of the field values. The rule category, rule level and field category of at least one rule in the rule table are then obtained. For each single field, the similarity between the field and each rule is calculated from the field name, field description and corresponding field values of the field and from the rule category and field category of each rule, yielding at least one rule with higher similarity for the field, from which the data category and data level of the classified data set are determined. Introducing the large language model makes it possible to identify word frequency patterns in the field values and ensures the accuracy of the determined field descriptions, which addresses the limited applicability of data classification and grading and improves the generalization of the method. Introducing the rule table enables the data set to be classified and graded and the data processing mode of the classified data set to be determined. In addition, applying the large language model to the data classification task on the basis of the rule table achieves accurate data classification, improves the efficiency and accuracy of data classification, and ensures accurate processing of service data in the service system.
Example two
Fig. 2 is a flowchart of a data classification and classification method according to a second embodiment of the present invention. On the basis of the above embodiment, this embodiment refines the process of obtaining the field description of the field value as follows: for a single field, obtaining a field summary prompt word and, according to the field summary prompt word, performing semantic understanding on the field value with a large language model to obtain the field description summary text of the field value; obtaining a keyword extraction prompt word and, according to the keyword extraction prompt word, performing semantic understanding on the field description summary text with the large language model to obtain at least one field description keyword; and determining each field description keyword as the field description of the field value. This further improves the accuracy of the field description. For parts not described in detail in this embodiment, reference may be made to the descriptions of the other embodiments.
Referring to the data classification and classification method shown in fig. 2, the method includes:
S201, acquiring a field name and a corresponding field value of at least one field in the hierarchical data set to be classified.
S202, when the hierarchical data set to be classified is a content data set, acquiring field summary prompt words for single fields, and carrying out semantic understanding on field values by adopting a large language model according to the field summary prompt words to obtain field description summary text of the field values.
The large language model can convert natural language tasks into dialogue tasks, so that a single model structure, paired with different prompt words, can handle various task types. The field description summary text may be a summary text of the attributes of the field values corresponding to the same field name. The field summary prompt word can be used to prompt the large language model to infer, from the content of the field values, the summary text of the attribute corresponding to the field values. Optionally, the field summary prompt word may include the text to be summarized (i.e., the field values corresponding to the single field) and the summary requirement. For example, a field summary prompt word may be "please infer the attributes, types, or uses of the following text contents, summarized as a sentence: {Text}", where {Text} is the text to be summarized, namely the field values corresponding to the single field. As another example, the field summary prompt word may be "please infer the attribute, type, or purpose of the following text contents, summarized as a sentence, without outputting the text contents: {Text}". Adding "please do not output the text contents" keeps the field values out of the field description summary text; otherwise the field values would be matched repeatedly, which would reduce the efficiency of similarity matching. This further improves the efficiency of data classification and grading.
Specifically, for a single field, a pre-generated field summary prompt may be obtained. The large language model can be guided according to the field summary prompt words, semantic understanding is carried out on the field values, and the field description summary text of the field values is inferred.
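A minimal sketch of filling the field summary prompt word and passing it to a model is given below. The template text follows the second example prompt above; the function names, the "|" used to join field values, and the `llm` callable (any function mapping a prompt string to a reply string) are assumptions for illustration.

```python
FIELD_SUMMARY_PROMPT = (
    "please infer the attribute, type, or purpose of the following text contents, "
    "summarized as a sentence, without outputting the text contents: {text}"
)

def build_field_summary_prompt(field_values, value_sep="|"):
    """Insert the field values of one field into the stored prompt template."""
    return FIELD_SUMMARY_PROMPT.format(text=value_sep.join(field_values))

def summarize_field(field_values, llm):
    """Ask the model for a one-sentence summary of what the field values are."""
    return llm(build_field_summary_prompt(field_values)).strip()

# Usage with a stub in place of a real large language model.
stub_llm = lambda prompt: " These values are names of industries. "
print(summarize_field(["energy", "catering", "service", "mining"], stub_llm))
```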
S203, for a single field, obtaining a keyword extraction prompt word and, according to the keyword extraction prompt word, performing semantic understanding on the field description summary text with the large language model to obtain at least one field description keyword of the field description summary text.
The field description keywords may be keywords extracted from the field description summary text, and may be used to characterize the attribute of the field value. The keyword extraction prompt word can be used to prompt the large language model to infer, from the content of the field description summary text, the attribute corresponding to the field value, so as to obtain the field description keywords. Optionally, the at least one field description keyword of the field description summary text may take the form of a keyword list. Optionally, the keyword extraction prompt word may include an example text, a keyword list of the example text, the field description summary text, and a keyword extraction requirement. For example, the keyword extraction prompt word may be "Suppose you are a data engineer; please refer to the example and generate keywords from the text, each keyword being no more than four words long. Example: Text: {Eg_text} Keyword: {Eg_Kw} Text: {Text} Keyword:", where {Eg_text} is any manually selected example text; {Eg_Kw} is the manually annotated keyword list of the example text, in which the keywords may be separated by an enumeration comma (、); and {Text} is the text from which keywords are to be extracted, i.e., the field description summary text. Optionally, the field summary prompt word and the keyword extraction prompt word may be pre-generated and stored by a technician, and invoked when performing data classification and grading.
Specifically, for a single field, a pre-generated keyword extraction prompt word may be obtained. The large language model is then guided by the keyword extraction prompt word to perform semantic understanding on the field description summary text and to extract at least one field description keyword from it.
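The one-shot keyword extraction prompt and the handling of the returned keyword list might look like the following sketch. The line breaks inside the template, the enumeration comma "、" as the keyword separator, and the function names are assumptions; the example text and keywords are invented for illustration.

```python
KEYWORD_PROMPT = (
    "Suppose you are a data engineer; please refer to the example and generate "
    "keywords from the text, each keyword being no more than four words long.\n"
    "Example:\nText: {eg_text}\nKeyword: {eg_kw}\n\nText: {text}\nKeyword:"
)

def build_keyword_prompt(summary_text, eg_text, eg_keywords, kw_sep="、"):
    """Fill the one-shot template with the example pair and the text to process."""
    return KEYWORD_PROMPT.format(
        eg_text=eg_text, eg_kw=kw_sep.join(eg_keywords), text=summary_text
    )

def parse_keywords(llm_output, kw_sep="、"):
    """Split the model reply into a keyword list, dropping empty entries."""
    return [kw.strip() for kw in llm_output.split(kw_sep) if kw.strip()]

prompt = build_keyword_prompt(
    "These values are names of industries.",
    "These values are mobile phone numbers.",   # hypothetical example text
    ["phone number", "contact"],                # hypothetical example keywords
)
print(prompt)
print(parse_keywords("industry、 sector、"))
```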
S204, determining each field description keyword as field description of a field value.
Specifically, each field description keyword may be determined as a field description of a field value.
S205, obtaining rule category, rule level and field category of at least one rule in the rule table; wherein the rules are used for determining the processing mode of the classified data set.
S206, aiming at a single field, calculating the similarity between the field and each rule according to the field name, the field description, the corresponding field value, the rule category and the field category of each rule of the field to obtain at least one rule with higher similarity corresponding to the field, and determining the data category and the data level of the classified data set.
According to the technical scheme, the field summary prompt words and the keyword extraction prompt words are introduced, the field description summary is firstly carried out on the field values, then the keyword extraction is carried out on the field description summary text, and the accuracy of the field description is further improved by utilizing the reasoning capability of the large language model.
Optionally, the embodiment of the invention also provides a data classification grading system. The data classification and grading system comprises a data management module, a prompt word management module, a model management module, a classification and grading module (namely the device), a data generation module and a model training module.
The data management module is used for storing service data, which serves as the hierarchical data set to be classified or as training samples of the large language model. The service data may be data imported by a user of the service system.
The prompt word management module is used for storing the prompt words of the large language model, which guide the large language model to execute different natural language tasks. The prompt words may be expert experience imported by a user of the service system. For example, the prompt words of the large language model may include field description prompt words, field summary prompt words, keyword extraction prompt words, field classification prompt words, field value generation prompt words, and the like.
The model management module is used for storing the structure definitions and parameters of the large language model used in classification and grading.
The industry rule management module is used for storing the basis of data classification and grading, namely the rule table. The rule table may include three columns: rule category, rule description, and rule level. The rule table comes from a national standard file or an industry standard file. Optionally, different industries have their own rule tables.
The classification and grading module is used for calling the above modules to implement the whole flow of the data classification and grading task. The classification and grading module comprises a content summarization sub-module, a keyword extraction sub-module, a vector extraction sub-module and a recall ordering calculation sub-module.
The content summarization sub-module is used for calling the large language model in the model management module and the field summarization prompt words in the prompt word management module, summarizing the categories or rules of the field values, and outputting a sentence as a field description summarization text of the field values. For example, the field summary prompt may be "please infer the type and purpose of the text, summarize into a sentence, and do not output text content: { text } "; wherein, { text } is the text to be summarized, i.e. the field value corresponding to a single field.
The keyword extraction sub-module is used for calling the large language model in the model management module and the keyword extraction prompt word in the prompt word management module, extracting the keywords in the input text (the field description summary text), and outputting a keyword list (namely at least one field description keyword). For example, the keyword extraction prompt word may be "Suppose you are a data engineer; please refer to the example and generate keywords from the text, each keyword being no more than four words long. Example: Text: {eg_text} Keyword: {eg_kw} Text: {text} Keyword:", where {eg_text} is any manually selected example text; {eg_kw} is the manually annotated keyword list of the example text, in which the keywords may be separated by an enumeration comma (、); and {text} is the text from which keywords are to be extracted, i.e., the field description summary text.
The vector extraction sub-module is used for converting the input text (namely the field splicing result or the initial category, and the rule splicing result) into a semantic vector. The resulting vectors support the cosine measure: the cosine value between two vectors can be calculated and taken as the similarity between the two texts.
The recall ordering calculation sub-module is used for receiving two groups of vectors, namely the field vectors and the rule vectors, and calculating the cosine similarity between each field vector and all the rule vectors. The sub-module then filters out the rule vectors whose similarity is lower than a preset threshold and sorts the remaining rule vectors to obtain the several rule vectors with the highest similarity.
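The filter-then-sort behaviour of this sub-module can be sketched as below, taking the already computed similarities of one field against all rules as input. The function name `recall_and_rank`, the threshold value in the usage line, and the tie-break by rule index are assumptions.

```python
def recall_and_rank(similarities, threshold, k):
    """similarities[j] is the cosine similarity between one field and rule j.

    Drops rules below the threshold, then returns the indices of up to k
    remaining rules in descending order of similarity.
    """
    kept = [(s, j) for j, s in enumerate(similarities) if s >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index (assumption)
    return [j for _, j in kept[:k]]

# Hypothetical similarities of one field against four rules.
print(recall_and_rank([0.2, 0.9, 0.5, 0.7], threshold=0.4, k=2))  # prints [1, 3]
```

Filtering before sorting means a field may end up with fewer than k rules, or none at all, when no rule is similar enough.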
Fig. 3 is a flowchart of a data classification and classification method according to an embodiment of the present invention. Based on the above embodiment, fig. 3 is a preferred embodiment of the present invention.
Referring to the data classification and classification method shown in fig. 3, the method includes:
S301, extracting a hierarchical data set to be classified from the data management module; wherein the hierarchical data set to be classified comprises a content data set or a metadata set.
Wherein the content data set may be defined as a table in a structured database. Each column of the content data set is a respective field value belonging to the same field name; the contents of each row in the same column except for the field name are one field value. The metadata set may be defined as metadata of the content data set. The metadata set may contain at least two columns, namely a field name column and a field description column; wherein, the field name column is used for storing the field name; the field description column is used to store the field description. The single metadata includes a field name and a field description.
S302, when the hierarchical data set to be classified is a content data set, sampling the field values of a single field in the content data set to obtain at least one field value sampling result; inputting the single field value sampling result into the content summarization sub-module to obtain the field description summary text of the single field; and inputting the field description summary text into the keyword extraction sub-module to obtain at least one field description keyword of the field description summary text, which is determined as the field description of the field value.
S303, aiming at a single field, splicing the field name, the field description and the field value sampling result to obtain a field splicing result.
For example, the field splicing result may be obtained by splicing the three parts "name: {name}", "description: {desc}" and "content: {val}" in that order, where {name} is the field name; {desc} is the field description; and {val} is the field value sampling result. If the hierarchical data set to be classified is a metadata set, the field value sampling result is empty.
S304, obtaining a field classification prompt word, and inputting the field splicing result into the large language model according to the field classification prompt word to obtain the initial category of the field splicing result.
For example, the field classification prompt word may be "to which category the following fields belong: {field}", where {field} is the field splicing result.
S305, inputting the initial categories of the fields into a vector extraction submodule respectively to obtain field vectors of each field.
S306, extracting the rule table from the industry rule management module; wherein the rule table includes the rule category, rule description and rule level of at least one rule.
Optionally, the field categories may be parsed from the rule description of each rule, yielding at least one rule triplet for each rule, namely rule category, rule level and field category.
S307, splicing the rule category and the field category in the rule triplet of each rule to obtain a rule splicing result; and inputting the rule splicing result into a vector extraction submodule to obtain a rule vector of each rule.
S308, inputting field vectors of all fields and rule vectors of all rules into a recall ordering calculation submodule to obtain a plurality of rules which are matched with each field.
S309, obtaining the data category and the data level of the classified data set to be classified according to the rule category and the rule level of the most matched rules.
The scheme improves generalization, efficiency and accuracy of the data classification and classification method, and ensures accurate processing of service data in a service system.
Example III
Fig. 4 is a flowchart of a data classification and classification method according to a third embodiment of the present invention. On the basis of the above embodiments, this embodiment adds the process of generating training samples for the large language model, which specifically comprises: obtaining the field category of at least one rule in the rule table; for a single field category, obtaining a field value generation prompt word and, according to the field value generation prompt word, performing semantic understanding on the field category with a large language model to obtain a preset number of field values corresponding to the single field category; generating a field name corresponding to the field category; and generating training samples of the large language model from the field names and their corresponding preset number of field values. This avoids the shortage of training samples caused by the privacy of the service system: the training samples are expanded from the field categories in the public rule table, which takes into account both the information security of the service system and the accuracy of training the large language model. For parts not described in detail in this embodiment, reference may be made to the descriptions of the other embodiments.
Referring to the data classification and classification method shown in fig. 4, the method includes:
S401, obtaining a field category of at least one rule in the rule table.
The rule table can be used as the basis for classifying and grading data. The rule table may come from a national standard file or an industry standard file. Optionally, different industries may have their own rule tables. Optionally, the rule table may be pre-stored in the database. The rule table may include at least one rule. One rule may include a rule category, a rule level, and a field category. The field category may be a further exemplary description of a rule category.
Specifically, the database may be queried to directly obtain a field class of at least one rule in the rule table.
In an alternative embodiment of the present invention, obtaining a field class of at least one rule in a rule table includes: acquiring rule description of at least one rule in a rule table; splitting according to a preset field category identifier and a preset separator aiming at a single rule description to obtain at least one field category corresponding to the rule description; and determining the field category of at least one rule in the rule table according to the field category corresponding to each rule description.
Optionally, the rule table may also include rule categories, rule levels, and rule descriptions. The rule description may be used to further describe the rule category. For example, the rule description may include an explanation of the rule category and an exemplary description. The preset field category identifier may be a preset marker of where the field categories are located in the rule description. For example, the preset field category identifier may be "such as … …, etc.". The preset separator may be a preset separator between the field categories in the rule description. For example, the preset separator may be ",".
Specifically, the database may be queried to obtain a rule description of at least one rule in the rule table. The content location of the field category in the rule description may be determined for a single rule description according to a preset field category identification. And splitting the content contained in the content position of the field category in the rule description according to the preset separator to obtain at least one field category corresponding to the rule description. And determining the field category corresponding to each rule description as the field category of at least one rule in the rule table.
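The splitting of a rule description can be sketched with a regular expression. The sketch assumes an English rendering of the identifier ("such as" ... "etc.") and a comma as the preset separator; the function name `extract_field_categories` and the sample rule description are hypothetical.

```python
import re

def extract_field_categories(rule_description, start_marker="such as",
                             end_marker="etc.", separator=","):
    """Locate the span between the markers and split it into field categories."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, rule_description, flags=re.DOTALL)
    if match is None:
        return []  # the description contains no field category span
    return [part.strip() for part in match.group(1).split(separator) if part.strip()]

# Hypothetical rule description.
description = ("Information used to assist authentication, such as dynamic password, "
               "verification code, voiceprint password, etc.")
print(extract_field_categories(description))
```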
According to the method and the device, the rule description is split through the preset field category identification and the preset separator, so that the field category of the rule table is obtained, the relevance between the field category and the rule category is further improved, and the comprehensiveness and the accuracy of the determined field category are improved.
S402, for a single field category, obtaining a field value generation prompt word and, according to the field value generation prompt word, performing semantic understanding on the field category with a large language model to obtain a preset number of field values corresponding to the single field category.
The field value generation prompt word can be used to prompt the large language model to expand a field category according to its content, generating a plurality of field values corresponding to the field category. Optionally, the field value generation prompt word may include the field category and the preset number. For example, the field value generation prompt word may be "please generate 10: {fields}", where {fields} is the field category. Optionally, the field value generation prompt word may further include the rule category, at least one field category corresponding to the rule category, a field value line feed separator, and the preset number. For example, the field value generation prompt word may be "please generate 10 {cate}, including {fields}. Format: Sequence number: {fmt}", where {cate} is the rule category; {fields} is the field category; and {fmt} is the field value line feed separator, for example "line feed + rule category:". Optionally, the field value generation prompt word may be pre-generated and stored by a technician, and invoked when field values are generated.
Specifically, for a single field category, a pre-generated field value generation prompt word may be obtained. The large language model can be guided by the field value generation prompt word to perform semantic understanding on the field category and to expand it into a preset number of field values corresponding to the single field category.
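A simple sketch of this step builds the shorter of the two example prompts and trims the model reply to the preset number. It assumes the model returns one value per line, which the text does not specify; the function names and the sample reply are hypothetical.

```python
def build_value_generation_prompt(field_category, preset_number):
    """Fill the short form of the field value generation prompt word."""
    return "please generate {n}: {fields}".format(n=preset_number, fields=field_category)

def parse_generated_values(llm_output, preset_number):
    """Assume one generated value per line; keep at most preset_number of them."""
    values = [line.strip() for line in llm_output.splitlines() if line.strip()]
    return values[:preset_number]

print(build_value_generation_prompt("dynamic password", 10))
# Hypothetical model reply with a blank line and stray spaces.
print(parse_generated_values("483920\n\n 771204 \n905113\n", 2))
```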
S403, generating a field name corresponding to the field category.
The field name may be used to identify field values that belong to the same attribute. Because of the different configuration modes of different service systems, field names belonging to the same attribute may be different in different service systems.
Specifically, a rule category or rule level corresponding to a field category in the rule table may be obtained, and determined as a field name corresponding to the field category.
Optionally, the field name corresponding to the configuration mode of any service system may be randomly generated based on the configuration mode of each service system.
S404, generating training samples of the large language model according to the field names and the corresponding preset number of field values, so as to pre-train the large language model through the training samples.
The training samples may be used to train a large language model. The training samples may be content data set samples. Optionally, the training samples may include at least one field name and corresponding field value.
Specifically, each field name may be used as a first row (or a first column) of the training sample, and a preset number of field values corresponding to the field names may be used as contents of the same column (or the same row) to which the field names belong, so as to generate the training sample of the large language model. The large language model may be trained by training samples.
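The column-wise layout of a training sample can be sketched as follows, with field names in the first row and the generated values beneath them. Padding shorter columns with empty strings is an assumption, and the field names and values are invented for illustration.

```python
def build_training_sample(columns):
    """columns: dict mapping field name -> list of generated field values.

    Returns a table as a list of rows: the first row holds the field names,
    and each later row holds one value per field.
    """
    names = list(columns)
    n_rows = max((len(values) for values in columns.values()), default=0)
    rows = [names]
    for i in range(n_rows):
        # Pad columns that have fewer values with empty strings (assumption).
        rows.append([columns[n][i] if i < len(columns[n]) else "" for n in names])
    return rows

sample = build_training_sample({
    "HY": ["energy", "mining"],   # hypothetical field name and values
    "YZM": ["483920"],
})
for row in sample:
    print(row)
```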
S405, acquiring a field name and a corresponding field value of at least one field in the hierarchical data set to be classified.
S406, when the hierarchical data set to be classified is a content data set, a field description prompt word is obtained for a single field, and according to the field description prompt word, a large language model is adopted to carry out semantic understanding on the field value, so as to obtain the field description of the field value.
S407, obtaining rule category, rule level and field category of at least one rule in the rule table; wherein the rules are used for determining the processing mode of the classified data set.
S408, for a single field, calculating the similarity between the field and each rule according to the field name, field description, and corresponding field value of the field and the rule category and field category of each rule, obtaining at least one rule with higher similarity corresponding to the field, and determining the data category and data level of the data set to be classified and graded.
According to this technical solution, the field category of at least one rule in the rule table is obtained. For a single field category, a field value generation prompt word is obtained, and according to it a large language model performs semantic understanding on the field category to obtain a preset number of field values corresponding to that field category. A field name corresponding to the field category is generated, and training samples of the large language model are generated from each field name and its preset number of field values. This avoids a shortage of training samples caused by the privacy of service system data: the training samples are expanded from the field categories in the published rule table, which balances the information security of the service system against the training accuracy of the large language model.
Optionally, the data classification grading system further comprises a data generation module and a model training module.
The data generation module is used for obtaining a rule table from the industry rule management module, parsing the field names contained in the rule table, generating a preset number of field values in combination with the field value generation prompt word from the prompt word management module, and storing the generated field values and the corresponding field names in the data management module as training samples. This addresses the shortage of training data for the large language model in the initial period.
The model training module is used for obtaining training samples from the data management module, obtaining prompt words of the large language model from the prompt word management module, obtaining structural definition and parameters of the large language model from the model management module, and training the model to execute data classification and grading.
Fig. 5 is a flowchart of a data classification and classification method according to an embodiment of the present invention. On the basis of the above embodiment, fig. 5 is a preferred embodiment of the present invention.
Referring to the data classification and classification method shown in fig. 5, the method includes:
S501, extracting a rule table from the industry rule management module; wherein the rule table includes rule categories, rule descriptions, and rule levels for at least one rule.
S502, analyzing the field category from the rule description of each rule.
Specifically, the content between "such as" and "etc." in the rule description may be extracted. This content may then be split at the enumeration commas to obtain at least one field category corresponding to the rule description.
S503, loading the field value generation prompt word from the prompt word management module.
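The parsing in S502 might look like the sketch below. It assumes English stand-ins for the markers ("such as" and "etc.") and an ordinary comma as the separator; the markers and separator used in the original-language rule table are not given here, so all three are exposed as parameters. The function name and the example rule description are hypothetical.

```python
import re


def parse_field_categories(
    rule_description: str,
    start_marker: str = "such as",  # assumed field category identifier (opening)
    end_marker: str = "etc.",       # assumed field category identifier (closing)
    separator: str = ",",           # assumed preset separator
) -> list[str]:
    """Extract the text between the two markers and split it into field categories."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, rule_description, flags=re.DOTALL)
    if match is None:
        return []  # the description lists no example field categories
    parts = match.group(1).split(separator)
    return [part.strip() for part in parts if part.strip()]


categories = parse_field_categories(
    "Personal identity information, such as name, ID number, passport number, etc."
)
```

A rule description without the markers yields an empty list, so the caller can decide whether to skip the rule or fall back to the rule category itself.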
Exemplarily, the field value generation prompt word may be "please generate 10 {cate}, including {fields}. Format: [line feed] Sequence number: [line feed] {fmt}", where {cate} is a rule category; {fields} is the field categories separated by enumeration commas; and {fmt} is the field categories separated by line feeds, each followed by a colon.
S504, inputting the field names contained in the rule table and the field value generation prompt word into the data generation module, generating a corresponding content data set, determining the content data set as training samples, and storing the training samples in the data management module.
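Filling such a template can be sketched as follows. This assumes that "Format", "Sequence number", and the {fmt} block each sit on their own line and that the field categories are joined with a comma; the English wording of the prompt, the function name, and the example categories are illustrative assumptions rather than the stored prompt word itself.

```python
def build_field_value_prompt(
    rule_category: str,
    field_categories: list[str],
    count: int = 10,
    list_separator: str = ", ",  # assumed stand-in for the enumeration comma
) -> str:
    """Fill the {cate}, {fields}, and {fmt} placeholders of the prompt template."""
    fields = list_separator.join(field_categories)
    # {fmt}: one field category per line, each followed by a colon.
    fmt = "\n".join(f"{category}:" for category in field_categories)
    return (
        f"Please generate {count} {rule_category}, including {fields}.\n"
        "Format:\n"
        "Sequence number:\n"
        f"{fmt}"
    )


prompt = build_field_value_prompt(
    "personal identity information", ["name", "ID number"]
)
```

Keeping the count and separator as parameters lets the same function serve both the short "please generate 10: {fields}" style and the longer formatted style.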
S505, calling a classification and grading module to classify and grade the newly generated content data set, and storing the question-answer history record of the large language model based on the prompt words in the data management module.
S506, manually checking the newly generated question and answer history, and checking whether the field description summary text, the field description keywords and the initial category which are inferred by the large language model are reasonable and reliable.
S507, training a large language model by adopting a back propagation and gradient descent method according to the verified question-answer history, and storing the trained parameters into a model management module to obtain an improved large language model.
S508, classifying and grading the data set to be classified by using the improved large language model, and returning to execute S506 and S507 after classifying and grading is finished, so that the quality of the large language model is continuously improved.
According to this scheme, a large language model is used to expand the field categories and generate corresponding training samples, so that the training accuracy of the large language model is not impaired by the privacy or scarcity of service system data, which improves the accuracy of the large language model. In addition, the large language model continues to be trained after data has been classified and graded, which further improves its accuracy.
Example IV
Fig. 6 is a schematic structural diagram of a data classification and grading device according to a fourth embodiment of the present invention. This embodiment is applicable to classifying and grading the service data of a service system. The device can execute a data classification and grading method, can be implemented in hardware and/or software, and can be configured in electronic equipment that carries the data classification and grading function.
Referring to fig. 6, the data classification and grading device comprises: a to-be-classified data set acquisition module 601, a first field description determination module 602, a rule category acquisition module 603, and a data classification and grading module 604. The to-be-classified data set acquisition module 601 is configured to acquire a field name and a corresponding field value of at least one field in the data set to be classified and graded; the first field description determination module 602 is configured to, when the data set to be classified and graded is a content data set, acquire a field description prompt word for a single field and perform semantic understanding on the field value by using a large language model according to the field description prompt word, so as to obtain a field description of the field value; the rule category acquisition module 603 is configured to acquire a rule category, a rule level, and a field category of at least one rule in the rule table, wherein the rule is used for determining a processing mode of the data set to be classified and graded; and the data classification and grading module 604 is configured to, for a single field, calculate the similarity between the field and each rule according to the field name, field description, and corresponding field value of the field and the rule category and field category of each rule, obtain at least one rule with higher similarity corresponding to the field, and determine the data category and data level of the data set to be classified and graded.
According to the technical solution of this embodiment, the field name and corresponding field value of at least one field in the data set to be classified and graded are acquired. When the data set is a content data set, a field description prompt word is acquired for a single field, and a large language model performs semantic understanding on the field value according to the field description prompt word to obtain the field description of the field value. The rule category, rule level, and field category of at least one rule in the rule table are acquired. For a single field, the similarity between the field and each rule is calculated according to the field name, field description, and corresponding field value of the field and the rule category and field category of each rule, at least one rule with higher similarity is obtained, and the data category and data level of the data set are determined. Introducing a large language model makes it possible to identify word-frequency patterns in the field values and ensures the accuracy of the determined field descriptions, which addresses the limitations of data classification and grading and improves the generalization of the method. Introducing a rule table enables the data set to be classified and graded and its data processing mode to be determined. In addition, applying the large language model to the data classification task on the basis of the rule table achieves accurate data classification, improves the efficiency and accuracy of data classification, and ensures accurate processing of service data in the service system.
In an alternative embodiment of the present invention, the first field description determination module 602 includes: a field description summary text generation unit, configured to acquire a field summary prompt word for a single field and perform semantic understanding on the field value by using a large language model according to the field summary prompt word, so as to obtain a field description summary text of the field value; a field description keyword generation unit, configured to acquire a keyword extraction prompt word for a single field and perform semantic understanding on the field description summary text by using a large language model according to the keyword extraction prompt word, so as to obtain at least one field description keyword of the field description summary text; and a first field description determination unit, configured to determine each field description keyword as the field description of the field value.
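The two-stage flow (summarize, then extract keywords) can be sketched as below. The large language model is represented by an injected callable, since the source does not name a model or an API; the prompt wording, the comma-separated keyword format, and the stand-in `fake_llm` are all assumptions made only for illustration.

```python
from typing import Callable

# Hypothetical prompt words; the stored prompt words are not given in the source.
SUMMARY_PROMPT = "Summarize what these field values represent:\n{values}"
KEYWORD_PROMPT = "Extract keywords from this summary, separated by commas:\n{summary}"


def describe_field(field_values: list[str], llm: Callable[[str], str]) -> list[str]:
    """Return field description keywords obtained in two model calls."""
    # Stage 1: field summary prompt word -> field description summary text.
    summary = llm(SUMMARY_PROMPT.format(values="\n".join(field_values)))
    # Stage 2: keyword extraction prompt word -> field description keywords.
    raw_keywords = llm(KEYWORD_PROMPT.format(summary=summary))
    return [word.strip() for word in raw_keywords.split(",") if word.strip()]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("Extract"):
        return "phone, contact"
    return "These values look like phone numbers."


keywords = describe_field(["555-0100", "555-0101"], fake_llm)
```

Passing the model as a callable keeps the sketch independent of any particular inference library.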
In an alternative embodiment of the present invention, the data classification ranking module 604 includes: the first field splicing result determining unit is used for splicing the field name, the field description and the field value aiming at the single field to obtain a field splicing result of the field; the initial category determining module is used for acquiring a field classification prompt word for a single field after the field name, the field description and the field value are spliced for the single field to obtain a field splicing result of the field, and carrying out semantic understanding on the field splicing result by adopting a large language model according to the field classification prompt word to obtain an initial category of the single field; the rule splicing result determining unit is used for splicing the rule category, the rule level and the field category aiming at a single rule to obtain a rule splicing result corresponding to the rule; the rule similarity matching unit is used for calculating the similarity between the initial category and each rule splicing result according to the initial category of the single field to obtain at least one rule with higher similarity corresponding to the field.
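The rule splicing and similarity matching described above can be sketched as follows. The source does not state which similarity measure is used, so this example assumes a simple bag-of-words cosine similarity; the dictionary keys, function names, and sample rules are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def cosine_similarity(left: str, right: str) -> float:
    """Bag-of-words cosine similarity (an assumed measure, not the source's)."""
    a, b = Counter(tokenize(left)), Counter(tokenize(right))
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def top_rules(initial_category: str, rules: list[dict], k: int = 1) -> list[dict]:
    """Splice each rule into one string and return the k most similar rules."""
    scored = []
    for rule in rules:
        spliced = " ".join([rule["category"], rule["level"], *rule["field_categories"]])
        scored.append((cosine_similarity(initial_category, spliced), rule))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rule for _, rule in scored[:k]]


rules = [
    {"category": "personal contact information", "level": "level 3",
     "field_categories": ["phone number", "email address"]},
    {"category": "device information", "level": "level 2",
     "field_categories": ["device model", "serial number"]},
]
best = top_rules("contact phone number", rules)
```

The data category and data level can then be read from the best-matching rule; an embedding-based similarity could replace `cosine_similarity` without changing the rest of the flow.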
In an alternative embodiment of the present invention, the apparatus further comprises a training sample generation module, wherein the training sample generation module includes: a field category acquisition unit, configured to acquire a field category of at least one rule in the rule table; a field value generation unit, configured to acquire a field value generation prompt word for a single field category and perform semantic understanding on the field category by using a large language model according to the field value generation prompt word, so as to obtain a preset number of field values corresponding to the single field category; a field name generation unit, configured to generate a field name corresponding to the field category; and a training sample generation unit, configured to generate training samples of the large language model according to each field name and the corresponding preset number of field values, so as to pre-train the large language model through the training samples.
In an alternative embodiment of the present invention, the field class obtaining unit includes: a rule description acquisition subunit, configured to acquire a rule description of at least one rule in the rule table; the first field category determining subunit is used for splitting a single rule description according to a preset field category identifier and a preset separator to obtain at least one field category corresponding to the rule description; and the second field category determining subunit is used for determining the field category of at least one rule in the rule table according to the field category corresponding to each rule description.
In an alternative embodiment of the present invention, the first field description determination module 602 includes: the field value sampling result determining unit is used for sampling the field values contained in the content data set aiming at the single field to obtain a field value sampling result corresponding to the field; the second field description determining unit is used for acquiring field description prompt words for single fields, and carrying out semantic understanding on the field value sampling result by adopting a large language model according to the field description prompt words to obtain the field description of the field value.
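Sampling field values before prompting the model can be sketched as below. The source does not specify a sampling strategy, so this example assumes de-duplication followed by a seeded random sample; the prompt wording and function names are hypothetical.

```python
import random
from typing import Optional


def sample_field_values(
    values: list[Optional[str]], sample_size: int = 5, seed: int = 0
) -> list[str]:
    """Drop empty values and duplicates, then sample at most sample_size values."""
    distinct = list(dict.fromkeys(v for v in values if v not in (None, "")))
    if len(distinct) <= sample_size:
        return distinct
    # A fixed seed keeps the sample reproducible between runs.
    return random.Random(seed).sample(distinct, sample_size)


def build_field_description_prompt(field_name: str, sampled: list[str]) -> str:
    """Hypothetical field description prompt built from the sampling result."""
    listing = "\n".join(f"- {value}" for value in sampled)
    return (
        f"Describe what the field '{field_name}' contains, "
        f"given these sample values:\n{listing}"
    )


sampled = sample_field_values(["a", "b", "a", "", None, "c"])
prompt = build_field_description_prompt("code", sampled)
```

Sampling bounds the prompt length for large content data sets, at the cost of possibly missing rare value patterns.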
In an alternative embodiment of the invention, the apparatus further comprises: the second field description determining module is used for acquiring a field name and a corresponding field description for a single field when the hierarchical data set to be classified is a metadata set before acquiring the rule category, the rule level and the field category of at least one rule in the rule table; the second field splicing result determining module is used for splicing the field name, the field description and the field value aiming at the single field to obtain a field splicing result of the field; wherein the field value is null.
The data classification and classification device provided by the embodiment of the invention can execute the data classification and classification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
In the technical solution of the present invention, the acquisition, storage, and application of the data involved all comply with the provisions of relevant laws and regulations. This data includes the field name and corresponding field value of at least one field in the data set to be classified and graded; the field description prompt word; the rule category, rule level, field category, and rule description of at least one rule in the rule table; the field summary prompt word; the keyword extraction prompt word; the field classification prompt word; and the field value generation prompt word.
Example V
Fig. 7 shows a schematic diagram of an electronic device 700 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes at least one processor 701, and a memory, such as a Read Only Memory (ROM) 702, a Random Access Memory (RAM) 703, etc., communicatively connected to the at least one processor 701, in which the memory stores a computer program executable by the at least one processor, and the processor 701 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 702 or the computer program loaded from the storage unit 708 into the Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The processor 701 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 701 performs the various methods and processes described above, such as the data classification and grading method.
In some embodiments, the data classification ranking method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When a computer program is loaded into RAM 703 and executed by processor 701, one or more steps of the data classification ranking method described above may be performed. Alternatively, in other embodiments, the processor 701 may be configured to perform the data classification ranking method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of classifying and ranking data, the method comprising:
acquiring a field name and a corresponding field value of at least one field in the hierarchical data set to be classified;
when the hierarchical data set to be classified is a content data set, acquiring a field description prompt word for a single field, and carrying out semantic understanding on the field value by adopting a large language model according to the field description prompt word to obtain the field description of the field value;
acquiring rule category, rule level and field category of at least one rule in the rule table; the rule is used for determining a processing mode of the data set to be classified and graded;
and, for a single field, calculating the similarity between the field and each rule according to the field name, the field description and the corresponding field value of the field and the rule category and the field category of each rule, to obtain at least one rule with higher similarity corresponding to the field, and determining the data category and the data level of the data set to be classified and graded.
2. The method of claim 1, wherein the obtaining the field description hint word for a single field and performing semantic understanding on the field value according to the field description hint word using a large language model to obtain the field description of the field value includes:
aiming at a single field, acquiring a field summary prompt word, and carrying out semantic understanding on the field value by adopting a large language model according to the field summary prompt word to obtain a field description summary text of the field value;
aiming at a single field, acquiring a keyword extraction prompt word, and carrying out semantic understanding on the field description summary text by adopting a large language model according to the keyword extraction prompt word to obtain at least one field description keyword of the field description summary text;
and determining each field description keyword as the field description of the field value.
3. The method according to claim 1, wherein for each of the fields, calculating the similarity between the field and each of the rules according to the field name, the field description, the corresponding field value, the rule category and the field category of each of the rules, to obtain at least one rule with a higher similarity corresponding to the field includes:
aiming at the single field, splicing the field name, the field description and the field value to obtain a field splicing result of the field;
aiming at a single field, acquiring a field classification prompt word, and carrying out semantic understanding on the field splicing result by adopting a large language model according to the field classification prompt word to obtain the initial category of the single field;
aiming at a single rule, splicing the rule category and the field category to obtain a rule splicing result corresponding to the rule;
and calculating the similarity between the initial category and each rule splicing result aiming at the initial category of the single field to obtain at least one rule with higher similarity corresponding to the field.
4. The method of claim 1, wherein the process of generating training samples for the large language model comprises:
acquiring a field category of at least one rule in a rule table;
aiming at a single field category, acquiring a field value generation prompt word, and carrying out semantic understanding on the field category by adopting a large language model according to the field value generation prompt word to obtain a preset number of field values corresponding to the single field category;
generating a field name corresponding to the field category;
and generating training samples of the large language model according to the field names and the corresponding preset number of field values so as to pre-train the large language model through the training samples.
5. The method of claim 4, wherein the obtaining a field class of at least one rule in the rule table comprises:
acquiring rule description of at least one rule in a rule table;
splitting the rule description according to a preset field category identifier and a preset separator to obtain at least one field category corresponding to the rule description;
and determining the field category of at least one rule in the rule table according to the field category corresponding to each rule description.
6. The method of claim 1, wherein the obtaining the field description hint word for a single field and performing semantic understanding on the field value according to the field description hint word using a large language model to obtain the field description of the field value includes:
sampling field values contained in the content data set aiming at a single field to obtain a field value sampling result corresponding to the field;
and aiming at a single field, acquiring a field description prompt word, and carrying out semantic understanding on the field value sampling result by adopting a large language model according to the field description prompt word to obtain the field description of the field value.
7. The method of claim 1, further comprising, prior to the obtaining the rule category, rule level, and field category of at least one rule in the rule table:
when the hierarchical data set to be classified is a metadata set, aiming at a single field, acquiring a field name and a corresponding field description;
aiming at the single field, splicing the field name, the field description and the field value to obtain a field splicing result of the field; wherein the field value is null.
8. A data classification and ranking apparatus, the apparatus comprising:
the classified data set acquisition module is used for acquiring a field name and a corresponding field value of at least one field in the classified data set;
the first field description determining module is used for acquiring field description prompt words for a single field when the hierarchical data set to be classified is a content data set, and carrying out semantic understanding on the field values by adopting a large language model according to the field description prompt words to obtain field descriptions of the field values;
the rule category acquisition module is used for acquiring rule categories, rule levels and field categories of at least one rule in the rule table; the rule is used for determining a processing mode of the grading data set to be classified;
and the data classification and grading module is used for calculating the similarity between the field and each rule according to the field name, the field description, the corresponding field value, the rule category and the field category of each rule of the field to obtain at least one rule with higher similarity corresponding to the field, and determining the data category and the data grade of the classified data set to be classified.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data classification method of any of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the data classification method of any of claims 1-7.
CN202410044807.9A 2024-01-12 2024-01-12 Data classification and classification method and device, electronic equipment and storage medium Active CN117556050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410044807.9A CN117556050B (en) 2024-01-12 2024-01-12 Data classification and classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410044807.9A CN117556050B (en) 2024-01-12 2024-01-12 Data classification and classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117556050A (en) 2024-02-13
CN117556050B (en) 2024-04-12

Family

ID=89823674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410044807.9A Active CN117556050B (en) 2024-01-12 2024-01-12 Data classification and classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117556050B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180014198A1 (en) * 2015-01-20 2018-01-11 Samsung Electronics Co., Ltd. Apparatus and method for enhancing personal information data security
CN114021184A (en) * 2021-10-28 2022-02-08 深圳乐信软件技术有限公司 Data management method and device, electronic equipment and storage medium
CN114154198A (en) * 2021-12-03 2022-03-08 建信金融科技有限责任公司 Data processing method and device
CN115438129A (en) * 2022-09-30 2022-12-06 深圳市梦网视讯有限公司 Structured data classification method and device and terminal equipment
CN115688737A (en) * 2022-11-07 2023-02-03 北京航空航天大学 Paper cold start disambiguation method based on feature extraction and fusion
CN115767550A (en) * 2022-12-02 2023-03-07 北京亚鸿世纪科技发展有限公司 Network risk assessment method and device for 5G private network
CN116975400A (en) * 2023-08-03 2023-10-31 星环信息科技(上海)股份有限公司 Data hierarchical classification method and device, electronic equipment and storage medium
CN117313683A (en) * 2023-11-15 2023-12-29 中国联合网络通信集团有限公司 Metadata processing method, device, server and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
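Chen Feiqi, Wu Hao, Wang Wei, Huang Jin: "Exploration and Practice of Data Security System Construction in Small and Medium-Sized Commercial Banks" (中小商业银行数据安全体系建设探索与实践), Digital Communication World (《数字通信世界》), 31 May 2023 (2023-05-31) *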

Also Published As

Publication number Publication date
CN117556050B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN112468659B (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
CN112417102A (en) Voice query method, device, server and readable storage medium
US20220391426A1 (en) Multi-system-based intelligent question answering method and apparatus, and device
US20220261545A1 (en) Systems and methods for producing a semantic representation of a document
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
EP4141697A1 (en) Method and apparatus of processing triple data, method and apparatus of training triple data processing model, device, and medium
CN112990035A (en) Text recognition method, device, equipment and storage medium
CN114003682A (en) Text classification method, device, equipment and storage medium
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN117556050B (en) Data classification and classification method and device, electronic equipment and storage medium
KR20220024251A (en) Method and apparatus for building event library, electronic device, and computer-readable medium
CN115292506A (en) Knowledge graph ontology construction method and device applied to office field
CN113641778A (en) Topic identification method for dialog text
CN115600580B (en) Text matching method, device, equipment and storage medium
CN114201607B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant