CN113705728B

CN113705728B - Intelligent marking method for a classified and graded list

Info

Publication number: CN113705728B
Application number: CN202111102610.9A
Authority: CN
Inventors: 卢红波; 张林成
Original assignee: Quanzhi Technology Hangzhou Co ltd
Current assignee: Quanzhi Technology Hangzhou Co ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2023-08-01
Anticipated expiration: 2041-09-18
Also published as: CN113705728A

Abstract

The invention discloses an intelligent marking method for a classified and graded list and relates to the technical field of communication. It addresses the technical problem that, at large data scales, marking a data list is time-consuming and labor-intensive and the marking quality is low. In the technical scheme, table and field information is read from different databases and unified into a common text-line format; preprocessing, Chinese and English word segmentation, and English-to-Chinese translation are then carried out. Because the content of the text lines is diverse and complex, the text lines are first coarsely classified to obtain text lines carrying coarse marking results, and a fastText model is then trained on them and used for short text classification. fastText is a fast, high-quality model for short text classification; training and testing on the text lines yields reasonable marking results, so that marking is performed intelligently and time and labor are saved.

Description

Intelligent marking method for a classified and graded list

Technical Field

The invention relates to the technical field of communication, in particular to an intelligent marking method for a classified and graded list.

Background

Short text classification is widely used in scenarios such as public opinion classification and news classification. In the field of data security, however, the classification and grading of data lists runs into a major bottleneck in practice. A data list generally consists of databases with different storage modes and of table and field information with different naming conventions, and its size ranges from tens of thousands to millions of entries. At this scale, classifying and marking the table and field information is a serious obstacle: it places demands on both the number and the expertise of the marking personnel and takes days to months. The quality of the marking also becomes an important issue.

Disclosure of Invention

The invention aims to provide an intelligent marking method for a classified and graded list that performs classification and grading through short text classification (extracting keywords from the short texts, vectorizing them, clustering them and classifying them), so that marking is intelligent and reasonable, time and labor are saved, and marking quality is improved.

In order to achieve the above object, the present invention provides the following technical solutions: an intelligent marking method for a classified and graded list comprises the following steps:

S1, reading table and field information from different databases and processing it into text lines of a uniform format, where each text line contains the field name, field note, table name and table note; reading all labels at the same time; preprocessing the text lines, including removing stop words and punctuation marks; then performing Chinese and English word segmentation on the text lines and Chinese word segmentation on the labels; translating the segmented English words into the corresponding Chinese through an English-Chinese lexicon; the segmented text lines obtained at this point are recorded as TEXT, and each text line is divided by content into field information and table information, recorded as TEXT_FIELD and TEXT_TABLE respectively; the segmented labels are divided into labels and background labels according to whether they denote a specific field, recorded as LABEL and LABEL_BG respectively; the number of labels is denoted CLASS_NUM; at this point, segmentation of the text lines and labels is complete;

S2, text-label matching; traversing TEXT and, for the TEXT_FIELD of each TEXT, traversing LABEL and recording the number of words in the field information that match each label, giving a list of length CLASS_NUM; the maximum value in the list is recorded as MATCH_MAX; depending on whether MATCH_MAX is unique, there are the following two cases:

S2.1, MATCH_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX;

S2.2, MATCH_MAX is not unique: all labels whose match count equals MATCH_MAX are obtained and recorded as MATCH_MAX_LABEL, and their number is recorded as MATCH_MAX_LABEL_NUM; the TEXT_TABLE of the TEXT is traversed, and the number of words in the table information that match the background labels is recorded, giving a list of length MATCH_MAX_LABEL_NUM; the maximum value in the list is recorded as MATCH_TABLE_MAX; depending on whether MATCH_TABLE_MAX is unique, there are again two cases:

S2.2.1, MATCH_TABLE_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX;

S2.2.2, MATCH_TABLE_MAX is not unique: all labels from S2.2 whose match count equals MATCH_TABLE_MAX are obtained and recorded as MATCH_TABLE_MAX_LABEL; the TEXT_FIELD of the TEXT is traversed against each LABEL in MATCH_TABLE_MAX_LABEL, and the number of words matched by the field and the proportion of that number to the total number of words in the LABEL are recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively; the value MATCH_CHAR + MATCH_CHAR_RATIO is recorded as the matching value MATCH_VALUE, giving a list of matching values, and the label with the highest matching value is selected as the coarse-classification label; at this point, coarse classification of most text lines is complete, following the priority order of field-information match count, table-information match count, and field matching-word count;

S3, where, according to actual conditions, the S2 coarse classification yields extremely few text lines for some labels, performing a small amount of data augmentation on those text lines;

S4, fastText-based model training and short text classification; the short texts from S3 are converted into the input format read by the fastText algorithm, for example by converting labels to indices; a confidence threshold is set and recorded as THRESHOLD, for example THRESHOLD = 0.9, meaning that a text line is marked with the corresponding label when the confidence exceeds 0.9; TEXT is traversed, and each TEXT is marked according to the comparison between its confidence and THRESHOLD.

In step S4, the fastText-based model uses N-gram features and hierarchical softmax.
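The format conversion and thresholding in S4 can be illustrated with a minimal sketch. It assumes fastText's default supervised input format, in which each line starts with `__label__<label>` followed by space-separated tokens; the helper names `build_label_index`, `to_fasttext_line` and `mark`, and the sample labels, are hypothetical and not taken from the patent. With the `fasttext` Python package, training would typically go through `fasttext.train_supervised`, where `wordNgrams` and `loss="hs"` select N-grams and hierarchical softmax; the sketch leaves that dependency out and shows only the data handling around it.

```python
from typing import Dict, List, Optional, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix
THRESHOLD = 0.9             # example value given in the text


def build_label_index(labels: List[str]) -> Dict[str, int]:
    """Map each label name to an integer index (the 'index conversion')."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}


def to_fasttext_line(tokens: List[str], label: str, index: Dict[str, int]) -> str:
    """Format one coarsely labelled text line as '__label__<idx> tok tok ...'."""
    return f"{LABEL_PREFIX}{index[label]} {' '.join(tokens)}"


def mark(prediction: Tuple[str, float], index: Dict[str, int],
         threshold: float = THRESHOLD) -> Optional[str]:
    """Return the label name if the confidence exceeds the threshold, else None."""
    raw_label, confidence = prediction
    if confidence <= threshold:
        return None  # assumption: low-confidence lines are left unmarked
    idx = int(raw_label[len(LABEL_PREFIX):])
    names = {i: name for name, i in index.items()}
    return names[idx]


# Hypothetical labels and one segmented text line.
index = build_label_index(["phone", "address", "phone"])
line = to_fasttext_line(["用户", "手机", "号码"], "phone", index)
```

Here `mark` takes a `(label, confidence)` pair in the shape a classifier would return for its top prediction. What happens to text lines that stay below the threshold is not specified in the text, so returning `None` is only one possible choice.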

The invention provides an intelligent marking method for a classified and graded list. The text of such a list usually consists of tables and fields with differing rules and naming conventions, so marking by rule-strategy matching achieves a low marking rate and a high misjudgment rate; and because text quality is usually low, clustering results are also unsatisfactory. The invention instead uses text-label matching as the standard method for coarse classification: text lines are coarsely classified according to the priority order field-information match count > table-information match count > field matching-word count. The coarse classification result is then learned and tested with fastText, so that the marking rate is greatly improved over the prior art and the misjudgment rate is significantly reduced. The matching words used in coarse classification are obtained through testing, that is, from experience. Some labels are overly long or contain redundant content, so that they often contain words from several other labels, which interferes with normal coarse classification; the misjudgment bottleneck in coarse classification is therefore resolved by taking the weighted sum of the number of words matched by the field and the proportion of matched words to the total number of words in the LABEL as the classification criterion. Marking is thus performed intelligently, saving time and labor.
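The three-level priority described above can be sketched as a small cascade. This is an illustrative reading, not the patented implementation, and it rests on several assumptions: labels are represented as lists of segmented words; a text line with no matching field word is left unlabelled; MATCH_CHAR is counted in characters of the matched label words (suggested by the identifier name, whereas the text speaks of words); and a remaining tie on MATCH_VALUE is broken by taking the first candidate. The function names and sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def count_matches(words: Sequence[str], label_words: Sequence[str]) -> int:
    """Number of words in `words` that also occur in `label_words`."""
    vocab = set(label_words)
    return sum(1 for w in words if w in vocab)


def argmax_all(scores: Sequence[float], ids: Sequence[int]) -> List[int]:
    """Return every id whose score equals the maximum score."""
    best = max(scores)
    return [i for i, s in zip(ids, scores) if s == best]


def coarse_label(text_field: List[str], text_table: List[str],
                 label: List[List[str]], label_bg: List[List[str]]) -> Optional[int]:
    ids = list(range(len(label)))  # CLASS_NUM candidate labels
    # S2: field words against LABEL
    field_counts = [count_matches(text_field, label[i]) for i in ids]
    if max(field_counts) == 0:
        return None  # assumption: no match leaves the line unlabelled
    match_max_label = argmax_all(field_counts, ids)
    if len(match_max_label) == 1:  # S2.1: MATCH_MAX is unique
        return match_max_label[0]
    # S2.2: table words against LABEL_BG, restricted to the tied labels
    table_counts = [count_matches(text_table, label_bg[i]) for i in match_max_label]
    match_table_max_label = argmax_all(table_counts, match_max_label)
    if len(match_table_max_label) == 1:  # S2.2.1: MATCH_TABLE_MAX is unique
        return match_table_max_label[0]
    # S2.2.2: MATCH_VALUE = MATCH_CHAR + MATCH_CHAR_RATIO
    field_vocab = set(text_field)
    values = []
    for i in match_table_max_label:
        match_char = sum(len(w) for w in label[i] if w in field_vocab)
        total = sum(len(w) for w in label[i])
        ratio = match_char / total if total else 0.0
        values.append(match_char + ratio)
    return argmax_all(values, match_table_max_label)[0]


# Hypothetical labels: label 1 is a longer label that contains all of label 0.
LABEL = [["手机", "号码"], ["手机", "号码", "归属", "地区"], ["地址"]]
LABEL_BG = [["用户"], ["用户", "运营商"], ["物流"]]

chosen = coarse_label(["手机", "号码"], ["用户", "信息"], LABEL, LABEL_BG)
```

In this example labels 0 and 1 tie on both the field count and the table count, and the ratio term then favours the shorter label 0, which is the effect the paragraph attributes to MATCH_CHAR_RATIO for overly long labels.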

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.

FIG. 1 is a flow chart of the present invention.

Detailed Description

In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.

The invention provides an intelligent marking method for a classified and graded list, as shown in FIG. 1:

1. First, table and field information is read from different databases and processed into text lines of a uniform format, where each text line contains four parts: field name, field note, table name and table note; all labels are read at the same time; the text lines are preprocessed, including removal of stop words and punctuation marks; Chinese and English word segmentation is then performed on the text lines, and Chinese word segmentation on the labels; the segmented English words are translated into the corresponding Chinese through an English-Chinese lexicon; the segmented text lines obtained at this point are recorded as TEXT, and each text line is divided by content into field information and table information, recorded as TEXT_FIELD and TEXT_TABLE respectively; the segmented labels are divided into labels and background labels according to whether they denote a specific field, recorded as LABEL and LABEL_BG respectively; the number of labels is denoted CLASS_NUM; at this point, segmentation of the text lines and labels is complete;

2. Second, coarse text classification by text-label matching; TEXT is traversed and, for the TEXT_FIELD of each TEXT, LABEL is traversed and the number of words in the field information that match each label is recorded, giving a list of length CLASS_NUM; the maximum value in the list is recorded as MATCH_MAX; depending on whether MATCH_MAX is unique, there are two cases; (2.1) MATCH_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX; (2.2) MATCH_MAX is not unique: all labels whose match count equals MATCH_MAX are obtained and recorded as MATCH_MAX_LABEL, and their number is recorded as MATCH_MAX_LABEL_NUM; the TEXT_TABLE of the TEXT is traversed against MATCH_MAX_LABEL, and the number of words in the table information that match the background labels is recorded, giving a list of length MATCH_MAX_LABEL_NUM; the maximum value in the list is recorded as MATCH_TABLE_MAX; depending on whether MATCH_TABLE_MAX is unique, there are again two cases; (2.2.1) MATCH_TABLE_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX; (2.2.2) MATCH_TABLE_MAX is not unique: all labels from 2.2 whose match count equals MATCH_TABLE_MAX are obtained and recorded as MATCH_TABLE_MAX_LABEL; the TEXT_FIELD of the TEXT is traversed against each LABEL in MATCH_TABLE_MAX_LABEL, and the number of words matched by the field and the proportion of that number to the total number of words in the LABEL are recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively; the value MATCH_CHAR + MATCH_CHAR_RATIO is recorded as the matching value MATCH_VALUE, giving a list of matching values, and the label with the highest matching value is selected as the coarse-classification label; at this point, coarse classification of most text lines is complete, following the priority order of field-information match count, table-information match count, and field matching-word count;

3. Depending on the actual situation, where the coarse classification yields very few text lines for some labels, a small amount of data augmentation can be performed on those text lines;

4. Next, fastText-based model training and short text classification; fastText is a fast, high-accuracy model for text classification; its structure is simple, and its implementation uses N-grams, which improve accuracy, and hierarchical softmax, which improves speed; with accuracy comparable to that of deep learning models, its training and testing are several orders of magnitude faster; the short texts from step 3 are converted into the input format read by the fastText algorithm, for example by converting labels to indices; a confidence threshold is set and recorded as THRESHOLD, for example THRESHOLD = 0.9, meaning that a text line is marked with the corresponding label when the confidence exceeds 0.9; TEXT is traversed, and each TEXT is marked according to the comparison between its confidence and THRESHOLD;

5. At this point, the invention has completed the whole process from reading table and field information to marking, solving the technical problem that marking a classified and graded list is time-consuming and inefficient.
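Step 1 of the process above (unifying a record into a text line, removing stop words and punctuation, segmenting Chinese and English, and translating English words) can be sketched with the standard library alone. Everything named here is an assumption for illustration: the stop-word set, the English-Chinese lexicon and the Chinese word list are tiny hypothetical stand-ins, Chinese segmentation is done by simple forward maximum matching rather than a dedicated segmenter such as jieba, and stop words are filtered after tokenization for simplicity.

```python
import re
from typing import Dict, List, Set

STOP_WORDS: Set[str] = {"的", "表", "t", "tbl"}                                # hypothetical
EN_TO_ZH: Dict[str, str] = {"user": "用户", "phone": "手机", "info": "信息"}  # hypothetical
ZH_LEXICON: Set[str] = {"用户", "手机", "号码", "信息"}                        # hypothetical


def split_english(token: str) -> List[str]:
    """Split camelCase / upper-case identifiers into lower-case words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", token)
    return [p.lower() for p in parts]


def segment_chinese(run: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching; unknown characters become single tokens."""
    out: List[str] = []
    i = 0
    while i < len(run):
        for n in range(min(max_len, len(run) - i), 0, -1):
            if n == 1 or run[i:i + n] in lexicon:
                out.append(run[i:i + n])
                i += n
                break
    return out


def preprocess(raw: str) -> List[str]:
    """Tokenize one piece of text into Chinese words, dropping stop words."""
    tokens: List[str] = []
    # Keeping only Latin/digit runs and CJK runs drops punctuation and underscores.
    for chunk in re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]+", raw):
        if re.match(r"[A-Za-z0-9]", chunk):
            words = [EN_TO_ZH.get(w, w) for w in split_english(chunk)
                     if w not in STOP_WORDS]
        else:
            words = [w for w in segment_chinese(chunk, ZH_LEXICON)
                     if w not in STOP_WORDS]
        tokens.extend(words)
    return tokens


def build_text(field_name: str, field_note: str,
               table_name: str, table_note: str) -> Dict[str, List[str]]:
    """Unify one record into a text line split into field and table information."""
    return {
        "TEXT_FIELD": preprocess(f"{field_name} {field_note}"),
        "TEXT_TABLE": preprocess(f"{table_name} {table_note}"),
    }


text = build_text("userPhone", "用户的手机号码。", "t_user_info", "用户信息表")
```

English words missing from the lexicon are kept untranslated in this sketch; the text does not say how such words are handled.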

While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that modifications may be made to the described embodiments in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive of the scope of the invention, which is defined by the appended claims.

Claims

1. An intelligent marking method for classified and graded lists is characterized in that,

S2.1, MATCH_MAX is unique, and the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX;

S2.2, all labels whose match count equals MATCH_MAX are obtained and recorded as MATCH_MAX_LABEL, and their number is recorded as MATCH_MAX_LABEL_NUM; the TEXT_TABLE of the TEXT is traversed against MATCH_MAX_LABEL, and the number of words in the table information that match the background labels is recorded, giving a list of length MATCH_MAX_LABEL_NUM; the maximum value in the list is recorded as MATCH_TABLE_MAX; depending on whether MATCH_TABLE_MAX is unique, there are again two cases:

S2.2.1, MATCH_TABLE_MAX is unique, and the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX;

S2.2.2, all labels from S2.2 whose match count equals MATCH_TABLE_MAX are obtained and recorded as MATCH_TABLE_MAX_LABEL; the TEXT_FIELD of the TEXT is traversed against each LABEL in MATCH_TABLE_MAX_LABEL, and the "number of words matched by the field" and the "proportion of the number of matched words to the total number of words in the LABEL" are recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively; the value MATCH_CHAR + MATCH_CHAR_RATIO is recorded as the matching value MATCH_VALUE, giving a list of matching values, and the label with the highest matching value is selected as the coarse-classification label; at this point, coarse classification of most text lines is complete, following the priority order of field-information match count, table-information match count, and field matching-word count;

S3, where, according to actual conditions, the S2 coarse classification yields extremely few text lines for some labels, performing a small amount of data augmentation on those text lines;

S4, fastText-based model training and short text classification; the short texts from S3 are converted according to the input format read by the fastText algorithm; a confidence threshold is set and recorded as THRESHOLD; TEXT is traversed, and each TEXT is marked according to the comparison between its confidence and THRESHOLD.

2. The intelligent marking method for a classified and graded list according to claim 1, wherein, in step S4, the fastText-based model uses N-gram features and hierarchical softmax.