CN113705728B - Classification and classification list intelligent marking method

Info

Publication number: CN113705728B
Authority: CN (China)
Prior art keywords: text, match, label, max, field
Legal status: Active
Application number: CN202111102610.9A
Other languages: Chinese (zh)
Other versions: CN113705728A (en)
Inventors: 卢红波, 张林成
Current Assignee: Quanzhi Technology Hangzhou Co ltd
Original Assignee: Quanzhi Technology Hangzhou Co ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2023-08-01
Note: the legal status, the listed assignees, and the priority date are assumptions and not legal conclusions. Google has not performed a legal analysis and makes no representation as to their accuracy.

Events
Application filed by Quanzhi Technology Hangzhou Co ltd
Priority to CN202111102610.9A
Publication of CN113705728A
Application granted
Publication of CN113705728B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an intelligent marking method for classified and graded lists. It relates to the technical field of communication and addresses the technical problem that, at large data volumes, marking a data list is time-consuming and labor-intensive and the marking quality is low. In the technical scheme, table and field information is read from different databases and unified into a common text-line format; preprocessing, Chinese and English word segmentation, and English-to-Chinese translation are then carried out. Because the content of the text lines is diverse and complex, the text lines are first coarsely classified to obtain text lines with preliminary labels, and a fastText model is then trained on them and used for short text classification. fastText is fast and accurate on short text classification; training and testing on the text lines yields reasonable marking results, so the marking is done intelligently and saves time and labor.

Description

Classification and classification list intelligent marking method
Technical Field
The invention relates to the technical field of communication, in particular to an intelligent marking method for a classified and graded list.
Background
Short text classification is widely used in scenarios such as public opinion classification and news classification. In the field of data security, however, the classification and grading of data lists runs into a major bottleneck in practice. A data list generally consists of databases with different storage modes and of tables and field information with different naming conventions, and its size ranges from tens of thousands to millions of entries. Marking table and field information at this scale is difficult: it places demands on both the number and the expertise of the marking personnel, it takes days to months, and the quality of the marking becomes a significant concern.
Disclosure of Invention
The invention aims to provide an intelligent marking method for classified and graded lists. The method classifies and grades a list by treating its entries as short texts: it extracts keywords from the short texts, obtains vectorized short texts, clusters them, and classifies them. The marking is intelligent and reasonable, saves time and labor, and improves marking quality.
In order to achieve the above object, the present invention provides the following technical solutions: an intelligent marking method for a classified and graded list comprises the following steps:
S1. Read the table and field information of different databases and process it into text lines with a uniform format; each text line contains the field name, field comment, table name, and table comment. Read all the labels at the same time. Preprocess each text line, including removing stop words and punctuation marks. Then perform Chinese and English word segmentation on the text lines and Chinese word segmentation on the labels, and translate the segmented English words into the corresponding Chinese through an English-to-Chinese lexicon. This yields the segmented text lines, recorded as TEXT. According to its content, each text line is divided into field information and table information, recorded as TEXT_FIELD and TEXT_TABLE respectively. According to whether they refer to a specific field, the segmented labels are divided into labels and background labels, recorded as LABEL and LABEL_BG respectively. The number of labels is denoted CLASS_NUM. At this point the text lines and the labels are segmented.
S2. Text-label matching. Traverse TEXT; for the TEXT_FIELD of each TEXT, traverse LABEL and record the number of words in the field information that match each label, giving a list of length CLASS_NUM. Record the maximum value in the list as MATCH_MAX. Two cases arise, depending on whether MATCH_MAX is unique:
S2.1. MATCH_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX.
S2.2. MATCH_MAX is not unique: obtain all labels whose match count equals MATCH_MAX, recorded as MATCH_MAX_LABEL, and their number, recorded as MATCH_MAX_LABEL_NUM. Traverse the TEXT_TABLE of the TEXT and record the number of words in the table information that match the background label of each of these labels, giving a list of length MATCH_MAX_LABEL_NUM. Record the maximum value in the list as MATCH_TABLE_MAX. Again two cases arise, depending on whether MATCH_TABLE_MAX is unique:
S2.2.1. MATCH_TABLE_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX.
S2.2.2. MATCH_TABLE_MAX is not unique: obtain all labels from S2.2 whose match count equals MATCH_TABLE_MAX, recorded as MATCH_TABLE_MAX_LABEL. Traverse the TEXT_FIELD of the TEXT and, for each label in MATCH_TABLE_MAX_LABEL, record the number of words matched with the field and the proportion of that number to the total number of words in the label, recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively. Record MATCH_CHAR + MATCH_CHAR_RATIO as the matching value MATCH_VALUE, giving a list of matching values, and select the label with the highest matching value as the coarse-classification label. In this way, the coarse classification of most text lines is completed according to the priority order of field-information match count, table-information match count, and matching value.
S3. Depending on the actual situation, where the coarse classification in S2 leaves some labels with very few text lines, perform a small amount of data augmentation on those text lines.
S4. fastText-based model training and short text classification. Convert the text lines from S3 into the input format read by the fastText algorithm, for example by converting the labels to indices. Set a confidence threshold, recorded as THRESHOLD; for example, with THRESHOLD = 0.9, a text line is marked with the corresponding label when its confidence exceeds 0.9. Traverse TEXT and mark each TEXT by comparing its confidence result with THRESHOLD.
In step S4, the fastText-based model uses N-gram features and hierarchical softmax.
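Step S1 can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the stop-word set, the English-to-Chinese lexicon, and the helper names are hypothetical, and a regular-expression split stands in for a real Chinese and English word segmenter (a run of Chinese characters with no separator stays one token). The stop-word filter is applied to the tokens after splitting, which simplifies the order given in S1.

```python
import re
from typing import Dict, List

# Hypothetical resources; a real system would load much larger ones.
STOP_WORDS = {"the", "of", "的", "表"}
EN_TO_ZH = {"user": "用户", "name": "姓名", "info": "信息"}


def segment(text: str) -> List[str]:
    """Split on punctuation, whitespace and underscores (stand-in segmenter)."""
    return [tok for tok in re.split(r"[\W_]+", text.lower()) if tok]


def preprocess(text: str) -> List[str]:
    """Drop punctuation and stop words, then translate English tokens."""
    tokens = [tok for tok in segment(text) if tok not in STOP_WORDS]
    return [EN_TO_ZH.get(tok, tok) for tok in tokens]


def build_text_line(field_name: str, field_comment: str,
                    table_name: str, table_comment: str) -> Dict[str, List[str]]:
    """Build one TEXT entry split into TEXT_FIELD and TEXT_TABLE."""
    return {
        "TEXT_FIELD": preprocess(field_name) + preprocess(field_comment),
        "TEXT_TABLE": preprocess(table_name) + preprocess(table_comment),
    }


if __name__ == "__main__":
    print(build_text_line("user_name", "the 用户, 姓名", "user_info", "用户 信息 表"))
```

Keeping TEXT_FIELD and TEXT_TABLE as separate word lists matters for the later steps, because S2 matches them against different label sets (LABEL and LABEL_BG).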
The invention describes an intelligent marking method for classified and graded lists. The text of such a list usually consists of tables and fields that follow different rules and naming conventions, so matching by rule strategies gives a low marking rate and a high misjudgment rate; and because the text quality is usually poor, clustering results are also unsatisfactory. The invention therefore uses text-label matching as the standard method for coarse classification. Text lines are coarsely classified in the following priority order: number of words matched in the field information > number of words matched in the table information > field matching value (MATCH_VALUE). The coarse classification result is then used for fastText training and testing, which greatly improves the marking rate over the prior art and significantly reduces the misjudgment rate. The matching words used in coarse classification are determined by testing, that is, empirically. Some labels are overly long or contain redundant content, so they often contain words from several other labels at the same time, which hinders normal coarse classification; taking the weighted sum of the number of words matched by the field and the proportion of matched words to the total number of words in the label as the classification criterion removes this misjudgment bottleneck. The marking is thus done intelligently, saving time and labor.
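The three-level priority described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, labels are represented as dictionaries from a label name to its segmented words, matching is exact word equality, and a text line with no matching word at all is left unlabeled. The text describes MATCH_CHAR as a word count, although the identifier suggests the original may count characters; the sketch counts words.

```python
from typing import Dict, List, Optional


def count_matches(words: List[str], label_words: List[str]) -> int:
    """Number of words in `words` that also occur in `label_words`."""
    label_set = set(label_words)
    return sum(1 for w in words if w in label_set)


def argmax_all(scores: Dict[str, int]) -> List[str]:
    """All keys sharing the maximum score (more than one key means a tie)."""
    best = max(scores.values())
    return [name for name, score in scores.items() if score == best]


def match_value(field_words: List[str], label_words: List[str]) -> float:
    """MATCH_VALUE = MATCH_CHAR + MATCH_CHAR_RATIO (word-level assumption)."""
    field_set = set(field_words)
    match_char = sum(1 for w in label_words if w in field_set)
    match_char_ratio = match_char / len(label_words) if label_words else 0.0
    return match_char + match_char_ratio


def coarse_classify(text_field: List[str], text_table: List[str],
                    label: Dict[str, List[str]],
                    label_bg: Dict[str, List[str]]) -> Optional[str]:
    # S2: field words against every label -> CLASS_NUM scores.
    field_scores = {n: count_matches(text_field, w) for n, w in label.items()}
    match_max_label = argmax_all(field_scores)
    if field_scores[match_max_label[0]] == 0:
        return None  # assumption: no match leaves the line unlabeled
    if len(match_max_label) == 1:  # S2.1: MATCH_MAX is unique
        return match_max_label[0]
    # S2.2: table words against the background labels of the tied candidates.
    table_scores = {n: count_matches(text_table, label_bg.get(n, []))
                    for n in match_max_label}
    match_table_max_label = argmax_all(table_scores)
    if len(match_table_max_label) == 1:  # S2.2.1: MATCH_TABLE_MAX is unique
        return match_table_max_label[0]
    # S2.2.2: break the remaining tie with the matching value.
    return max(match_table_max_label,
               key=lambda n: match_value(text_field, label[n]))


if __name__ == "__main__":
    LABEL = {"contact": ["联系", "电话", "手机", "号码"], "id": ["证件", "号码"]}
    LABEL_BG = {"contact": ["联系"], "id": ["身份"]}
    print(coarse_classify(["号码"], ["订单"], LABEL, LABEL_BG))
```

Because MATCH_CHAR_RATIO lies between 0 and 1, the sum ranks candidates by match count first and by ratio second, so among labels with the same count the shorter, less redundant label wins. In the usage example both labels match one field word, and the two-word label outranks the four-word label (1.5 against 1.25).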
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them.
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.
The invention provides an intelligent marking method for a classified and graded list, as shown in FIG. 1:
1. First, read the table and field information of different databases and process it into text lines with a uniform format; each text line contains four parts: the field name, field comment, table name, and table comment. Read all the labels at the same time. Preprocess each text line, including removing stop words and punctuation marks. Then perform Chinese and English word segmentation on the text lines and Chinese word segmentation on the labels, and translate the segmented English words into the corresponding Chinese through an English-to-Chinese lexicon. This yields the segmented text lines, recorded as TEXT. According to its content, each text line is divided into field information and table information, recorded as TEXT_FIELD and TEXT_TABLE respectively. According to whether they refer to a specific field, the segmented labels are divided into labels and background labels, recorded as LABEL and LABEL_BG respectively. The number of labels is denoted CLASS_NUM. At this point the text lines and the labels are segmented.
2. Second, coarse text classification by text-label matching. Traverse TEXT; for the TEXT_FIELD of each TEXT, traverse LABEL and record the number of words in the field information that match each label, giving a list of length CLASS_NUM. Record the maximum value in the list as MATCH_MAX. Two cases arise, depending on whether MATCH_MAX is unique:
2.1. MATCH_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX.
2.2. MATCH_MAX is not unique: obtain all labels whose match count equals MATCH_MAX, recorded as MATCH_MAX_LABEL, and their number, recorded as MATCH_MAX_LABEL_NUM. Traverse the TEXT_TABLE of the TEXT and record the number of words in the table information that match the background label of each of these labels, giving a list of length MATCH_MAX_LABEL_NUM. Record the maximum value in the list as MATCH_TABLE_MAX. Again two cases arise, depending on whether MATCH_TABLE_MAX is unique:
2.2.1. MATCH_TABLE_MAX is unique: the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX.
2.2.2. MATCH_TABLE_MAX is not unique: obtain all labels from 2.2 whose match count equals MATCH_TABLE_MAX, recorded as MATCH_TABLE_MAX_LABEL. For each label in MATCH_TABLE_MAX_LABEL, record the number of words matched with the field and the proportion of that number to the total number of words in the label, recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively. Record MATCH_CHAR + MATCH_CHAR_RATIO as the matching value MATCH_VALUE, giving a list of matching values, and select the label with the highest matching value as the coarse-classification label. In this way, the coarse classification of most text lines is completed according to the priority order of field-information match count, table-information match count, and matching value.
3. Depending on the actual situation, where the coarse classification leaves some labels with very few text lines, a small amount of data augmentation can be performed on those text lines.
4. Third, fastText-based model training and short text classification. fastText is a fast and accurate model for text classification. Its structure is simple, and its implementation uses N-grams and hierarchical softmax, the former improving accuracy and the latter improving speed. With accuracy comparable to deep learning models, its training and testing are several orders of magnitude faster. Convert the text lines from step 3 into the input format read by the fastText algorithm, for example by converting the labels to indices. Set a confidence threshold, recorded as THRESHOLD; for example, with THRESHOLD = 0.9, a text line is marked with the corresponding label when its confidence exceeds 0.9. Traverse TEXT and mark each TEXT by comparing its confidence result with THRESHOLD.
5. At this point, the invention has completed the whole process from reading the table and field information to marking, which solves the technical problem that marking a classified and graded list is time-consuming and inefficient.
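The format conversion and thresholding in the fastText step can be sketched as below. The helper names are hypothetical. The `__label__` prefix is fastText's default label marker, and `train_and_mark` uses the `train_supervised` and `predict` calls of the `fasttext` Python package with `wordNgrams=2` and `loss="hs"` (hierarchical softmax); those hyperparameter values are illustrative assumptions, not values given in the patent. The package is imported lazily, so the two pure helpers run without it.

```python
from typing import List, Optional, Sequence, Tuple


def to_fasttext_line(label_index: int, words: Sequence[str]) -> str:
    """One training line: '__label__<index>' followed by the segmented words."""
    return f"__label__{label_index} " + " ".join(words)


def mark(prediction: Tuple[Sequence[str], Sequence[float]],
         threshold: float = 0.9) -> Optional[int]:
    """Return the label index when the top confidence exceeds THRESHOLD."""
    labels, probs = prediction
    if len(labels) == 0 or probs[0] <= threshold:
        return None  # confidence too low: leave the text line unmarked
    return int(labels[0].replace("__label__", ""))


def train_and_mark(train_path: str, texts: List[List[str]],
                   threshold: float = 0.9) -> List[Optional[int]]:
    """Train on a file of to_fasttext_line() rows, then mark every TEXT."""
    import fasttext  # third-party package, imported only when training

    model = fasttext.train_supervised(input=train_path, wordNgrams=2, loss="hs")
    return [mark(model.predict(" ".join(words)), threshold) for words in texts]


if __name__ == "__main__":
    print(to_fasttext_line(3, ["用户", "姓名"]))
    print(mark((("__label__3",), [0.95])))
```

Text lines whose confidence does not exceed the threshold come back as `None`; how such lines are handled afterwards (for example, manual review) is not specified in the description.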
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that modifications may be made to the described embodiments in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive of the scope of the invention, which is defined by the appended claims.

Claims (2)

1. An intelligent marking method for classified and graded lists, characterized by comprising the following steps:
S1. reading the table and field information of different databases and processing it into text lines with a uniform format, wherein each text line contains the field name, field comment, table name, and table comment; reading all the labels at the same time; preprocessing each text line, including removing stop words and punctuation marks; then performing Chinese and English word segmentation on the text lines and Chinese word segmentation on the labels; translating the segmented English words into the corresponding Chinese through an English-to-Chinese lexicon; thereby obtaining the segmented text lines, recorded as TEXT; dividing each text line, according to its content, into field information and table information, recorded as TEXT_FIELD and TEXT_TABLE respectively; dividing the segmented labels, according to whether they refer to a specific field, into labels and background labels, recorded as LABEL and LABEL_BG respectively; and denoting the number of labels as CLASS_NUM; at this point the text lines and the labels are segmented;
S2. text-label matching: traversing TEXT; for the TEXT_FIELD of each TEXT, traversing LABEL and recording the number of words in the field information that match each label, to obtain a list of length CLASS_NUM; recording the maximum value in the list as MATCH_MAX; two cases arise, depending on whether MATCH_MAX is unique:
S2.1. MATCH_MAX is unique, and the coarse-classification label of the TEXT is the label corresponding to MATCH_MAX;
S2.2. MATCH_MAX is not unique: obtaining all labels whose match count equals MATCH_MAX, recorded as MATCH_MAX_LABEL, and their number, recorded as MATCH_MAX_LABEL_NUM; traversing the TEXT_TABLE of the TEXT and recording the number of words in the table information that match the background labels, to obtain a list of length MATCH_MAX_LABEL_NUM; recording the maximum value in the list as MATCH_TABLE_MAX; again two cases arise, depending on whether MATCH_TABLE_MAX is unique:
S2.2.1. MATCH_TABLE_MAX is unique, and the coarse-classification label of the TEXT is the label corresponding to MATCH_TABLE_MAX;
S2.2.2. MATCH_TABLE_MAX is not unique: obtaining all labels from S2.2 whose match count equals MATCH_TABLE_MAX, recorded as MATCH_TABLE_MAX_LABEL; traversing the TEXT_FIELD of the TEXT and the labels of MATCH_TABLE_MAX_LABEL, and recording the number of words matched with the field and the proportion of the number of matched words to the total number of words in the label, recorded as MATCH_CHAR and MATCH_CHAR_RATIO respectively; recording MATCH_CHAR + MATCH_CHAR_RATIO as the matching value MATCH_VALUE, to obtain a list of matching values; and selecting the label corresponding to the highest matching value as the coarse-classification label; in this way, the coarse classification of most text lines is completed according to the priority order of field-information match count, table-information match count, and matching value;
S3. depending on the actual situation, where the coarse classification in S2 leaves some labels with very few text lines, performing a small amount of data augmentation on those text lines;
S4. fastText-based model training and short text classification: converting the text lines from S3 into the input format read by the fastText algorithm; setting a confidence threshold, recorded as THRESHOLD; and traversing TEXT and marking each TEXT by comparing its confidence result with THRESHOLD.
2. The intelligent marking method for classified and graded lists according to claim 1, wherein in step S4 the fastText-based model uses N-gram features and hierarchical softmax.
CN202111102610.9A 2021-09-18 2021-09-18 Classification and classification list intelligent marking method Active CN113705728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111102610.9A CN113705728B (en) 2021-09-18 2021-09-18 Classification and classification list intelligent marking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111102610.9A CN113705728B (en) 2021-09-18 2021-09-18 Classification and classification list intelligent marking method

Publications (2)

Publication Number Publication Date
CN113705728A (en) 2021-11-26
CN113705728B (en) 2023-08-01

Family

ID=78661364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111102610.9A Active CN113705728B (en) 2021-09-18 2021-09-18 Classification and classification list intelligent marking method

Country Status (1)

Country Link
CN (1) CN113705728B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522298A (en) * 2018-08-29 2019-03-26 云南电网有限责任公司信息中心 Data cleaning method for CIM
CN109766410A (en) * 2019-01-07 2019-05-17 东华大学 A kind of newsletter archive automatic classification system based on fastText algorithm
CN110059181A (en) * 2019-03-18 2019-07-26 中国科学院自动化研究所 Short text stamp methods, system, device towards extensive classification system
WO2020215457A1 (en) * 2019-04-26 2020-10-29 网宿科技股份有限公司 Adversarial learning-based text annotation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7783637B2 (en) * 2003-09-30 2010-08-24 Microsoft Corporation Label system-translation of text and multi-language support at runtime and design

Also Published As

Publication number Publication date
CN113705728A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN107766371B (en) Text information classification method and device
CN109933796B (en) Method and device for extracting key information of bulletin text
CN112131920A (en) Data structure generation for table information in scanned images
CN112732934A (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN111209728B (en) Automatic labeling and inputting method for test questions
CN111061882A (en) Knowledge graph construction method
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN115034218A (en) Chinese grammar error diagnosis method based on multi-stage training and editing level voting
CN114090736A (en) Enterprise industry identification system and method based on text similarity
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN114896369A (en) Fault recording file channel name identification method based on incremental learning optimization
CN109446522B (en) Automatic test question classification system and method
CN111178080A (en) Named entity identification method and system based on structured information
Lehenmeier et al. Layout detection and table recognition–recent challenges in digitizing historical documents and handwritten tabular data
CN113705728B (en) Classification and classification list intelligent marking method
CN117034948A (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN112395858A (en) Multi-knowledge point marking method and system fusing test question data and answer data
TW200409046A (en) Optical character recognition device, document searching system, and document searching program
CN109543038A (en) A kind of sentiment analysis method applied to text data
CN111400606B (en) Multi-label classification method based on global and local information extraction
CN114356924A (en) Method and apparatus for extracting data from structured documents
CN114564942A (en) Text error correction method, storage medium and device for supervision field
CN112632282A (en) Chinese and English thesis data classification and query method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant