CN113822026A - Multi-label entity labeling method - Google Patents

Multi-label entity labeling method

Info

Publication number
CN113822026A
CN113822026A
Authority
CN
China
Prior art keywords
entity
label
labeling
model
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111062720.7A
Other languages
Chinese (zh)
Other versions
CN113822026B (en)
Inventor
张传锋
朱锦雷
井焜
张琨
潘玲玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd filed Critical Synthesis Electronic Technology Co Ltd
Priority to CN202111062720.7A priority Critical patent/CN113822026B/en
Publication of CN113822026A publication Critical patent/CN113822026A/en
Application granted granted Critical
Publication of CN113822026B publication Critical patent/CN113822026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/169 Annotation, e.g. comment data or footnotes (G Physics > G06 Computing > G06F Electric digital data processing > G06F40/00 Handling natural language data > G06F40/10 Text processing > G06F40/166 Editing)
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors (G06F16/00 Information retrieval; Database structures > G06F16/20 Structured data > G06F16/21 Design, administration or maintenance of databases)
    • G06F16/288 Entity relationship models (G06F16/28 Databases characterised by their database models > G06F16/284 Relational databases)
    • G06F16/31 Indexing; Data structures therefor; Storage structures (G06F16/30 Unstructured textual data)
    • G06F16/355 Class or cluster creation or modification (G06F16/35 Clustering; Classification)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D Climate change mitigation technologies in information and communication technologies)

Abstract

The invention provides a multi-label entity labeling method comprising three parts: a multi-label entity labeling workflow, a multi-label entity prediction model based on deep learning, and a feedback model-optimization framework based on online error correction. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.

Description

Multi-label entity labeling method
Technical Field
The invention relates to the field of natural language processing, in particular to data annotation, and specifically to a multi-label entity labeling method.
Background
Entity labeling is one of the key links in structuring unstructured data: core entities are extracted from unstructured text and stored to form structured knowledge. China's economy is vast, tax-paying subjects and scenarios are diverse, and national and local tax regulations are updated frequently, so a scheme that can read tax law automatically in place of experts is urgently needed. Existing entity labeling methods rely heavily on manual work, assign each entity only one label, and assume that entity contents do not overlap; yet tax-field regulations are voluminous and overlapping entity content is pervasive, so a method capable of multi-label labeling of tax texts is needed.
The patent "Named entity labeling method for military corpora" (publication No. CN111428502A) performs ensemble learning over LSTM, Lattice-LSTM and BERT models via the XGBoost algorithm, obtaining military-field named entities through model prediction plus manual confirmation. The patent "A closed-loop entity extraction method based on automatic sample labeling" (publication No. CN111125378A) proposes a closed-loop workflow through which entity extraction can reduce the difficulty of manual labeling. The patent "Model-assisted data annotation system and annotation method" (publication No. CN110880021A) provides a model-assisted annotation method based on computer-vision recognition, applied mainly to intelligent image annotation to improve annotator efficiency. The patent "Multi-label entity-relationship joint extraction method based on deep neural network and labeling strategy" (publication No. CN109543183A) proposes extracting multi-label relationships with a GRU network, pulling related entity pairs directly from text end to end, but it omits the many isolated entities that have no relationships.
Disclosure of Invention
In order to extract structured information from text and solve the problem of content overlap among entities, the invention provides a multi-label entity labeling method that reduces the burden of manual labeling and improves the efficiency of entity labeling.
To solve this technical problem, the technical scheme adopted by the invention is as follows: a multi-label entity labeling method comprising the following steps:
s01), acquiring text content, and constructing a database based on the text content;
s02), defining, based on the text content, the N most frequently used and most valuable entity categories in the field, where N is a positive integer, and constructing corresponding labels for the N defined entities;
s03), cleaning the text content and performing association sorting, the association sorting being an ordering by time;
s04), passing the cleaned text as a parameter to a deep-learning-based multi-label entity prediction model, which automatically labels the character sequence; this is called pre-labeling. Unlike a single-label entity labeling model, the multi-label entity prediction model can assign more than one entity label to the same character string, pays no attention to entity relationships, and labels an entity even when it has no relationship to any other entity;
s05), manually reviewing the pre-labeling result within a feedback model-optimization framework based on online error correction, and scoring the generalization ability of the multi-label entity prediction model by the difference between the manual-review result and the model's prediction, thereby optimizing the model;
s06), carrying out post-processing on the original labeling result after manual review, thereby extracting entity information and storing the entity information in a database.
Further, the pre-labeling process of the multi-label entity prediction model is as follows:
s41), pre-coding the character-sequence segments to be labeled with a pre-coding module. The pre-coding module comprises an embedding layer and a pre-trained encoder: the embedding layer is the sum of three embedding vectors (token embedding, token-position embedding and token-segment embedding), and the pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model. Specifically, after the complete character sequence is split into segments, the encoding vectors of the previous segment's characters are added to the corresponding vectors of the current segment, and the resulting vector sequence is encoded by the BERT encoder. The calculation formula is:
h(i, j) = BERT( e(w(i, j)) + h(i-1, j) )
(1),
where w denotes a token, e(·) its embedding-layer vector, i the index of the i-th segment, j the j-th token in the current segment (the 0th token of each segment is a special character), and h the pre-coding vector output by the BERT pre-trained language model;
s42), after pre-encoding, a text sequence of length L is encoded as a second-order tensor of shape [h0, L], where each column is the context feature vector of one character and h0 is the length of each character's feature vector. Since text content is not a simple linear concatenation of characters, semantic units are taken as nodes and grammatical dependency relations as edges connecting them, forming a generation-relationship graph over all tokens. A graph convolutional network is built on this graph, with the number of convolution kernels set to the number of label categories; it applies a nonlinear transformation to the second-order tensor output by the pre-coding module, yielding K different feature maps, each a matrix, where K is the number of label categories.
Through this step, the pre-coding feature of shape [h0, L] is converted into a third-order tensor of shape [K, H, L], where H and L are the hidden-layer dimension and the sequence length respectively;
s43), processing all feature maps one by one with max pooling: the dimension-compressing action of the pooling layer extracts each character's maximum correlation score in each map, giving a classification feature matrix M of shape [K, L] in which each element is the correlation score between one character and one label, L being the sequence length;
s44), normalizing the matrix M element-wise with a sigmoid function and scaling each element to 0 or 1: a value below 0.5 maps to 0, otherwise to 1. Through these operations the text sequence is converted into a sparse classification matrix. Each column of this matrix is the model's labeling result for the token at the corresponding position; for each token, the nonzero positions in its column are taken out and mapped to the corresponding labels, which constitute the model's automatic labeling of that token.
Further, the process of optimizing the multi-label entity prediction model based on the online error correction feedback model optimization framework is as follows:
s51), deploying the pre-trained multi-label entity prediction model as an online service; given a cleaned document to be labeled, the model labels it automatically, yielding the entities it contains;
s52), the online service returns the model's prediction to the local side, where it is played through local human-computer-interaction voice equipment; after playback, the server and the user hold an online question-and-answer session in which the server asks how satisfied the user is with the automatic labeling, prompts the user to identify wrong labeling results, and then has the user give the correct answers;
s53), the real-time interaction log between server and user is sent to the back end, which performs sentiment analysis on the user's feedback text and generates a score table for the entity labeling results; an empirically chosen threshold is applied, labeled entities scoring below it are discarded and those above it retained, the correct answers identified by the user are added to a small-sample training library, and a final correct answer is produced;
s54), retraining the multi-label entity prediction model on the newly labeled small-sample corpus and updating the automatic labeling model's weight parameters; the manually reviewed labeling result is post-processed, formatted as required by the task-oriented interaction process, and uploaded to a cloud database where it is stored as structured knowledge.
Further, in step S02 two labels are defined for each entity, B-entity and I-entity, where B marks the starting position of the entity and I its interior; a character that belongs to no entity category is labeled O.
Further, in step S03 cleaning means deleting low-relevance content and all symbols other than delimiters, while association sorting means gathering the multiple revisions of the same topic, arranging them in order of promulgation time, and, once entity-label extraction is complete, replacing old entity values with new ones.
Further, the multiple revisions of the same topic are gathered as follows: all documents are arranged front to back by release time, and in two nested loops the name of each document is fuzzily matched against the contents of every other document, yielding the later documents that cite and revise it.
Furthermore, before the multi-label entity prediction model is used for prediction, it is pre-trained on a small amount of cold-start labeling data, which approximately solves for the model's weight parameters.
Furthermore, the method is suitable for extracting the structured information of the tax text.
The invention has the following beneficial effects. It provides a multi-label entity labeling method comprising a multi-label entity labeling workflow, a deep-learning-based multi-label entity prediction model, and an online-error-correction-based feedback model optimization framework. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.
Drawings
FIG. 1 is a flow chart of the process described in example 1;
FIG. 2 is a flowchart of the operation of a multi-label entity prediction model;
FIG. 3 is a flow chart of the operation of the feedback model optimization framework.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Example 1
The embodiment discloses a multi-label entity labeling method, specifically for labeling tax texts, as shown in FIG. 1, comprising the following steps:
S01), the State Taxation Administration, the Ministry of Finance and other related departments frequently issue new regulations and policies, and a periodic crawler interface keeps the cloud database of tax regulations and policies up to date. A small amount of data is taken from the database and stored locally for later manual labeling, serving as cold-start training data for the model.
S02), tax regulation touches every aspect of the national economy, and the information of interest differs from scenario to scenario. To ensure that the defined entities are general across scenarios, the invention, after reading and analyzing more than one hundred tax texts, summarizes the 14 most frequently used and most valuable entity fields in the tax domain: taxpayer, tax-amount calculation method, taxed content, preferential condition, preferential tax amount, preferential tax ratio, implementation date, expiration date, tax payment place, tax payment voucher, tax payment period, tax payment subject, tax rate, and tax type.
Corresponding labels are constructed from the 14 defined tax entity categories. Each category yields two tags, B-entity and I-entity, e.g. "taxpayer" yields "B-taxpayer" and "I-taxpayer", where B marks the beginning of an entity and I its interior. Finally, a character that belongs to no entity category is labeled "O". This gives 29 labels in total, and each character may carry more than one of them.
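The label-set construction above can be sketched as follows. This is a minimal illustration, assuming English stand-ins for the patent's 14 Chinese category names; only the count (14 × 2 + 1 = 29) and the B/I/O scheme come from the text.

```python
# Sketch of the label-set construction in S02: each of the 14 entity
# categories yields a B- tag and an I- tag, plus the shared outside tag
# "O", giving 14 * 2 + 1 = 29 labels. The English category names below
# are illustrative stand-ins, not the patent's original Chinese terms.
CATEGORIES = [
    "taxpayer", "tax-amount-calculation", "taxed-content",
    "preferential-condition", "preferential-amount", "preferential-ratio",
    "implementation-date", "expiration-date", "payment-place",
    "payment-voucher", "payment-period", "payment-subject",
    "tax-rate", "tax-type",
]

def build_label_set(categories):
    labels = ["O"]                      # outside tag for non-entity characters
    for cat in categories:
        labels.append(f"B-{cat}")       # B marks the entity's first character
        labels.append(f"I-{cat}")       # I marks interior characters
    return labels

LABELS = build_label_set(CATEGORIES)
```

With 14 categories this yields exactly the 29 labels the embodiment describes.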
S03), the obtained tax texts are cleaned: tables, pictures, links and other low-relevance content are removed, together with all symbols other than delimiters such as commas, periods and exclamation marks.
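A minimal sketch of this cleaning step follows. The patent only names the delimiters to keep and the content to drop, so the exact character classes and the URL pattern are assumptions for illustration.

```python
import re

# Sketch of the cleaning step S03: strip link residue, then drop every
# symbol other than word characters, whitespace, and sentence delimiters
# (comma, period, exclamation mark, in ASCII and fullwidth forms).
# The precise character classes are an assumption; the patent names only
# the kept delimiters and the removed content (tables, pictures, links).
KEEP_DELIMS = ",.!，。！"

def clean_text(text):
    text = re.sub(r"https?://\S+", "", text)              # remove link residue
    pattern = rf"[^\w\s{re.escape(KEEP_DELIMS)}]"         # anything not kept
    return re.sub(pattern, "", text)
```

Since Python's `\w` matches Unicode word characters, Chinese text passes through unchanged while stray punctuation is removed.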
To overcome the content contradictions caused by repeated revision of laws and regulations on the same subject, the method also performs one pass of association sorting over all tax texts. Specifically, the multiple revisions of the same subject are gathered, arranged in order of promulgation time, and, after entity labeling and extraction, old entity values are replaced with new ones. Taking the Individual Income Tax Law as an example, the version promulgated in 2011 sets a 0 tax rate on the portion of personal income below 3500, while the 2018 revision sets the 0 rate on the portion below 5000; the invention stores the latest rate as the standard.
In this embodiment, the multiple revisions of the same subject are gathered as follows: all documents are arranged front to back by release time, and in two nested loops the name of each document is fuzzily matched against the contents of every other document, yielding the later documents that cite and revise it.
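The two-loop association step above can be sketched as follows. Plain substring matching stands in for the unspecified fuzzy-matching routine, and the titles and dates are made up for illustration.

```python
# Sketch of the association-sorting step: documents are ordered by release
# date, then each document's title is matched against the body of every
# later document; a hit marks the later document as one that cites and
# revises the earlier one. Substring matching stands in for the patent's
# unspecified fuzzy matching.
def find_revisions(docs):
    """docs: list of (date, title, body) tuples, with sortable dates."""
    docs = sorted(docs, key=lambda d: d[0])      # front to back by release time
    revisions = {}
    for i, (_, title, _) in enumerate(docs):     # outer loop: each document
        for j in range(i + 1, len(docs)):        # inner loop: later documents
            if title in docs[j][2]:              # later body cites this title
                revisions.setdefault(title, []).append(docs[j][1])
    return revisions

docs = [
    ("2011-06", "Individual Income Tax Law", "income below 3500 taxed at 0"),
    ("2018-08", "Amendment", "revises the Individual Income Tax Law: "
                             "income below 5000 taxed at 0"),
]
```

On this toy corpus the 2018 document is identified as a revision of the 2011 law, so its entity values supersede the older ones.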
S04), the cleaned text is passed as a parameter to the deep-learning-based multi-label entity prediction model, whose main function is to label character sequences automatically; since the machine-predicted labels are manually reviewed later, this stage is called the pre-labeling stage. In addition, before the model is used for prediction it must be pre-trained on a small amount of cold-start labeling data, which approximately solves for its weight parameters.
The multi-label entity prediction model of this embodiment can assign several entity labels to the same character string simultaneously. For example, the tax subject "about personal income …" contains the entity "personal income tax", which is a tax type, so the field "personal income tax" needs two labels, "I-tax subject" and "I-tax type", indicating that it is both part of the tax subject and a tax-type entity. Moreover, unlike entity-relationship joint extraction models, the model here pays no attention to entity relationships: joint extraction models cannot extract isolated entities, whereas this model labels an entity even when it has no relationship to any other entity.
S05), the feedback type model optimization framework based on online error correction carries out manual review on the result of the pre-labeling, and the generalization capability of the multi-label entity prediction model is scored according to the difference between the result of the manual review and the prediction result of the multi-label entity prediction model, so that the multi-label entity prediction model is optimized.
S06), carrying out post-processing on the original labeling result after manual review, thereby extracting entity information and storing the entity information in a database.
As shown in FIG. 2, the pre-labeling process of the multi-label entity prediction model is as follows:
S41), the character-sequence segments to be labeled are pre-coded by a pre-coding module comprising an embedding layer and a pre-trained encoder. The model's input carries three kinds of information, token content, token position and token segment, so the embedding layer is the sum of the token embedding, position embedding and segment embedding vectors. The pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model: after the complete character sequence is split into segments, the encoding vectors of the previous segment's characters are added to the corresponding vectors of the current segment, and the resulting vector sequence is encoded by the BERT encoder. The calculation formula is:
h(i, j) = BERT( e(w(i, j)) + h(i-1, j) )
(1),
where w denotes a token, e(·) its embedding-layer vector, i the i-th segment, j the j-th token in the current segment (the 0th token of each segment is a special character), and h the pre-coding vector output by the BERT pre-trained language model. By introducing this segment recurrence mechanism on top of BERT, the embodiment extends the length of contextual semantic dependency.
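The segment recurrence described above can be sketched with the BERT encoder replaced by a stub: each segment's input embeddings are summed with the previous segment's output vectors before encoding, so context flows across segment boundaries. Shapes, the segment length, and the `tanh` stand-in encoder are all assumptions for illustration.

```python
import numpy as np

# Sketch of the segment-recurrence mechanism of S41: the previous
# segment's encodings h(i-1, j) are added to the current segment's
# embeddings e(w(i, j)) before encoding. np.tanh stubs the BERT encoder.
def encode_with_recurrence(embeddings, seg_len, encoder):
    """embeddings: [T, h] array; returns [T, h] per-token encodings."""
    T, h = embeddings.shape
    outputs = []
    prev = np.zeros((seg_len, h))            # no context before the first segment
    for start in range(0, T, seg_len):
        seg = embeddings[start:start + seg_len]
        inp = seg + prev[:len(seg)]          # add previous segment's encodings
        out = encoder(inp)                   # encode the summed vectors
        outputs.append(out)
        prev = out
    return np.concatenate(outputs, axis=0)

enc = encode_with_recurrence(np.ones((8, 4)), seg_len=4, encoder=np.tanh)
```

Because the second segment receives the first segment's outputs, identical inputs produce different encodings in different segments, which is exactly the cross-segment dependency the mechanism is meant to add.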
S42), after pre-encoding, a text sequence of length L is encoded as a second-order tensor of shape [h0, L], where each column is the context feature vector of one character and h0 is the length of each character's feature vector. Text content, however, is not a simple linear concatenation of characters. For example, the sentence "Notice of the State Taxation Administration on Several Issues Concerning Business Tax" is not generated character by character in linear order; it has a nonlinear structure in which a noun object such as "notice" is formed first, then a modifying noun such as "business tax", and finally character segments such as "on several issues" are connected to yield the complete text.
In this step, semantic units are taken as nodes and grammatical dependency relations (subject-verb-object, attributive-adverbial-complement, and so on) as edges connecting them, forming a generation-relationship graph over all tokens that reflects the order in which the characters were produced as the text was written. A graph convolutional network is built on this graph, with the number of convolution kernels set to 29, the number of label categories; it applies a nonlinear transformation to the second-order tensor output by the pre-coding module, yielding 29 different feature maps, each a matrix.
Through this step, the pre-coding feature of shape [h0, L] is converted into a third-order tensor of shape [29, H, L], where H and L are the hidden-layer dimension and the sequence length respectively;
S43), all feature maps are processed one by one with max pooling: the dimension-compressing action of the pooling layer extracts each character's maximum correlation score in each map, giving a classification feature matrix M of shape [29, L] in which each element is the correlation score between one character and one label.
For a single sample, the feature-map family has shape [29, H, L]; the max-pooling layer converts it into a [29, L] classification feature matrix M.
S44), the matrix M is normalized element-wise with a sigmoid function and each element is scaled to 0 or 1: a value below 0.5 maps to 0, otherwise to 1. Through these operations the text sequence is converted into a sparse classification matrix whose elements are all 0 or 1. Each column of this matrix is the model's labeling result for the token at the corresponding position; for each token, the nonzero positions in its column are taken out and mapped to the corresponding labels, which constitute the model's automatic labeling of that token.
Each column of the matrix corresponds to one character: an element of value 0 adds the label "O" to that character's label list, and an element of value 1 adds the corresponding category label. Since each sparse vector may contain several 1s, several labels can be obtained for each character at once.
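Steps S43 and S44 together can be sketched as a small decoding routine. This is an assumed illustration, not the patent's implementation: the "O" bookkeeping is collapsed to a single outside label for all-zero columns, and the label names are invented.

```python
import numpy as np

# Sketch of S43-S44: the [K, H, L] feature-map family is max-pooled over
# the hidden dimension H into the [K, L] score matrix M; M is passed
# through a sigmoid, thresholded at 0.5, and each column's nonzero rows
# are mapped to that character's labels (K = 29 in the patent).
def decode_labels(feature_maps, label_names):
    M = feature_maps.max(axis=1)                  # [K, H, L] -> [K, L]
    probs = 1.0 / (1.0 + np.exp(-M))              # element-wise sigmoid
    binary = (probs >= 0.5).astype(int)           # scale each score to 0 or 1
    per_char = []
    for col in binary.T:                          # one column per character
        labels = [label_names[k] for k in np.flatnonzero(col)]
        per_char.append(labels or ["O"])          # all-zero column -> outside
    return per_char
```

A column with several 1s yields several labels for the same character, which is the multi-label behavior the embodiment describes.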
FIG. 3 shows the online-error-correction-based feedback model optimization framework provided by the invention. Its main aim is to combine manual review and model optimization organically within the text labeling process, so that a modest amount of human participation improves both.
Specifically, the process of optimizing the multi-label entity prediction model based on the online error correction feedback model optimization framework is as follows:
S51), before the model goes online, its randomly initialized parameters are pre-trained on a small amount of labeled data as an initial cold start. This labeling data must be produced directly by humans so that high-quality labeled samples are obtained.
The pre-trained multi-label entity prediction model is then deployed as an online service; given a cleaned document to be labeled, the model labels it automatically, yielding the entities the document contains.
S52), the online service returns the model's prediction to the local side, where it is played through local human-computer-interaction voice equipment. After playback, the server and the user hold an online question-and-answer session: the server asks how satisfied the user is with the automatic labeling, prompts the user to identify wrong labeling results, and then has the user give the correct answers.
S53), the real-time interaction log between server and user is sent to the back end, which performs sentiment analysis on the user's feedback text and generates a score table for the entity labeling results. An empirically chosen threshold is applied: labeled entities scoring below it are discarded and those above it retained; the correct answers identified by the user are added to a small-sample training library; and a final correct answer is produced.
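The threshold filtering in S53 can be sketched as below. The sentiment scorer itself is out of scope here and is stubbed by a caller-supplied score table; the threshold value and data shapes are assumptions.

```python
# Sketch of the S53 filtering step: each pre-labeled entity carries a score
# derived from sentiment analysis of the user's feedback; entities below an
# empirical threshold are discarded, those at or above it are kept, and the
# user's corrections are queued as new small-sample training data.
def filter_annotations(scored_entities, corrections, threshold=0.5):
    """scored_entities: dict entity -> score; corrections: user answers."""
    kept = {e: s for e, s in scored_entities.items() if s >= threshold}
    training_library = list(corrections)   # small-sample corpus for retraining
    return kept, training_library
```

The kept entities feed the final answer, while the training library drives the secondary training in S54.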
The newly labeled samples are added automatically to the labeling model's training sample library, and once enough new samples have accumulated the model automatically starts an incremental training run, realizing feedback updating of the model parameters.
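The accumulate-then-retrain behavior just described can be sketched as a small trigger. The batch size and the retraining callback are assumptions; the actual training routine is outside the scope of this sketch.

```python
# Sketch of the incremental-retraining trigger: confirmed samples
# accumulate in a buffer, and once the buffer reaches a configured size
# the (caller-supplied) retraining routine fires and the buffer drains.
class RetrainTrigger:
    def __init__(self, retrain, batch_size=100):
        self.retrain = retrain            # callback: retrain(list_of_samples)
        self.batch_size = batch_size
        self.buffer = []

    def add_sample(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) >= self.batch_size:
            self.retrain(list(self.buffer))   # secondary training on new corpus
            self.buffer.clear()
```

Each retraining run updates the labeling model's weight parameters, realizing the feedback loop described above.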
S54), the manually reviewed labeling result is post-processed, formatted as required by the task-oriented interaction process, and uploaded to a cloud database where it is stored as structured knowledge. This completes the labeling process.
The above is the main workflow and model architecture of the invention. The invention provides a tax-text-oriented multi-label entity labeling method comprising a multi-label entity labeling workflow, a deep-learning-based multi-label entity prediction model, and an online-error-correction-based feedback model optimization framework. Compared with existing entity labeling methods, the method can label and extract domain-specific information from tax texts; the automatic labeling model can assign multiple entity labels to the same character string; and the real-time feedback framework provides a feasible scheme for iterative evolution of the model, so that the model improves gradually with each interaction. The method therefore has significant practical value.
The application scenarios of the invention include, but are not limited to, artificial intelligence fields such as information extraction, document structuring, tax question answering, tax search, entity recognition, and tax dialogue. The detailed embodiments describe how multi-label entity labeling of tax texts is realized in related products. The flowcharts and model block diagrams in the embodiments are used only to explain the principles and processes of the present invention; schemes with similar principles and processes devised by those skilled in the relevant art with reference to this invention should also be considered. The description is provided only for the convenience of those skilled in the art and does not limit the invention; any implementation scheme similar in form to the present invention shall be included within its protection scope.

Claims (8)

1. A multi-label entity labeling method, characterized by comprising the following steps:
S01), acquiring text content and constructing a database based on the text content;
S02), defining, based on the text content, the N most frequently used and most valuable entities in the field, N being a positive integer, and constructing corresponding labels for the N defined entities;
S03), cleaning the text content and performing association sorting, the association sorting being an ordering according to time;
S04), passing the cleaned text as a parameter to a deep-learning-based multi-label entity prediction model, which automatically labels the character sequence; this is called pre-labeling; unlike a single-label entity labeling model, the multi-label entity prediction model can assign more than one entity label to the same character string, does not attend to entity relationships, and labels an entity even if it has no relationship with other entities;
S05), manually reviewing the pre-labeling results through a feedback model optimization framework based on online error correction, and scoring the generalization ability of the multi-label entity prediction model according to the difference between the manual review results and the model's prediction results, so as to optimize the multi-label entity prediction model;
S06), post-processing the original labeling results after manual review, thereby extracting entity information and storing it in the database.
2. The multi-label entity labeling method of claim 1, wherein the pre-labeling process of the multi-label entity prediction model comprises the following steps:
S41), pre-encoding the character sequence segments to be labeled through a pre-encoding module, wherein the pre-trained encoding module comprises an embedding layer and a pre-trained encoder; the embedding layer is obtained by summing three embedding vectors for each vocabulary item: its token embedding, its position embedding, and its segment embedding; the pre-trained encoder adds a segment recurrence mechanism on top of a BERT pre-trained language model: after the complete character sequence is divided into segments, the encoding vector of the character of the previous segment is added to each vector of the current segment, and the resulting vector sequence is encoded by the BERT encoder, according to the formula:
h_{i,j} = BERT(w_{i,j} + h_{i-1,0}) (1),
where w denotes a vocabulary item, i the i-th segment, and j the j-th vocabulary item in the current segment; the 0th vocabulary item of each segment is a special character, and h denotes the pre-encoding vector output by the BERT pre-trained language model;
S42), after pre-encoding, a text sequence of length L is encoded into a second-order tensor of shape [h0, L], in which each column is the context feature vector of one character in the text and h0 is the length of each character feature vector; considering that text content is not a simple linear concatenation of characters, semantic units are taken as nodes and grammatical dependency relations as edges connecting them, forming a relation graph over all vocabulary items; a graph convolutional network is constructed on this graph, with its number of convolution kernels set to the number of label categories, and is used to apply a nonlinear transformation to the second-order tensor output by the pre-encoding module, yielding K different feature maps, each of which is a matrix, K being the number of label categories;
through this step, the pre-encoding feature of shape [h0, L] is converted into a third-order tensor of shape [K, H, L], where H and L denote the hidden-layer dimension and the sequence length, respectively;
S43), processing all feature maps one by one with max pooling; the dimension-compression effect of the pooling layer extracts the maximum relevance score of each character in each map, yielding a classification feature matrix M of shape [K, L], in which each element is the relevance score between one character and one label, L being the sequence length;
S44), normalizing the matrix M element-wise with a sigmoid function and binarizing each element of M to 0 or 1: values below 0.5 are mapped to 0, the rest to 1; through these operations the text sequence is converted into a sparse classification matrix, each column of which is the model's labeling result for the vocabulary item at the corresponding position; for each vocabulary item, the nonzero positions of its labeling result are extracted and mapped to the corresponding labels, and these labels constitute the model's automatic labeling result for that item.
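The pooling-and-binarization decoding of steps S43) and S44) can be sketched in a few lines of numpy. The label names below are hypothetical placeholders, and the [K, H, L] input tensor is assumed to come from the graph convolutional network of step S42); this is an illustration of the decoding logic, not the invention's implementation.

```python
# Sketch of S43-S44: max-pool K feature maps of shape [K, H, L] down to a
# [K, L] score matrix, squash with a sigmoid, binarize at 0.5, and read off
# one or more labels per character position. LABELS is an illustrative set.
import numpy as np

LABELS = ["B-tax_item", "I-tax_item", "B-rate", "I-rate"]  # hypothetical tags

def decode_multilabel(feature_maps: np.ndarray) -> list[list[str]]:
    """feature_maps: [K, H, L] tensor -> per-character lists of labels."""
    m = feature_maps.max(axis=1)          # S43: max pooling over H -> [K, L]
    probs = 1.0 / (1.0 + np.exp(-m))      # S44: element-wise sigmoid
    binary = (probs >= 0.5).astype(int)   # threshold at 0.5 -> sparse matrix
    # Each column is the labeling result for one character; the nonzero rows
    # of a column map to that character's assigned labels.
    return [[LABELS[k] for k in np.nonzero(binary[:, j])[0]]
            for j in range(binary.shape[1])]
```

Because the decision is per element rather than per column (as a softmax would be), a single character can legitimately receive several labels at once, which is the multi-label property the claim describes.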
3. The multi-label entity labeling method of claim 1, wherein the process by which the feedback model optimization framework based on online error correction optimizes the multi-label entity prediction model comprises the following steps:
S51), deploying the previously trained multi-label entity prediction model as an online service; given a cleaned document to be labeled, the multi-label entity prediction model automatically labels the document to obtain the entities it contains;
S52), the online service returns the prediction results of the multi-label entity prediction model to the local side, where they are played through local human-computer voice interaction equipment; after playback, the server conducts an online question-and-answer interaction with the user, actively asking how satisfied the user is with the automatically labeled answers, prompting the user to identify incorrect labeling results, and then to give the correct answers;
S53), the real-time interaction log between the server and the user is transmitted to the background, where sentiment analysis is performed on the user's feedback text; a score table for the entity labeling results is then generated from the sentiment analysis results and a threshold is set empirically: labeled entities scoring below the threshold are discarded and those scoring above it are retained; correct answers identified by the user are added to a small-sample training library, and a final correct answer is generated at the same time;
S54), retraining the multi-label entity prediction model on the newly labeled small-sample corpus and updating the weight parameters of the automatic labeling model; the labeling results are post-processed after manual review, formatted into the data format required by the task-oriented interaction process, and uploaded to a cloud database for storage as structured knowledge.
4. The multi-label entity labeling method of claim 1, wherein in step S02 two labels are defined for each entity, B-entity and I-entity, where B marks the starting position of the entity and I the inside of the entity; a character that belongs to no entity category is labeled O.
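A minimal sketch of this BIO scheme, extended to the multi-label case the method targets (one character may carry B-/I- tags from several entity types at once). The entity names and the span-based input encoding are illustrative assumptions.

```python
# Hypothetical BIO tagger for the multi-label setting of claim 4: each
# character collects a SET of tags, one per entity type covering it.

def bio_tags(length: int, spans: dict[str, tuple[int, int]]) -> list[set[str]]:
    """spans maps an entity type to its (start, end) character span,
    end exclusive. Returns, per character, the set of BIO tags; characters
    covered by no entity get the single tag O."""
    tags = [set() for _ in range(length)]
    for ent, (start, end) in spans.items():
        tags[start].add(f"B-{ent}")          # B marks the entity start
        for i in range(start + 1, end):
            tags[i].add(f"I-{ent}")          # I marks the entity inside
    return [t or {"O"} for t in tags]
```

With overlapping spans, e.g. an "org" span covering characters 0-2 and a "loc" span covering 0-1, character 0 ends up tagged both B-org and B-loc, which a single-label scheme cannot express.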
5. The multi-label entity labeling method of claim 1, wherein in step S03 the cleaning means deleting content with a low degree of relevance and symbols other than separators, and the association sorting means gathering multiple revisions of the same topic and arranging them in order of release time; after entity label extraction is completed, old entity values are replaced with new entity values.
6. The multi-label entity labeling method of claim 5, wherein multiple revisions of the same topic are gathered as follows: all documents are arranged from front to back by release time, and through two nested loops the title of each document is fuzzily matched, one by one, against the contents of all other documents, thereby obtaining the later documents that cite and revise it.
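The two-loop revision-gathering pass can be sketched as follows. Substring containment stands in for the fuzzy matching (a real system might use edit distance or another similarity measure), and the document field names are illustrative assumptions.

```python
# Sketch of claim 6's revision gathering: sort documents by release time,
# then match each title against the body of every later document.
# The "title"/"date"/"text" keys and substring test are assumptions.

def gather_revisions(docs: list[dict]) -> dict[str, list[str]]:
    """docs: [{"title": ..., "date": ..., "text": ...}, ...].
    Returns title -> titles of later documents that cite/revise it."""
    ordered = sorted(docs, key=lambda d: d["date"])   # front to back by time
    revisions = {d["title"]: [] for d in ordered}
    for i, older in enumerate(ordered):               # outer loop: each doc
        for newer in ordered[i + 1:]:                 # inner loop: later docs
            if older["title"] in newer["text"]:       # fuzzy-match stand-in
                revisions[older["title"]].append(newer["title"])
    return revisions
```

Restricting the inner loop to later documents enforces the chronological "front to back" ordering, so a document can only be revised by documents released after it.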
7. The multi-label entity labeling method of claim 2, wherein before the multi-label entity prediction model is used for prediction, it is pre-trained with a small amount of cold-start labeled data, and the weight parameters of the multi-label entity prediction model are solved through this pre-training.
8. The multi-label entity labeling method of claim 1, wherein the method is suitable for extracting structured information from tax texts.
CN202111062720.7A 2021-09-10 2021-09-10 Multi-label entity labeling method Active CN113822026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111062720.7A CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111062720.7A CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Publications (2)

Publication Number Publication Date
CN113822026A true CN113822026A (en) 2021-12-21
CN113822026B CN113822026B (en) 2022-07-08

Family

ID=78921897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111062720.7A Active CN113822026B (en) 2021-09-10 2021-09-10 Multi-label entity labeling method

Country Status (1)

Country Link
CN (1) CN113822026B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN109543183A (en) * 2018-11-16 2019-03-29 西安交通大学 Multi-tag entity-relation combined extraction method based on deep neural network and mark strategy
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
US20210149993A1 (en) * 2019-11-15 2021-05-20 Intuit Inc. Pre-trained contextual embedding models for named entity recognition and confidence prediction
CN113191148A (en) * 2021-04-30 2021-07-30 西安理工大学 Rail transit entity identification method based on semi-supervised learning and clustering
CN113239191A (en) * 2021-04-27 2021-08-10 北京妙医佳健康科技集团有限公司 Manually-assisted text labeling method and device based on small sample data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING LI ET AL.: "A Survey on Deep Learning for Named Entity Recognition", arXiv:1812.09449v3 *
SHAN Yidong et al.: "Multi-label-based Named Entity Recognition in the Military Domain", Computer Science *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297987A (en) * 2022-03-09 2022-04-08 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114297987B (en) * 2022-03-09 2022-07-19 杭州实在智能科技有限公司 Document information extraction method and system based on text classification and reading understanding
CN114580577A (en) * 2022-05-05 2022-06-03 天津大学 Multi-mode-oriented interactive data annotation method and system
CN114580577B (en) * 2022-05-05 2022-09-13 天津大学 Multi-mode-oriented interactive data annotation method and system
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
CN114861600B (en) * 2022-07-07 2022-12-13 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
US11972214B2 (en) 2022-07-07 2024-04-30 Zhejiang Lab Method and apparatus of NER-oriented chinese clinical text data augmentation
CN115238702A (en) * 2022-09-21 2022-10-25 中科雨辰科技有限公司 Entity library processing method and storage medium
CN115238702B (en) * 2022-09-21 2022-12-06 中科雨辰科技有限公司 Entity library processing method and storage medium
CN116561317A (en) * 2023-05-25 2023-08-08 暨南大学 Personality prediction method, labeling method, system and equipment based on text guidance

Also Published As

Publication number Publication date
CN113822026B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN113822026B (en) Multi-label entity labeling method
CN111160008B (en) Entity relationship joint extraction method and system
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111858944B (en) Entity aspect level emotion analysis method based on attention mechanism
CN108519890A (en) A kind of robustness code abstraction generating method based on from attention mechanism
CN108182295A (en) A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN111694924A (en) Event extraction method and system
CN107766483A (en) The interactive answering method and system of a kind of knowledge based collection of illustrative plates
CN112883175B (en) Meteorological service interaction method and system combining pre-training model and template generation
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN115599899B (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN111553159B (en) Question generation method and system
CN113254675B (en) Knowledge graph construction method based on self-adaptive few-sample relation extraction
CN114580639A (en) Knowledge graph construction method based on automatic extraction and alignment of government affair triples
CN113128232A (en) Named entity recognition method based on ALBERT and multi-word information embedding
CN114648015B (en) Dependency relationship attention model-based aspect-level emotional word recognition method
CN113051904B (en) Link prediction method for small-scale knowledge graph
CN117194682B (en) Method, device and medium for constructing knowledge graph based on power grid related file
CN116562265B (en) Information intelligent analysis method, system and storage medium
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
CN116521857A (en) Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement
CN116341519A (en) Event causal relation extraction method, device and storage medium based on background knowledge
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant