CN114091406A - Intelligent text labeling method and system for knowledge extraction - Google Patents
- Publication number: CN114091406A
- Application number: CN202111202937.3A
- Authority: CN (China)
- Prior art keywords: labeling, data, text, entity, model
- Prior art date: 2021-10-15
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/117: Handling natural language data; Text processing; Tagging; Marking up; Designating a block; Setting of attributes
- G06F18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/23213: Pattern recognition; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions; Fixed number of clusters, e.g. K-means clustering
- G06F40/295: Handling natural language data; Natural language analysis; Recognition of textual entities; Named entity recognition
- G06N3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
- G06N5/025: Computing arrangements using knowledge-based models; Knowledge representation; Knowledge engineering; Knowledge acquisition; Extracting rules from data
Abstract
The invention relates to an intelligent text labeling method and system for knowledge extraction. Aiming at the problems that deep learning models in the knowledge extraction process lack labeled data, and that manual data labeling is tedious, requires a large amount of manual work by domain experts, and consumes time and labor, the invention provides an intelligent labeling method and system based on active learning.
Description
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to an intelligent text labeling method and system for knowledge extraction.
Background
In recent years, a large amount of unstructured data has accumulated in fields such as smart cities, medical treatment, and finance. It contains a large amount of valuable domain knowledge, which must be extracted with technologies such as named entity recognition and relation extraction in order to construct domain-specific knowledge graphs. Taking named entity recognition as an example, its purpose is to recognize instance entities of specified concept categories in unstructured text; relation extraction then extracts the relations between entities on the basis of entity extraction, so that a domain knowledge graph can be constructed. Named entity recognition and relation extraction are the key to information extraction from unstructured data and the core technologies of knowledge graph construction. The traditional named entity recognition task in natural language processing aims to recognize entities of specific categories in text, such as person names, place names, organization names, and proper nouns. For example, in the sentence "On June 20, 2016, the Cavaliers beat the Warriors in Oakland to win the NBA championship", the recognized entities include a place name (Oakland), a time (June 20, 2016), teams (the Cavaliers, the Warriors), and an organization (NBA). Named entity recognition and relation extraction construct the entities and relations under the various concept categories of a domain knowledge graph, and many deep learning methods have been proposed for these tasks. They generally fall into the following categories:
The first method uses a deep learning model for intelligent labeling, such as a CNN model suited to image labeling or an LSTM or BERT model suited to text labeling, to complete various labeling tasks. For example, the patent with publication number CN113255879A, "A deep learning labeling method, system, computer device, and storage medium", connects the whole labeling process through the cooperation of a task issuing unit, a labeling unit, and a management unit, thereby simplifying the labeling process. The patent with publication number CN112685999A, "An intelligent hierarchical labeling method", is a hierarchical label labeling method trained with an artificial intelligence model. However, these methods mainly depend on large-scale labeled datasets and are therefore difficult to use in real labeling scenarios, where large-scale labeled data is lacking.
The second method relies on a labeling administrator to manage the learning and labeling of the whole dataset and thereby improve labeling efficiency. For example, the patent with publication number CN113191120A, "Method, apparatus, electronic device, and storage medium for an intelligent labeling platform", proposes that the administrator divide the data to be labeled on the labeling platform for labeling and learning, manage the learning and labeling results, and evaluate the level of the labeling personnel, so as to improve labeling efficiency. This kind of data labeling method depends on an administrator who understands the deep learning model; an administrator with only domain knowledge cannot automatically manage the labeled dataset.
The third method uses active learning to select the most representative and informative data for manual labeling, removing the dependence on seed data for the label set. For example, the patent with publication number CN111783993A, "Intelligent labeling method, device, intelligent platform, and storage medium", and the patent with publication number CN113297351A, "Text data labeling method and device, electronic device, and storage medium", adopt diversity sampling and similarity sampling strategies to select specific datasets for labeling. The patent with publication number CN110968695A, "Intelligent labeling method, device, and platform based on active learning with weak supervision", discloses an intelligent labeling method based on active learning under weak supervision, covering tasks such as label classification, entity recognition, and relation recognition. However, these methods mainly target a single text classification task or a single text sequence labeling task; an intelligent labeling method and system that applies active learning to joint knowledge extraction, combining named entity recognition and relation extraction over text data, is still lacking.
Therefore, when applied to the task of extracting knowledge from unstructured text, the above methods still run into the following problems in practice:
Diversity of entity names: in entity recognition, entities can be expressed in many different ways. Apart from a few well-formed entities (such as telephone numbers and email addresses), most entities cannot be captured by naming rules, and a statistical context model must be built to recognize them. Entity types are likewise diverse, and there may be a complex hierarchy between types. The sheer number of entities makes it impossible to handle them by enumeration or manually written rules.
Resource scarcity: most current named entity recognition algorithms rely on supervised models and need a large amount of training corpus to reach practical performance. However, considering the cost of corpus annotation, in most practical situations it is impossible to obtain enough corpus to cover different domains, different text styles, non-standard text, and other variations.
Openness of entities: entities are not a closed set; they grow, evolve, and become obsolete over time. This openness leaves existing supervised methods unable to adapt to open knowledge extraction, and the performance of existing models degrades as time passes.
The above problems fundamentally limit the performance of most named entity recognition and relation extraction models.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide an intelligent text labeling method and system for knowledge extraction that can iteratively train a knowledge extraction model with only a limited amount of user labeling, so as to achieve the most accurate knowledge extraction possible.
To achieve the above purpose, the invention adopts the following technical scheme:
a knowledge extraction-oriented intelligent text labeling method comprises the following steps:
S1, labeling a part of the text data in the unstructured text to generate labeled text data;
S2, inputting the labeled text data from step S1 as a training set into a deep learning model for knowledge extraction, and training the deep learning model;
S3, performing model prediction on the remaining unlabeled text data with the trained deep learning model to complete their pre-labeling;
S4, evaluating the reliability of the pre-labeling results on the remaining unlabeled text data from step S3, adding the reliable prediction results into the training set, and sampling a set of representative samples from the unreliably predicted data for the next round of labeling;
and S5, performing a new round of labeling on the representative samples from step S4 and adding the labeling results into the training set, repeating until all the data are recognized.
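For illustration, the S1 to S5 loop can be sketched as follows. This is a minimal sketch, not the invention's implementation: the helper functions train_model, predict_with_reliability, sample_representative, and request_labels, and the .text/.labels/.reliability fields, are hypothetical stand-ins for the deep learning model and the expert labeling interface described later.

```python
# Minimal sketch of the S1-S5 active-learning labeling loop (helpers are hypothetical).

def intelligent_labeling(texts, reliability_threshold, samples_per_round):
    labeled = request_labels(texts[:samples_per_round])            # S1: expert labels a seed portion
    unlabeled = texts[samples_per_round:]

    while unlabeled:
        model = train_model(labeled)                               # S2: train the knowledge extraction model
        predictions = predict_with_reliability(model, unlabeled)   # S3: pre-label the remaining texts

        reliable = [p for p in predictions if p.reliability >= reliability_threshold]
        unreliable = [p for p in predictions if p.reliability < reliability_threshold]

        labeled += [(p.text, p.labels) for p in reliable]          # S4: keep the reliable predictions
        to_label = sample_representative(unreliable, samples_per_round)

        labeled += request_labels([p.text for p in to_label])      # S5: expert labels the sampled texts
        unlabeled = [p.text for p in unreliable if p not in to_label]

    return labeled
```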
Further, the labeling comprises labeling the position and the entity category of the named entity of the unstructured text, and labeling the relationship category between different entities; the labeling result comprises named entity labeling and entity relation labeling of the unstructured text.
Further, in step S2, the input of the deep learning model is the unstructured text together with its entity labels and relation labels, and the model comprises a representation learning layer, a model layer, and a prediction layer.
Further, character-level and word-level representation vectors are set in the representation learning layer, and representation vectors pre-trained by a Transformer or BERT model on a large-scale external dataset are also introduced.
Further, the representation learning layer introduces an entity dictionary vector and uses an existing entity dictionary to constrain entity recognition.
Further, the prediction layer adopts a table-labeling prediction method for knowledge extraction, which comprises the following steps:
S201, according to the length n of the unstructured text, establishing an n × n two-dimensional matrix as the labeling result table;
S202, the rows and columns of the table each correspond to a character of the unstructured text;
S203, filling the diagonal cells [i, j] (i = j) of the table with the named entity recognition result of the corresponding character, using the BIOS labeling method;
and S204, filling each off-diagonal cell [i, j] (i ≠ j) of the table with the relation between the entity containing the j-th character and the entity containing the i-th character, where the relation can point from the entity at position i to the entity at position j or from the entity at position j to the entity at position i; this symmetric table format expresses the labeling results of the two tasks of entity recognition and relation extraction.
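As an illustration of steps S201 to S204, a label table could be filled as in the sketch below; the entity span and relation inputs, and the BIOS tag strings, are illustrative assumptions rather than data from the filing.

```python
# Sketch: build the n x n labeling table for joint entity and relation labeling.

def build_label_table(text, entities, relations):
    """entities: list of (start, end, type); relations: list of (head_idx, tail_idx, rel_type),
    where head_idx/tail_idx index into `entities`. Both inputs are illustrative."""
    n = len(text)
    table = [["O"] * n for _ in range(n)]          # S201/S202: one row and column per character

    for start, end, etype in entities:             # S203: BIOS-style tags on the diagonal
        if end - start == 1:
            table[start][start] = f"S-{etype}"
        else:
            table[start][start] = f"B-{etype}"
            for i in range(start + 1, end):
                table[i][i] = f"I-{etype}"

    for head_idx, tail_idx, rtype in relations:    # S204: relation label in the off-diagonal cells
        h_start, h_end, _ = entities[head_idx]
        t_start, t_end, _ = entities[tail_idx]
        for i in range(h_start, h_end):
            for j in range(t_start, t_end):
                table[i][j] = rtype                # cell [i, j]: relation from the entity at i to the entity at j
    return table
```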
Further, in step S4, the representative samples to be labeled in each round are obtained by active learning technique screening, and the method for evaluating the "reliability" of the pre-labeling result of the remaining unlabeled text data includes the following sub-steps:
S401, obtaining the representation vector of each text from the hidden layer of the trained deep learning model, and performing K-means clustering on all the unlabeled texts;
S402, sampling according to WPMI, which is calculated from the following quantities:
p(t_i) = f(t_i)/N is the probability that a certain entity t_i appears in a text, where f(t_i) is the number of texts in which entity t_i appears and N is the number of all texts; likewise, p(t_1, …, t_m) is the probability of a text containing the entities (t_1, …, t_m), where (t_1, …, t_m) are all the entities appearing in the text p. The higher the calculated value of WPMI(p), the lower the "reliability" of the sample; when the obtained WPMI(p) value is higher than the set threshold, the selected prediction sample is judged to be "unreliable".
Further, when, over successive batches of labeled data, the training curve of the deep learning model flattens and no longer decreases, model prediction is performed on the remaining data and the prediction results are given secondary confirmation; the labeled samples with high reliability are used directly as labeled data, and knowledge extraction over the whole dataset is completed after several rounds of confirmation.
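One simple way to detect that the training curve "flattens and no longer decreases" is a patience-style check over the recent losses; the window size and tolerance in the sketch below are assumed values, not prescribed by the invention.

```python
# Sketch: decide whether the training curve has flattened over recent batches.

def training_has_plateaued(loss_history, window=3, tolerance=1e-3):
    """loss_history: per-batch (or per-round) training losses, most recent last."""
    if len(loss_history) <= window:
        return False
    recent = loss_history[-window:]
    # The curve is considered flat when the best recent loss improves on the
    # earlier best by less than `tolerance`.
    return min(loss_history[:-window]) - min(recent) < tolerance
```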
An intelligent text labeling system for knowledge extraction comprises a model learning unit for constructing a deep learning model and a data labeling unit for constructing an active learning module;
The data labeling unit comprises a labeling module, a data set management module, and a service module, wherein the labeling module is mainly used for processing the various labeling behaviors of the user, and the data set management module is mainly used for maintaining the dataset to be labeled, the labeled dataset, the entity dictionary, the pre-training model, and the labeling rules;
the model learning unit comprises a data preprocessing module, a neural network model, a sample sampling module, and a service module; the data preprocessing module preprocesses the user's labeled data into the input format of the deep learning model and trains in batches; the neural network model builds a deep learning model that jointly performs named entity recognition and relation extraction from the batch of labeled data; the sample sampling module uses the trained deep learning model to predict the unlabeled data and selects the data with the lowest prediction reliability as the next group of data to be labeled, which is transmitted to the data labeling service.
Furthermore, the model learning unit and the data labeling unit are deployed, respectively, on a GPU server suited to deep model training and on a server that responds efficiently to users, and data transmission between the two services is completed through the service modules.
Further, the sample sampling module maintains both the unlabeled data and the labeled data, and in each iteration samples valuable examples from the unlabeled data to be labeled by domain experts.
Furthermore, after labeling in the labeling module is finished, the labeled data are added and the deep learning model is trained; after training, the remaining unlabeled data are predicted, samples whose prediction uncertainty is higher than a certain threshold are labeled further, and this process is repeated until no unlabeled data remain.
The invention has the following effects: the disclosed intelligent text labeling method and system for knowledge extraction address the problems that deep learning models in the knowledge extraction process lack labeled data and that manual data labeling is tedious, requires a large amount of manual work by domain experts, and consumes time and labor; by combining a joint knowledge extraction model with active learning, the labeling cost is relieved and the labeling resources and the deep learning model are used to the greatest extent.
Drawings
FIG. 1 is an operation flowchart of an intelligent text labeling method for knowledge extraction according to the present invention;
FIG. 2 is a schematic diagram of a joint knowledge extraction deep learning model of the intelligent text labeling system for knowledge extraction according to the invention;
FIG. 3 is a schematic diagram of an active learning technique of the intelligent text labeling system for knowledge extraction according to the present invention;
FIG. 4 is a page jump flow chart of an intelligent labeling system of the intelligent text labeling method for knowledge extraction according to the present invention;
FIG. 5 is a structural framework diagram of an intelligent text annotation system for knowledge extraction according to the present invention;
FIG. 6 is a frame diagram of a data annotation part of the intelligent text annotation system for knowledge extraction according to the present invention;
FIG. 7 is a frame diagram of a model learning part of the intelligent text labeling system for knowledge extraction according to the present invention;
FIG. 8 is a sample diagram of a labeling interface of the intelligent text labeling system for knowledge extraction according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The first embodiment is as follows:
as shown in fig. 1, an intelligent text labeling method for knowledge extraction includes the following steps:
s1, the domain expert marks the position and the category of the entity in the unstructured text, marks the incidence relation between different entities and generates a part of marking data;
s2, inputting the marked data as a training set into a deep learning model for the joint named entity recognition and relationship extraction to carry out deep learning model training;
s3, predicting the positions and the types of the entities and predicting the relationship types among the entities for the residual unlabeled text data;
s4, evaluating the reliability of the entity recognition and relationship extraction results, adding the reliable prediction results into a training set, sampling a part of representative samples from the unreliable samples, and providing the samples for the next round of labeling by field experts;
and S5, the expert performs a new round of labeling and adds the labeling result into the training set until all the data are recognized.
According to the invention, an intelligent labeling system is built from a deep learning module and an active learning module. A user creates an intelligent labeling project, uploads unlabeled data, and freely sets the named entity categories and category dictionaries, and the model structure is adjusted by setting the learning model parameters. In each round, the labeling system samples data from the unlabeled data for the user to label; after at least m samples have been labeled, an active learning task is submitted. When training is finished, labeling ends and the results are exported; otherwise labeling continues in a loop. The intelligent labeling system can effectively relieve the problem of high labeling cost, allowing the user to extract named entities from unlabeled text with as little labeling as possible while making maximum use of the labeling resources and the deep learning model.
Specifically, there is a strong correlation between the two task phases of knowledge extraction: the accuracy of identifying named entities in the unstructured text will affect the accuracy of extracting the relationships between the entities; the extracted relationships may also affect the identified location and category of the named entity. In step S1 of the present invention, both the named entity location and the entity type of the unstructured text need to be labeled, and on this basis, the relationship type between different entities also needs to be labeled. Therefore, the user labeling results in the same batch comprise named entity labeling and entity relation labeling of the unstructured text.
As shown in FIG. 2, in order to model the association between the two knowledge extraction tasks, step S2 of the invention builds a deep learning model that jointly performs named entity recognition and relation extraction. The input of the deep learning model is the unstructured text together with its entity labels and relation labels; the model comprises a representation learning layer, a model layer, and a prediction layer, and the two tasks of entity recognition and relation extraction share the representation learning layer and the model layer.
Character-level and word-level representation vectors are set in the representation learning layer, and representation vectors pre-trained by models such as Transformer or BERT on large-scale external datasets are introduced, so that the model can learn semantic information at different levels and the word segmentation effect improves as the annotators label more data.
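As one possible reading of this layer, character embeddings learned from the labeled corpus can be concatenated with contextual vectors from a pre-trained model; the sketch below uses the Hugging Face transformers package and the bert-base-chinese checkpoint purely as an example, neither of which is prescribed by the filing, and it assumes each character maps to one tokenizer token.

```python
# Sketch: character-level embeddings combined with pre-trained BERT vectors
# (model name, dimensions, and one-token-per-character alignment are assumptions).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class RepresentationLayer(nn.Module):
    def __init__(self, char_vocab_size, char_dim=64, bert_name="bert-base-chinese"):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)

    def forward(self, text, char_ids):
        # char_ids: LongTensor [seq_len] indexing a character vocabulary built from the corpus.
        enc = self.tokenizer(list(text), is_split_into_words=True,
                             return_tensors="pt", add_special_tokens=False)
        bert_vec = self.bert(**enc).last_hidden_state.squeeze(0)   # [seq_len, bert_hidden]
        char_vec = self.char_emb(char_ids)                         # [seq_len, char_dim]
        return torch.cat([char_vec, bert_vec], dim=-1)             # per-character representation
```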
The category of the named entity can be freely selected from ontology concepts of the domain knowledge graph by a user, and belongs to an open named entity recognition model.
Unstructured texts in different fields have specific characteristics: there are many specialized terms and most concepts have a physical dictionary, and text representations in different domains often have fixed expression habits and patterns. Thus, the representation learning layer of the model introduces a solid dictionary vector, using the existing solid dictionary to constrain the recognition of the entity. In the present embodiment, the medical dictionary is merely exemplified, and is not limited thereto.
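One common way to realize such a dictionary constraint is a per-character match feature marking whether the character lies inside a dictionary entry found in the text; the sketch below is only one possible interpretation of the entity dictionary vector, and the medical terms in the usage comment are made up.

```python
# Sketch: per-character entity-dictionary match features (dictionary entries are illustrative).

def dictionary_features(text, entity_dict):
    """entity_dict maps a concept category to a set of known entity strings."""
    categories = sorted(entity_dict)
    feats = [[0.0] * len(categories) for _ in text]
    for c, cat in enumerate(categories):
        for term in entity_dict[cat]:
            start = text.find(term)
            while start != -1:
                for i in range(start, start + len(term)):
                    feats[i][c] = 1.0          # character i lies inside a dictionary entity of this category
                start = text.find(term, start + 1)
    return feats                                # [seq_len, num_categories], concatenated to the representation

# Hypothetical usage with a made-up medical dictionary:
# dictionary_features("患者出现急性阑尾炎症状", {"疾病": {"急性阑尾炎"}})
```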
Manually defined rules can also be introduced to extract syntactic or lexical features, so that domain-specific semantic information is learned better.
The model layer adopts a Bi-LSTM to capture the contextual semantics of the unstructured text sequence, and a self-attention mechanism is used at the same time to take the information of the whole text into account.
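A model layer consistent with this description could stack a bidirectional LSTM with multi-head self-attention, as sketched below; the hidden size and number of attention heads are assumptions.

```python
# Sketch: Bi-LSTM context encoder followed by self-attention over the whole sentence.
import torch
import torch.nn as nn

class ModelLayer(nn.Module):
    def __init__(self, input_dim, hidden_dim=256, num_heads=4):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, reps):
        # reps: [batch, seq_len, input_dim] per-character representations from the learning layer.
        ctx, _ = self.bilstm(reps)                   # local context from both directions
        attended, _ = self.attn(ctx, ctx, ctx)       # self-attention integrates whole-text information
        return attended                              # [batch, seq_len, 2 * hidden_dim]
```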
In the prediction layer, a conditional random field completes the sequence labeling of each character; it considers all labeling structures jointly and effectively prevents obvious labeling errors.
The invention adopts a table-labeling prediction method for knowledge extraction at the prediction layer:
(1) according to the length n of the unstructured text, establish an n × n two-dimensional matrix as the labeling result table;
(2) the rows and columns of the table each correspond to a character of the unstructured text;
(3) fill the diagonal cells [i, j] (i = j) of the table with the named entity recognition result of the corresponding character; the common BIOS labeling method is used in the invention;
(4) fill each off-diagonal cell [i, j] (i ≠ j) of the table with the relation between the entity containing the j-th character and the entity containing the i-th character, where the relation can point from the entity at position i to the entity at position j or from the entity at position j to the entity at position i; this symmetric table format expresses the labeling results of the two tasks of entity recognition and relation extraction.
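Reading the predicted table back out, the diagonal yields the entity spans and the off-diagonal cells yield relation triples. The sketch below assumes the same BIOS and relation cell conventions as steps (1) to (4), and resolves each entity pair by a majority vote over the cells it covers, which is one possible decoding choice rather than the one fixed by the filing.

```python
# Sketch: decode entities and relation triples from a predicted n x n label table.
# Assumes S- tags mark single-character entities, B-/I- tags mark longer ones,
# and off-diagonal cells hold either "O" or a relation label.
from collections import Counter

def decode_table(text, table):
    n = len(text)
    entities, start = [], None
    for i in range(n):                                   # diagonal: BIOS tags -> entity spans
        tag = table[i][i]
        if tag.startswith("S-"):
            entities.append((i, i + 1, tag[2:]))
            start = None
        elif tag.startswith("B-"):
            start = i
        elif tag.startswith("I-") and start is not None:
            nxt = table[i + 1][i + 1] if i + 1 < n else "O"
            if not nxt.startswith("I-"):
                entities.append((start, i + 1, tag[2:]))
                start = None
        else:
            start = None

    triples = []
    for hi, (hs, he, _) in enumerate(entities):          # off-diagonal: relations between entity pairs
        for ti, (ts, te, _) in enumerate(entities):
            if hi == ti:
                continue
            votes = Counter(table[i][j] for i in range(hs, he) for j in range(ts, te))
            label = votes.most_common(1)[0][0]
            if label != "O":
                triples.append((hi, label, ti))          # relation pointing from entity hi to entity ti
    return entities, triples
```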
In order to make full use of expert labeling resources, the samples to be labeled in each round are screened using active learning. As shown in fig. 3, active learning iteratively delivers the most informative samples to the annotator for labeling and is an effective way to reduce annotation cost. After each round of expert labeling, the joint knowledge extraction deep learning model is trained on that round's labeled data, the remaining unlabeled data are predicted and evaluated, and the data whose prediction results have the lowest reliability are selected as the data to be labeled in the next round.
To save expert labeling effort, when the deep learning model enters a training plateau, that is, when over labeled data of different batches its training curve flattens and no longer decreases, model prediction is performed on the remaining data, which is sent to the expert for secondary confirmation; the labeled samples with high reliability are used directly as labeled data, and knowledge extraction over the whole dataset is completed after several rounds of confirmation.
The training curve refers to a convergence process curve obtained by deep learning model training.
Fig. 4 is a page jump flow diagram of the intelligent labeling system. The intelligent labeling system for knowledge extraction developed by the invention includes a number of pages, such as a labeling project management page, a data uploading page, an entity category selection page, a data labeling page, a task setting page, and a task state page.
On the labeling project management page, the system provides functions such as project creation, data uploading, category setting, data labeling, task setting, and task state.
On the data uploading page, the system provides functions such as uploading data to be labeled and labeled data, uploading an entity dictionary, and uploading labeling rules and a pre-training model.
On the entity category selection page, the system provides functions such as selecting ontology model concepts, editing user concept categories, and selecting concept category dictionaries.
On the data labeling page, the system provides functions such as grouped data labeling and labeling progress display.
On the task setting page, the system provides functions such as deep learning model parameter setting and active learning parameter setting.
On the task state page, the system provides functions such as knowledge extraction progress display and knowledge extraction result export.
The second embodiment is as follows:
Fig. 5 is a structural framework diagram of the intelligent text annotation system for knowledge extraction according to the present invention. The intelligent marking system based on active learning is mainly divided into two parts: the system comprises a model learning unit for constructing a deep learning model and a data labeling unit for constructing an active learning module.
The data labeling unit comprises a labeling module, a data set management module, and a service module. In the data labeling unit, the labeling module is mainly used for processing the various labeling behaviors of the user, and the data set management module is mainly used for maintaining data such as the dataset to be labeled, the labeled dataset, the entity dictionary, the pre-training model, and the labeling rules.
The model learning unit comprises a data preprocessing module, a neural network model, a sample sampling module, and a service module. In the model learning part, the data preprocessing module preprocesses the user's labeled data into the input format of the deep learning model and trains in batches; the neural network model builds a deep learning model that jointly performs named entity recognition and relation extraction from the batch of labeled data; and the sample sampling module uses the trained deep learning model to predict the unlabeled data and selects the data with the lowest prediction reliability as the next group of data to be labeled, which is transmitted to the data labeling service.
The model learning unit and the data labeling unit are deployed, respectively, on a GPU server suited to deep model training and on a server that responds efficiently to users, and data transmission between the two units is completed through the service modules.
Fig. 6 is a frame diagram of a data labeling unit of the intelligent labeling system shown in fig. 5, which includes a data set management module, a labeling module and a service module, and all the parts together complete the functions of data storage and maintenance, data labeling and labeling learning task setting, communication between services, and the like.
Fig. 7 is a framework diagram of the model learning unit of the intelligent labeling system shown in fig. 5, which comprises a data preprocessing module, a neural network model, a sample sampling module, and a service module; together these parts perform functions such as data storage and maintenance, training and prediction of the joint knowledge extraction model, sample sampling for active learning, and communication between services. In the sample sampling module, unlabeled data and labeled data are maintained at the same time, and in each iteration valuable samples are drawn from the unlabeled data and labeled by a domain expert. After labeling is finished, the labeled data are added and the deep learning model is trained; after training, the remaining unlabeled data are predicted, and samples whose prediction uncertainty is higher than a certain threshold are labeled further. This process is repeated until no unlabeled data remain.
Measuring the "uncertainty" of a sample and selecting samples accordingly are active topics in active learning research. The invention uses a clustering-based sampling method built on WPMI (Weighted Pointwise Mutual Information). To provide the most valuable samples to the annotators in each labeling iteration, the representation vector of each text is first obtained from the hidden layer of the trained deep learning model, and K-means clustering is performed on all the unlabeled texts. Sampling is then carried out according to WPMI, and the samples drawn in each round are guaranteed to cover every cluster as far as possible, so that the labeled samples carry the greatest value while the number of labeling rounds is minimized. WPMI is calculated from the following quantities:
p(t_i) = f(t_i)/N is the probability that a certain entity t_i appears in a text, where f(t_i) is the number of texts in which entity t_i appears and N is the number of all texts. Likewise, p(t_1, …, t_m) is the probability of a text containing the entities (t_1, …, t_m), where (t_1, …, t_m) are all the entities appearing in the text p. The higher the calculated value of WPMI(p), the lower the "reliability" of the sample, and when the obtained WPMI(p) value is higher than a set threshold, the selected prediction sample is judged to be "unreliable".
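Since the filing gives the WPMI formula only as an equation figure, the sketch below assumes a standard pointwise mutual information score between the joint entity occurrence p(t_1, …, t_m) and the individual probabilities p(t_i) = f(t_i)/N, weighted by the number of entities; the exact weighting used by the invention may differ. It then picks the highest-WPMI (least reliable) text from each K-means cluster, so that every cluster is covered.

```python
# Sketch: cluster unlabeled texts, then pick the least "reliable" (highest-WPMI) text per cluster.
# The WPMI form below is an assumption; the patent defines only p(t_i) = f(t_i) / N in prose.
import math
import numpy as np
from sklearn.cluster import KMeans

def wpmi(entities, all_entity_sets, n_texts):
    """entities: entity set predicted for one text; all_entity_sets: predicted entity sets of every text."""
    if not entities:
        return 0.0
    f_joint = sum(1 for s in all_entity_sets if entities <= s)
    p_joint = f_joint / n_texts                                    # p(t_1, ..., t_m)
    log_indep = sum(math.log(sum(1 for s in all_entity_sets if t in s) / n_texts)
                    for t in entities)                             # sum of log p(t_i)
    return len(entities) * (math.log(p_joint) - log_indep)         # entity-count weighting is an assumption

def sample_next_round(vectors, entity_sets, k):
    """vectors: hidden-layer text representations; entity_sets: predicted entities per text."""
    sets = [set(e) for e in entity_sets]
    n = len(sets)
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(np.asarray(vectors))
    scores = [wpmi(s, sets, n) for s in sets]
    picked = []
    for c in range(k):                                             # one least-reliable text per cluster
        members = [i for i in range(n) if clusters[i] == c]
        if members:
            picked.append(max(members, key=lambda i: scores[i]))
    return picked
```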
As shown in fig. 8, a sample view of the labeling interface of the knowledge extraction-oriented intelligent text labeling system of the invention, 8-1 on the left of the interface is a preview of the data samples to be labeled, and 8-2 on the right is the labeling interface for the current data sample. The current labeling progress and the state of each service of the intelligent labeling system are shown above the main interface. The user labeling area of the main interface is where the user labels: selecting a phrase and clicking a concept category completes the labeling of a named entity. The AI recommendation area of the main interface shows the prediction result of the joint knowledge extraction model on the data sample and is used to recommend labels to the user, who can click a recommended result to complete the labeling. The entity relation labeling interface is similar.
In summary, the invention provides an intelligent text labeling method and system based on active learning that addresses the problems that deep learning models in the knowledge extraction process lack labeled data and that manual data labeling is tedious, requires a large amount of manual work by domain experts, and consumes time and labor.
The method and system of the present invention are not limited to the embodiments described in the detailed description, and those skilled in the art can derive other embodiments according to the technical solutions of the present invention, which also belong to the technical innovation scope of the present invention.
Claims (12)
1. A knowledge extraction-oriented intelligent text labeling method comprises the following steps:
S1, labeling a part of the text data in the unstructured text to generate labeled text data;
S2, inputting the labeled text data from step S1 as a training set into a deep learning model for knowledge extraction, and training the deep learning model;
S3, performing model prediction on the remaining unlabeled text data with the trained deep learning model to complete their pre-labeling;
S4, evaluating the reliability of the pre-labeling results on the remaining unlabeled text data from step S3, adding the reliable prediction results into the training set, and sampling a set of representative samples from the unreliably predicted data for the next round of labeling;
and S5, performing a new round of labeling on the representative samples from step S4 and adding the labeling results into the training set, repeating until all the data are recognized.
2. The intelligent text labeling method for knowledge extraction as claimed in claim 1, wherein: the labeling comprises labeling the named entity position and the entity type of the unstructured text and labeling the relationship type among different entities; the labeling result comprises named entity labeling and entity relation labeling of the unstructured text.
3. The intelligent text labeling method for knowledge extraction as claimed in claim 2, wherein: in step S2, the input of the deep learning model is the unstructured text together with its entity labels and relation labels, and the model comprises a representation learning layer, a model layer, and a prediction layer.
4. The intelligent text labeling method for knowledge extraction as claimed in claim 3, wherein: character-level and word-level representation vectors are set in the representation learning layer, and representation vectors pre-trained by a Transformer or BERT model on a large-scale external dataset are introduced.
5. The intelligent text labeling method for knowledge extraction as claimed in claim 4, wherein: the representation learning layer introduces entity dictionary vectors and uses the existing entity dictionary to constrain the recognition of the entity.
6. The intelligent text labeling method for knowledge extraction as claimed in claim 3, wherein: a table-labeling prediction method for knowledge extraction is adopted in the prediction layer and comprises the following steps:
S201, according to the length n of the unstructured text, establishing an n × n two-dimensional matrix as the labeling result table;
S202, the rows and columns of the table each correspond to a character of the unstructured text;
S203, filling the diagonal cells [i, j] (i = j) of the table with the named entity recognition result of the corresponding character, using the BIOS labeling method;
and S204, filling each off-diagonal cell [i, j] (i ≠ j) of the table with the relation between the entity containing the j-th character and the entity containing the i-th character, wherein the relation can point from the entity at position i to the entity at position j or from the entity at position j to the entity at position i, and this symmetric table format expresses the labeling results of the two tasks of entity recognition and relation extraction.
7. The intelligent text labeling method for knowledge extraction as claimed in claim 1, wherein: in step S4, representative samples to be labeled in each round are obtained by active learning technique screening, and the method for evaluating the "reliability" of the pre-labeling result of the remaining unlabeled text data includes the following substeps:
S401, obtaining the representation vector of each text from the hidden layer of the trained deep learning model, and performing K-means clustering on all the unlabeled texts;
S402, sampling according to WPMI, which is calculated from the following quantities:
p(t_i) = f(t_i)/N is the probability that a certain entity t_i appears in a text, where f(t_i) is the number of texts in which entity t_i appears and N is the number of all texts; likewise, p(t_1, …, t_m) is the probability of a text containing the entities (t_1, …, t_m), where (t_1, …, t_m) are all the entities appearing in the text p; a higher calculated value of WPMI(p) indicates a lower "reliability" of the sample, and when the obtained WPMI(p) value is higher than a set threshold, the selected prediction sample is judged to be "unreliable".
8. The intelligent text labeling method for knowledge extraction as claimed in claim 1, wherein: when, over labeled data of different batches, the training curve of the deep learning model flattens and no longer decreases, model prediction is performed on the remaining data, the prediction results are given secondary confirmation, the labeled samples with high reliability are used directly as labeled data, and knowledge extraction of the whole dataset is completed after several rounds of confirmation.
9. The intelligent text labeling system oriented to knowledge extraction is characterized in that: the system comprises a model learning unit for constructing a deep learning model and a data labeling unit for constructing an active learning module;
the data labeling unit comprises a labeling module, a data set management module, and a service module, wherein the labeling module is mainly used for processing the various labeling behaviors of the user, and the data set management module is mainly used for maintaining the dataset to be labeled, the labeled dataset, the entity dictionary, the pre-training model, and the labeling rules;
the model learning unit comprises a data preprocessing module, a neural network model, a sample sampling module and a service module; the data preprocessing module is used for preprocessing user labeling data into an input format of a deep learning model and training in batches; the neural network model is used for constructing a deep learning model combining named entity identification and relationship extraction by using the batch of labeled data; the sample sampling module predicts the unmarked data by the trained deep learning model, and selects the data with lower prediction 'credibility' as the next group of data to be marked to be transmitted to the data marking service.
10. The intelligent text labeling system for knowledge extraction as claimed in claim 9, wherein: the model learning unit and the data labeling unit are deployed, respectively, on a GPU server suited to deep model training and on a server that responds efficiently to users, and data transmission between the two services is completed through the service modules.
11. The intelligent text labeling system for knowledge extraction as claimed in claim 9, wherein: the sample sampling module maintains both the unlabeled data and the labeled data, and in each iteration valuable samples are drawn from the unlabeled data and labeled by domain experts.
12. The intelligent text labeling system for knowledge extraction as claimed in claim 9, wherein: and adding labeled data after the labeling of the labeling module is finished, training by a deep learning model, predicting the residual unlabeled data after the training is finished, further labeling samples with uncertainty of a prediction result higher than a certain threshold value, and repeating the process until the unlabeled data are empty.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202937.3A (CN114091406A) | 2021-10-15 | 2021-10-15 | Intelligent text labeling method and system for knowledge extraction |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111202937.3A (CN114091406A) | 2021-10-15 | 2021-10-15 | Intelligent text labeling method and system for knowledge extraction |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114091406A (en) | 2022-02-25 |

Family
ID=80297027

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111202937.3A (CN114091406A, pending) | Intelligent text labeling method and system for knowledge extraction | 2021-10-15 | 2021-10-15 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN114091406A (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |