CN111737951A

CN111737951A - Text language incidence relation labeling method and device

Info

Publication number: CN111737951A
Application number: CN201910212664.7A
Authority: CN
Inventors: 韩英; 刘迪; 王腾蛟; 邱镇; 陈薇; 孟洪民
Original assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Current assignee: Peking University; State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Zhejiang Electric Power Co Ltd
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2020-10-02
Anticipated expiration: 2039-03-20
Also published as: CN111737951B

Abstract

The invention discloses a method and a device for labeling association relations in text language. Exploiting the close association among the information extraction subtasks of text language, it designs a composite labeling method that is independent of any specific model, naturally fuses multiple text language information extraction tasks, and realizes joint learning and integrated training of multiple associated tasks, for example joint learning of named entity recognition and named entity standardization, of named entity recognition and entity relation extraction, or of named entity recognition and entity disambiguation. The composite labeling method makes full use of the close association among the subtasks, realizes complete joint learning, lets the associated tasks share information and reinforce one another, and improves the accuracy and recall of text language information extraction as a whole.

Description

Text language incidence relation labeling method and device

Technical Field

The invention belongs to the field of information technology and relates to a method for assisting information extraction from text language using computer intelligence. Specifically, it relates to a composite labeling method that exploits the close association among the information extraction subtasks of text language to naturally fuse multiple information extraction tasks, realizing joint learning and integrated training of multiple associated tasks so that the tasks share information and reinforce one another, thereby improving the accuracy and recall of text language information extraction.

Background

Text language is the main form of expression of natural language and an important carrier of information. In the current era of information explosion, the key to data intelligence is how to extract useful structured information from massive unstructured text. Information extraction from text language comprises several subtasks, such as named entity recognition, named entity standardization and entity relation extraction. These subtasks are closely associated, but traditional methods treat them as independent tasks and perform them separately (Peng Z, Sun L, Han X. SIR-NERD: A Chinese named entity recognition and disambiguation system using a two-stage method [C]// Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing. 2012: 115-120), so the tasks cannot share and complement one another's information.

At present, a small number of researchers have begun to pay attention to the association between text language information extraction subtasks. Liu X et al. (Liu X, Zhou M, Wei F, et al. Joint inference of named entity recognition and normalization for tweets [C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012: 526-) proposed a joint learning method for named entity recognition and normalization based on a probabilistic graphical model. This method is not a neural network architecture; it depends on feature engineering, which is tedious and time-consuming, and it is difficult to adapt to different corpora. Zheng S et al. (Zheng S, Hao Y, Lu D, et al. Joint entity and relation extraction based on a hybrid neural network [J]. Neurocomputing, 2017, 257: 59-66) proposed a hybrid framework for named entity recognition and entity relation extraction. This joint learning approach is based on neural networks, but the joint learning is not thorough: in the training stage, the parameters related to named entity recognition are optimized first, and entity relation extraction is trained afterwards. Such two-stage training does not achieve global optimization. How to design a method that does not depend on a specific machine learning or deep learning model and that supports integrated training is therefore a very challenging problem.

Disclosure of Invention

In view of the above problems, the present invention aims to provide a model-independent general joint learning strategy supporting integrated training, which does not depend on a specific model and simultaneously supports multi-task integrated training.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for labeling association relations of text language, comprising the following steps:

1) determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;

2) analyzing text corpora and defining a tag set of each information extraction subtask;

3) combining the label sets of all the information extraction subtasks to form a composite labeling system;

4) and labeling the text corpus according to the composite labeling system.

Further, the information extraction subtask in step 1) may include, but is not limited to, a named entity identification subtask, a named entity normalization subtask, and a named entity relationship extraction subtask.

Further, step 2) defines, on the corpus, a separate labeling system for each text language information extraction subtask; each information extraction subtask corresponds to a label set that includes the position of a character in an entity and the entity type.

Further, step 3) combines, for information extraction subtasks that are associated with one another, the label sets of those subtasks and optimizes the part common to their labels, forming a composite labeling system and realizing the natural fusion of multiple tasks.

A text language association labeling apparatus, comprising:

the subtask determining module is responsible for determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;

the tag set definition module is responsible for analyzing the text corpus and defining the tag set of each information extraction subtask;

the label combination module is responsible for combining label sets of all the information extraction subtasks to form a composite labeling system;

and the marking module is responsible for marking the text corpus according to the composite marking system.

A machine learning model integrated training method supporting multiple tasks comprises the following steps:

(1) labeling the text corpus according to the composite labeling system by adopting the method to obtain a training data set and a test data set;

(2) selecting a specific machine learning (including deep learning) model;

(3) in the prediction stage, decoding, according to the composite labeling system, the label sequence that the machine learning model predicts from an input sequence, to obtain the final label prediction result;

(4) in the training iterations of the machine learning model, optimizing on the training data set while testing on the test data set, and stopping training when the result on the test data set declines.

Furthermore, a plurality of tasks are completely fused together through the composite labeling system, so that integrated training is realized, and separate training of each task in multiple stages is not required.

Further, the machine learning model is a traditional machine learning model or a deep learning model based on a deep neural network; the traditional machine learning model comprises a conditional random field, a hidden Markov model or another model based on a probabilistic graphical model.

Further, the decoding extracts entity relationships according to a proximity principle.

A multitasking enabled machine learning model integrated training device, comprising:

the data preparation module is responsible for labeling the text corpus according to the composite labeling system by adopting the method to obtain a training data set and a test data set;

the model selection module is responsible for selecting a specific machine learning model;

the decoding module is responsible for decoding, in the prediction stage and according to the composite labeling system, the label sequence that the machine learning model predicts from an input sequence, to obtain the final label prediction result;

and the training module is responsible for optimizing on the training data set while testing on the test data set during the training iterations of the machine learning model, and stops training when the result on the test data set declines.

The invention provides a general, model-independent joint learning strategy that supports integrated training. It does not depend on a specific model: it supports traditional statistical machine learning as well as deep learning based on deep neural networks, and it supports multi-task integrated training, for example joint learning of named entity recognition and named entity standardization, of named entity recognition and entity relation extraction, or of named entity recognition and entity disambiguation. The invention naturally fuses multiple text language information extraction tasks and realizes joint learning and integrated training of multiple associated tasks, so that the tasks share information and reinforce one another, improving the accuracy and recall of text language information extraction.

The invention is an innovation on a labeling method, does not relate to a specific model, and is suitable for both traditional machine learning and deep learning based on a neural network; a composite labeling system is designed by combining label sets of a plurality of subtasks, so that the natural fusion of multiple tasks is realized; the multiple tasks are completely fused together by the composite labeling system, integrated training can be realized, and separate training of each task in multiple stages is not required.

Drawings

FIG. 1 is a schematic diagram of the text language association relation labeling method according to an embodiment of the present invention, in which diagram (a) shows composite labeling for named entity recognition and standardization, and diagram (b) shows composite labeling for named entity recognition and relation extraction.

FIG. 2 is a flow chart of steps for an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

Fig. 1 is a schematic diagram of a text language association relation labeling method according to an embodiment of the present invention, where (a) is a composite labeling system of a joint learning framework in which the labeling method is applied to named entity recognition and entity standardization, and (b) is a composite labeling system of a joint learning framework in which the labeling method is applied to named entity recognition and relation extraction. The method for labeling the association relation based on the text language is suitable for the joint learning of subtasks associated with various text languages, and the method is only described by two examples of the joint learning of named entity identification and entity standardization and the joint learning of named entity identification and relation extraction.

The composite label in diagram (a) of FIG. 1 is composed of the named entity recognition tag and the named entity standardization tag, in the style [position-entity type-entity standardization symbol]. B, I, E, S and O represent the position of a character in an entity (here a character means a Chinese character or an English word): B stands for begin and corresponds to the beginning position of the entity name; I stands for inter and corresponds to a middle position of the entity name; E stands for end and corresponds to the ending position of the entity name; S stands for single and indicates that the entity consists of only one character; O stands for out and indicates that the character is not part of any entity name. "ORG" indicates that the entity type is the organization class; entity types can be freely defined according to task requirements, and other common entity types are "PER" (person name) and "LOC" (place name). For the standardization symbol, the name of an entity that has only one expression in the document is represented by S, the standard name of an entity with multiple expression forms is represented by F, and a non-standard name of such an entity, such as a short name or alias, is represented by A; by convention F is longer than A. In diagram (a) of FIG. 1, "Bank of Communications" (交通银行) is a standard name, and its abbreviation (交行) is a non-standard name of the same entity concept. "Bank of Communications" is an organization, so its first character (交) is labeled "B-ORG-F", representing the first character of the standard name of an organization entity.
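The [position-entity type-entity standardization symbol] scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name, the `concept_id` grouping key, the example sentence and the abbreviation "BoCom" are all hypothetical, tokens are English words, and "F is longer than A" is applied by treating the longest surface form of a concept as its standard name.

```python
from collections import defaultdict

def label_with_normalization(tokens, mentions):
    """Tag tokens in the style position-type-symbol.

    mentions: list of (start, end_exclusive, entity_type, concept_id), where
    concept_id is a hypothetical key grouping mentions of one entity concept.
    """
    # Collect the distinct surface forms used for each concept.
    forms = defaultdict(set)
    for start, end, _, concept in mentions:
        forms[concept].add(tuple(tokens[start:end]))

    tags = ["O"] * len(tokens)
    for start, end, etype, concept in mentions:
        length = end - start
        if len(forms[concept]) == 1:
            symbol = "S"   # only one expression in the document
        elif length == max(len(form) for form in forms[concept]):
            symbol = "F"   # assumption: the longest form is the standard name
        else:
            symbol = "A"   # short name or alias
        positions = ["S"] if length == 1 else ["B"] + ["I"] * (length - 2) + ["E"]
        for offset, pos in enumerate(positions):
            tags[start + offset] = f"{pos}-{etype}-{symbol}"
    return tags

tokens = ["Bank", "of", "Communications", "(", "BoCom", ")"]
mentions = [(0, 3, "ORG", "bocom"), (4, 5, "ORG", "bocom")]
print(label_with_normalization(tokens, mentions))
```

Note that "S" appears in two roles: as a position (a single-token entity) and as a standardization symbol (an entity with only one expression), so a tag such as "S-ORG-A" denotes a one-token alias.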

The composite label in diagram (b) of FIG. 1 is composed of the named entity recognition tag and the named entity relationship extraction tag, in the style [position-entity type-entity relationship]. B, I, E, S and O represent the position of a character in an entity. The sets of entity types and entity relationships must be defined in advance according to the requirements of the task. Here "ORG" indicates the organization class, "PER" the person name class, "LOC" the place name class, and "CF" the "Company-Founder" relationship. In diagram (b) of FIG. 1, "Li Ming" is the founder of "Sun Company" (both the person name and the company name are fictitious and serve only as an example), so the two are in a "CF" relationship. The first character of "Sun Company" is labeled "B-ORG-CF", indicating the first character of an organization entity in the "company-founder" relationship; similarly, the "Ming" of "Li Ming" is labeled "E-PER-CF", indicating the last character of a person name entity in the "company-founder" relationship. "Beijing" is a place name entity that has no defined entity relationship with other entities in the example text, so its first character is labeled "B-LOC-S", where "S" represents a single entity with no entity relationship.
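As a rough illustration of the [position-entity type-entity relationship] style, the sketch below turns annotated entity spans into composite tags. The function name, the span format and the English example sentence are assumptions made for illustration; because tokens here are English words, "Beijing" is a single token and receives position "S" rather than "B".

```python
def label_with_relations(tokens, entities):
    """Tag tokens in the style position-type-relation.

    entities: list of (start, end_exclusive, entity_type, relation_or_None).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype, relation in entities:
        rel = relation if relation is not None else "S"  # "S": no relation
        for i in range(start, end):
            if end - start == 1:
                pos = "S"
            elif i == start:
                pos = "B"
            elif i == end - 1:
                pos = "E"
            else:
                pos = "I"
            tags[i] = f"{pos}-{etype}-{rel}"
    return tags

tokens = ["Li", "Ming", "founded", "Sun", "Company", "in", "Beijing"]
entities = [(0, 2, "PER", "CF"), (3, 5, "ORG", "CF"), (6, 7, "LOC", None)]
for token, tag in zip(tokens, label_with_relations(tokens, entities)):
    print(token, tag)
```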

FIG. 2 is a flow chart of steps of an embodiment of the present invention, including the steps of:

Step 1. Clarify the requirements of the text language association task, and determine at least two associated text language information extraction subtasks according to the specific data set and application scenario, for example the named entity recognition subtask and the named entity relationship extraction subtask covered by diagram (b) of FIG. 1.

Step 2. Analyze the specific text corpus and, for each text language information extraction subtask, define a corresponding separate labeling system on the corpus; each task corresponds to a label set that includes the position of a character in an entity, the entity type, and so on.

For example, for the named entity recognition subtask, taking entity types that include the organization class and the person name class as an example, the defined tag set is { B-ORG, I-ORG, E-ORG, S-ORG, B-PER, I-PER, E-PER, S-PER, O }; for the named entity relationship extraction subtask, taking entity relationships that include Country-President, Company-Founder and Part-Whole as an example, the defined tag set is { e1-CP, e2-CP, e1-CF, e2-CF, e1-PW, e2-PW }, where e1 and e2 represent the role position within a pair of entities in a relationship; for instance, e1-CP represents the role of the country in the Country-President relationship.
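The two per-subtask tag sets in this example can be generated rather than written out by hand. The following sketch assumes hypothetical helper names (`ner_tag_set`, `relation_tag_set`) and simply enumerates the combinations listed above.

```python
from itertools import product

POSITIONS = ["B", "I", "E", "S"]   # position of a character inside an entity
ROLES = ["e1", "e2"]               # role position within a relationship pair

def ner_tag_set(entity_types):
    """Tag set of the named entity recognition subtask, plus the O tag."""
    return [f"{pos}-{etype}" for etype, pos in product(entity_types, POSITIONS)] + ["O"]

def relation_tag_set(relations):
    """Tag set of the relationship extraction subtask."""
    return [f"{role}-{rel}" for rel, role in product(relations, ROLES)]

print(ner_tag_set(["ORG", "PER"]))
print(relation_tag_set(["CP", "CF", "PW"]))
```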

Step 3. For subtasks that are associated with one another, combine their labels and optimize the part common to the labels of each subtask to form a composite labeling system.

For example, the tags of the named entity recognition subtask and of the named entity standardization subtask both contain the position of a character in an entity; when the tags of the two subtasks are combined, this common part can be optimized so that the two subtasks share the entity position information. In addition, when the labels of the subtasks are combined, the application scenario of the specific problem is taken into account for further optimization, so that the label set of the composite labeling system contains as few labels as possible. For example, for named entity recognition and named entity relationship extraction, the resulting composite labeling system is shown in FIG. 1 (b); it contains both the named entity recognition label and the entity relationship extraction label.
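One way to picture this combination step is sketched below. It is an assumption-laden illustration, not the patented procedure: the position prefix is written once and shared by both subtasks, and the label count is kept small by listing, per entity type, only the relationships that can occur in the application (the `allowed` mapping is hypothetical).

```python
POSITIONS = ["B", "I", "E", "S"]

def composite_tag_set(allowed):
    """Build position-type-relation tags.

    allowed: dict mapping entity type -> relationship symbols valid for that
    type in the application scenario (a hypothetical, task-specific input).
    """
    tags = ["O"]
    for etype, relations in allowed.items():
        # "S" marks an entity that takes part in no relationship.
        for rel in list(relations) + ["S"]:
            for pos in POSITIONS:          # position info shared by both subtasks
                tags.append(f"{pos}-{etype}-{rel}")
    return tags

allowed = {"ORG": ["CF"], "PER": ["CF"], "LOC": []}
tags = composite_tag_set(allowed)
print(len(tags), tags)
```

With these inputs the composite set has 21 labels; combinations such as "B-LOC-CF" are never generated because a place name cannot take part in the company-founder relationship in this scenario.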

Step 4.1. Label the corpus using the composite labeling system defined in step 3. The labeled result is shown in FIG. 1 (b).

Step 4.2. Split the labeled corpus into a training data set and a test data set.
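The text does not specify how the corpus is split. A minimal sketch, assuming a random sentence-level split with a hypothetical 20% test share and a fixed seed for reproducibility, might look like this:

```python
import random

def split_corpus(sentences, test_ratio=0.2, seed=0):
    """Randomly split labeled sentences into (train, test) lists.

    test_ratio and seed are illustrative assumptions, not values from the text.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = split_corpus(range(10))
print(len(train_set), len(test_set))
```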

Step 5.1. Select a specific machine learning (including deep learning) model, which can be a traditional machine learning model, such as a conditional random field, a hidden Markov model or another model based on a probabilistic graphical model, or a deep learning model based on a deep neural network.

Step 5.2. Define a cost function according to the machine learning model. A commonly used cost function in sequence labeling problems is the cross-entropy loss function J(θ), where θ represents the parameters of the model, m represents the number of training samples, y⁽ⁱ⁾ represents the true probability value of the i-th sample, x⁽ⁱ⁾ represents the i-th sample input, h_θ represents the mapping function of the model, and h_θ(x⁽ⁱ⁾) represents the predicted output probability value for the i-th sample input under the mapping of the model.
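The formula itself is not reproduced in this text. Under the symbol definitions given here, the form in which the cross-entropy loss is usually written is the following; this is the standard two-class textbook expression and is offered as a likely reading, not as a transcription of the original figure.

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
```

For a label set with more than two labels, as in a composite labeling system, the same idea is usually written with an additional sum over the labels k, that is, with the term y_k⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾)_k summed over k for each sample.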

Step 6. Decode, according to the composite labeling system, the label sequence that the machine learning model predicts from the input sequence, combining the labels predicted for adjacent characters to translate them into a readable entity extraction result. That is, the decoding stage of the composite labeling system extracts entity relationships according to the proximity principle.

For example, if the model labels "Li" as "B-PER-CF" and "Ming" as "E-PER-CF", the positions from "B" to "E" delimit an entity name and "PER" represents a person name, so the person name entity "Li Ming" is extracted; "CF" indicates that this entity is the person name entity in a "Company-Founder" relationship. Similarly, "Sun Company" is the organization entity in the "CF" relationship, so the relation triple (Sun Company, Company-Founder, Li Ming) is obtained. The other labeled sequences are decoded in the same way, yielding the entity labeling result that is finally output after the model predicts the input text sequence.
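The decoding described here can be sketched as two small functions: one that reassembles entities from the position letters, and one that pairs entities sharing a relationship symbol by proximity. The details are assumptions made for illustration: the pairing is greedy (each entity is matched to the nearest unused entity with the same relationship and a different type), `ROLE_ORDER` is a hypothetical table that decides which entity type comes first in a triple, tokens are English words, and tags are assumed to contain exactly two hyphens.

```python
# Hypothetical role order: which entity type comes first in a triple.
ROLE_ORDER = {"CF": ("ORG", "PER")}

def extract_entities(tokens, tags):
    """Return (start, text, entity_type, relation) for each decoded entity."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        pos, etype, rel = tag.split("-")
        if pos in ("B", "S"):
            start = i
        if pos in ("E", "S") and start is not None:
            entities.append((start, " ".join(tokens[start:i + 1]), etype, rel))
            start = None
    return entities

def pair_by_proximity(entities):
    """Greedily pair each entity with the nearest entity in the same relation."""
    triples, used = [], set()
    for i, (s1, text1, type1, rel1) in enumerate(entities):
        if rel1 == "S" or i in used:
            continue  # "S" marks an entity without a relationship
        candidates = [
            (abs(s2 - s1), j)
            for j, (s2, _, type2, rel2) in enumerate(entities)
            if j != i and j not in used and rel2 == rel1 and type2 != type1
        ]
        if not candidates:
            continue
        _, j = min(candidates)
        used.update((i, j))
        text2 = entities[j][1]
        head_type = ROLE_ORDER.get(rel1, (type1,))[0]
        triples.append((text1, rel1, text2) if type1 == head_type else (text2, rel1, text1))
    return triples

tokens = ["Li", "Ming", "founded", "Sun", "Company", "in", "Beijing"]
tags = ["B-PER-CF", "E-PER-CF", "O", "B-ORG-CF", "E-ORG-CF", "O", "S-LOC-S"]
print(pair_by_proximity(extract_entities(tokens, tags)))
```

The sketch does not check that the tags inside one entity agree on type and relationship; a production decoder would need to handle such inconsistent predictions.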

Step 7. During the training iterations, optimization is performed on the training data set, typically using a gradient descent algorithm with an adaptive learning rate, such as the Adam algorithm (Kingma D P, Ba J. Adam: A method for stochastic optimization [J]. arXiv preprint arXiv:1412.6980, 2014). At the same time, testing is performed on the test data set, and training is stopped when the result on the test data set declines. This safeguards both the fitting ability and the generalization ability of the model.
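The stopping rule can be sketched independently of any particular model. In the sketch below, `train_one_epoch` and `evaluate` are hypothetical callables standing in for one optimization pass over the training data set and one scoring pass over the test data set; the `max_epochs` cap is likewise an assumption.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100):
    """Train until the test-set score drops below the best score seen so far."""
    best_score, history = float("-inf"), []
    for _ in range(max_epochs):
        train_one_epoch()            # e.g. one pass of Adam updates
        score = evaluate()           # e.g. F1 on the test data set
        history.append(score)
        if score < best_score:
            break                    # result on the test set declined: stop
        best_score = score
    return best_score, history

# Simulated scores stand in for a real model in this illustration.
scores = iter([0.60, 0.70, 0.75, 0.74, 0.80])
best, history = train_with_early_stopping(lambda: None, lambda: next(scores))
print(best, history)
```

In practice the parameters from the best epoch would be saved and restored, and a patience of several epochs is often allowed before stopping; neither refinement is stated in the text.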

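Based on the same inventive concept, another embodiment of the present invention provides a device for labeling text language association relations, comprising the subtask determining module, the tag set definition module, the label combination module and the marking module described above.

Based on the same inventive concept, another embodiment of the present invention provides a machine learning model integrated training device supporting multiple tasks, comprising the data preparation module, the model selection module, the decoding module and the training module described above.

For the specific implementation of each module, refer to the foregoing description of the method of the present invention.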

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A method for labeling association relations of text language, characterized by comprising the following steps:

4) and labeling the text corpus according to the composite labeling system.

2. The method of claim 1, wherein the information extraction subtask of step 1) includes: a named entity identification subtask, a named entity standardization subtask and a named entity relationship extraction subtask.

3. The method according to claim 1, wherein step 2) defines a separate labeling system corresponding to each text language information extraction subtask on the corpus, and each information extraction subtask corresponds to a label set and comprises a position of a character in an entity and an entity type.

4. The method according to claim 1, wherein step 3) extracts subtasks for the information with the association relationship, combines the label set of each information extraction subtask, optimizes the common part in the labels of each information extraction subtask, forms a composite labeling system, and realizes the natural fusion of multiple tasks.

5. A text language incidence relation labeling device is characterized by comprising:

6. A machine learning model integrated training method supporting multiple tasks is characterized by comprising the following steps:

(1) labeling the text corpus according to a composite labeling system by adopting the method of any one of claims 1 to 4 to obtain a training data set and a test data set;

(2) selecting a specific machine learning model;

7. The method of claim 6, wherein multiple tasks are fully fused together by the composite annotation architecture, enabling integrated training without separate training of multiple stages of tasks.

8. The method of claim 6, wherein the machine learning model is a traditional machine learning model comprising a conditional random field, a hidden Markov model or another model based on a probabilistic graphical model, or is a deep learning model based on a deep neural network.

9. The method of claim 6, wherein the decoding extracts entity relationships on a proximity basis.

10. A machine learning model integrated training device supporting multiple tasks, comprising:

the data preparation module is used for labeling the text corpus according to a composite labeling system by adopting the method of any one of claims 1 to 4 to obtain a training data set and a test data set;