CN111126039B

CN111126039B - Relation extraction-oriented sentence structure information acquisition method

Info

Publication number: CN111126039B
Application number: CN201911355241.7A
Authority: CN
Inventors: 秦永彬; 杨卫哲; 程华龄; 陈艳平; 黄瑞章; 王凯
Original assignee: Guizhou University
Current assignee: Guizhou University
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2022-04-01
Anticipated expiration: 2039-12-25
Also published as: CN111126039A

Abstract

The invention discloses a relation extraction-oriented sentence structure information acquisition method comprising the following steps: step one, extracting from a data set relation mention sentences that contain two entities and have a known entity semantic relation category; step two, marking and separating the entities in the relation mention sentences extracted in step one, using entity markers and separators; step three, mapping the text to vectors using a pre-trained word vector lookup table or a randomly initialized word vector lookup table; step four, applying a convolution operation, through a neural network, to the vector matrix representing the text in order to extract sentence structure features; step five, applying a max pooling operation to the convolution result to obtain more abstract features; and step six, predicting the classification result through a fully connected layer and a Softmax layer. Because the sentence entities are marked and separated before the convolutional neural network, the semantic information of each part of the content can be captured better, entity-centered sentence structure features are obtained, and relation extraction based on them can achieve better performance.

Description

Relation extraction-oriented sentence structure information acquisition method

Technical Field

The invention relates to a method for processing the data that is input into a neural network, in particular to a relation extraction-oriented sentence structure information acquisition method, and belongs to the technical field of natural language processing.

Background

With the rapid worldwide spread of computers and the rapid development of internet technology, data of all kinds, such as video, audio, pictures and text, is growing quickly, and a large amount of information reaches users in electronic digital form. To meet the serious challenge of this information explosion, automated tools are urgently needed to extract the truly valuable information from massive data; this is the task of information extraction. Information extraction technology is widely applied in natural language processing, and relation extraction is an important component of text information extraction. Named entities are the proper nouns in a text that denote names of people, places and organizations; relation extraction means extracting, from a text in which an entity pair has been marked, the semantic relation that holds between the two entities. For example, take the sentence from the ACE RDC 2005 data set "But according to the estimate of the European security and cooperation organization, at least 1000 people are held in jail": for the two named entities "1000 people" and "jail" in the sentence, a relation extraction system can recognize that a "PHYS" (geographical location) relation holds between them.

Information extraction aims to extract structured information from large-scale unstructured or semi-structured natural language text; its main tasks are entity extraction, relation extraction and event extraction. Relation extraction research is mainly concerned with extracting the semantic relations between entities from text content. Because semantic relations are important carriers of semantic knowledge in a text, relation extraction plays an important role in information extraction. Since it was proposed as one of the subtasks of information extraction, it has received close attention from the academic community and has been studied extensively.

Named entities appear in text as runs of consecutive characters. After the entities have been recognized and marked in the text, a relation extraction method is used to recognize the semantic relation of the entity pair; different word representation methods mainly serve to resolve the ambiguity that arises when the same word expresses different meanings in different contexts. The entity marks therefore split the originally unified text into segments, and the semantic features extracted from each segment can be used to extract the entity semantic relation. The same characters often carry different semantic information in different contexts, so to preserve the integrity of the original text semantics, each part of the text produced by entity segmentation needs to be pooled separately to extract features.

From the linguistic point of view, Chinese and English cultures differ and Chinese and Western ways of thinking differ; because Chinese characters and English arose in different background environments, the information structures of the two languages and the ways they realize information differ considerably. Languages are divided into analytic and synthetic types. The main feature of analytic languages is a fixed word order, while the main feature of synthetic languages is a flexible word order. English belongs to the Indo-European language family and is a synthetic language; its structure relies more on formal analysis and logical reasoning, its grammar is strict and it uses many clause forms, so sentences are generally long. Chinese belongs to the Sino-Tibetan language family and is an analytic language; its word order is generally fixed, it has no inflection, words are combined into sentences by means of word order and function words, and short sentences are common. As an analytic language, Chinese places stricter structural requirements on word combinations. Consider the text "Nanjing Yangtze River Bridge is located in Gulou District, Nanjing City". Entity extraction must be performed before relation extraction, but because combinations of Chinese characters easily produce ambiguity and depend heavily on how the text is segmented, different combinations yield different meanings, which in turn affects the result of natural language processing. In the sentence above, the possible character combinations include "Nanjing City" (南京市), "mayor of Nanjing" (南京市长), "Yangtze River Bridge" (长江大桥), "Jiang Daqiao" (江大桥, readable as a person's name) and so on. Clearly, the entity pairs obtained from different combinations have a significant influence on the relation extracted between the entities.
The entities "Nanjing City" and "Yangtze River Bridge" stand in a place-contains-place relation, whereas if "Jiang Daqiao" is identified as a person-name entity, it stands in a relation between a person and a place with the entity taken from the "mayor of Nanjing" segmentation, and so on. In English the same sentence is written with "Nanjing Yangtze River Bridge", so entity marking can only produce "Nanjing", "Yangtze River Bridge" or "Gulou District", and an ambiguous mark such as "mayor of Nanjing" in the Chinese text cannot arise. It follows that in relation extraction, if differently combined entities impose different text combination structures, the extraction results will differ markedly. Highlighting the structure and the combination pattern of the entities in the text, and acquiring more semantic information through these structural features, therefore affects many natural language processing tasks, relation extraction among them.

At the theoretical level, research on relation extraction can provide support for other natural language processing technologies and is well worth pursuing. Relation extraction is of research significance for semantic role labeling, discourse understanding and machine translation. In 2013, structural information was extracted with a pattern-matching method, and a dynamic pattern library was used to improve extraction accuracy, but the recognition quality was affected by the word segmentation structure and by the presence of specialized vocabulary. Existing machine learning methods for relation extraction are divided into supervised, semi-supervised and unsupervised methods, among others. Supervised methods generally treat relation extraction as a classification problem, that is, classifying the relation expressed for a given entity pair in a given sentence, and generally require the relation categories to be defined in advance. In 2012, Socher et al. began to address relation extraction with a recursive neural network: the sentence is first parsed, and a vector representation is then learned for each node of the syntax tree. Starting from the word vectors at the leaves of the syntax tree, the recursive neural network combines representations iteratively according to the syntactic structure of the sentence, finally yielding a vector representation of the sentence that is used for relation classification. This method takes the syntactic structure of the sentence into account effectively, but it cannot take into account the positions and semantic information of the two entities in the sentence. Semi-supervised methods such as bootstrapping reduce the dependence on labeled corpora during training and lower the cost of manual labeling, but they suffer from semantic drift.
Unsupervised methods mainly use clustering algorithms and can be applied to large-scale open-domain information, but they have difficulty naming the relations accurately. Unsupervised entity relation extraction does not depend on a corpus labeled with entity relations and consists of two processes: clustering of relation instances and selection of relation type words. Entity pairs are first clustered by the similarity of the contexts in which they appear, and representative words are then selected to label each relation.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: while making full use of the complete information of the sentence text, an entity marking strategy is adopted and neural network technology is introduced, so that the ability of a neural network to extract high-dimensional abstract features automatically, layer by layer, is fully exploited, and the structural features obtained by convolving and pooling each part of the entity-marked text are extracted. Entity semantic relation extraction is carried out with the entity marks in the sentence, so that the neural network obtains the relative position information and semantic relation information between the entity pair and the words other than the entities. The structural information of the sentence centered on the two entities is thereby obtained, and the feature sparsity problem of traditional machine learning methods is avoided to a certain extent, which improves relation extraction performance and effectively addresses the problem that sentence structure information cannot be used well.

The technical scheme of the invention is as follows: a relation extraction-oriented sentence structure information acquisition method comprises the following steps: step one, extracting from a data set (the ACE or SemEval data set) relation mention sentences that contain two entities and have a known entity semantic relation category; step two, marking and separating the entities in the relation mention sentences extracted in step one, using entity markers and separators; step three, mapping the text to vectors using a pre-trained word vector lookup table or a randomly initialized word vector lookup table; step four, applying a convolution operation, through a neural network, to the vector matrix representing the text in order to extract sentence structure features; step five, applying a max pooling operation to the convolution result to obtain more abstract features; and step six, predicting the classification result through a fully connected layer and a Softmax layer.

In step one, sentences containing entity pairs are extracted from a large unstructured text data set. The method is mainly applied to Chinese data, and the data set used is ACE RDC 2005. The data is stored in XML files. XML is an inherently hierarchical data format whose most natural representation is a tree, so a tree-structure parsing method is used to obtain the relation mention sentences from the data set. The data extraction of step one uses an XML parsing module that implements a simple and efficient API for parsing and creating XML data, and extracts the sentences and entities from the XML files of the data set. ElementTree represents the whole XML document as a tree, and Element represents a single node of that tree; in the data set, the tree nodes are entity mentions, relations, entity heads and the like. Interactions with the whole document (reading and writing files) are usually done at the ElementTree level, while interactions with a single XML element and its sub-elements are done at the Element level. In step one, the tree structure is used to parse the articles in the ACE RDC 2005 data set, and the relation mention sentences, entity contents, relation types and other information are extracted and stored in the format "entity 1, entity 2, relation mention sentence, semantic relation". This fully exploits the ability of the neural network to extract features automatically, layer by layer: the sentence structure information formed by the entities and the words other than the entities is obtained, and loss of semantic information is effectively prevented.
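The tree-based extraction described above can be sketched as follows. This is a minimal illustration, assuming that Python's standard `xml.etree.ElementTree` module is the parser meant; the XML layout and the tag names `relation`, `mention`, `sentence`, `arg1` and `arg2` are hypothetical simplifications and do not reproduce the actual ACE RDC 2005 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation file; the real ACE RDC 2005 schema differs.
SAMPLE_XML = """
<document>
  <relation type="PHYS">
    <mention>
      <sentence>at least 1000 people are held in jail</sentence>
      <arg1>1000 people</arg1>
      <arg2>jail</arg2>
    </mention>
  </relation>
</document>
"""


def extract_relation_mentions(xml_text):
    """Return (entity1, entity2, sentence, relation) tuples from the XML tree."""
    root = ET.fromstring(xml_text.strip())
    records = []
    # Walk every relation node, then every mention node below it.
    for relation in root.iter("relation"):
        relation_type = relation.get("type")
        for mention in relation.iter("mention"):
            records.append((
                mention.findtext("arg1"),
                mention.findtext("arg2"),
                mention.findtext("sentence"),
                relation_type,
            ))
    return records


records = extract_relation_mentions(SAMPLE_XML)
```

Each tuple follows the storage order given in the text: entity 1, entity 2, relation mention sentence, semantic relation.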

In step two, the only two entities in the relation mention sentence extracted in step one are brought to the head of the sentence. Marker symbols first mark the start and end positions of each of the two entities; the marked entity pair is then copied to the start of the sentence, with the character "0" separating entity from entity and entity from sentence. The intent is for the neural network to sense the presence and position of the entity pair in the sentence, and in this way to acquire the structural information of the sentence.

In step three, the constructed CNN model comprises an input layer, a hidden layer and an output layer. Word vector mapping is applied to the text: sentences are converted into vector matrices that serve as the input of the network.

According to the character vector features and the format required in natural language processing, the characters in the text are mapped to vectors using a randomly initialized character vector lookup table and a loaded pre-trained character vector lookup table, giving the vector representation matrix X of the text.

In step four, a convolution operation is applied to the vector matrix X obtained from the word vector lookup table; the convolution result is C, where C = conv(X).

In its data processing part, the scheme uses special markers and separators to mark and separate the entities of the sentences extracted from the ACE RDC 2005 Chinese data set, and brings the entities to the head of the sentence. The abstract structural features obtained by convolution in the neural network are then max-pooled, yielding high-level abstract features that express how the context as a whole bears on the entity semantic relation; the classification result of the Chinese relation extraction task is then obtained, and better performance can be achieved.

The beneficial effects of the invention are as follows. Compared with the prior art, the technical scheme adopts entity marking and separation strategies while making full use of the complete information of the sentence text, and introduces neural network technology. It fully exploits the ability of a neural network to extract high-dimensional abstract features automatically, layer by layer, and extracts the pooled features of each part of the text after entity marking and separation. This avoids, to a certain extent, the feature sparsity problem of traditional machine learning methods and thereby improves relation extraction performance. The layered, automatic feature extraction of the neural network is combined with the advantage that, in sentences with marked and separated entities, the words other than the entities bear more strongly on the overall semantic relation, so that excellent relation extraction performance is obtained.

Drawings

FIG. 1 is a schematic diagram of the extraction technique of the present invention;

FIG. 2 is a model diagram of the present invention;

FIG. 3 is a schematic diagram of the entity tagging and separation method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

Example 1: as shown in FIG. 1 to FIG. 3, a relation extraction-oriented sentence structure information acquisition method comprises the following steps: step one, extracting from a data set (the ACE or SemEval data set) relation mention sentences that contain two entities and have a known entity semantic relation category; step two, marking and separating the entities in the relation mention sentences extracted in step one, using entity markers and separators; step three, mapping the text to vectors using a pre-trained word vector lookup table or a randomly initialized word vector lookup table; step four, applying a convolution operation, through a neural network, to the vector matrix representing the text in order to extract sentence structure features; step five, applying a max pooling operation to the convolution result to obtain more abstract features; and step six, predicting the classification result through a fully connected layer and a Softmax layer.

In step one, sentences containing entity pairs are extracted from a large unstructured text data set. The method is mainly applied to Chinese data, and the data set used is ACE RDC 2005. The data is stored in XML files. XML is an inherently hierarchical data format whose most natural representation is a tree, so a tree-structure parsing method is used to obtain the relation mention sentences from the data set. The data extraction of step one uses an XML parsing module that implements a simple and efficient API for parsing and creating XML data, and extracts the sentences and entities from the XML files of the data set. ElementTree represents the whole XML document as a tree, and Element represents a single node of that tree; in the data set, the tree nodes are entity mentions, relations, entity heads and the like. Interactions with the whole document (reading and writing files) are usually done at the ElementTree level, while interactions with a single XML element and its sub-elements are done at the Element level. In step one, the tree structure is used to parse the articles in the ACE RDC 2005 data set, and the relation mention sentences, entity contents, relation types and other information are extracted and stored in the format "entity 1, entity 2, relation mention sentence, semantic relation". This fully exploits the ability of the neural network to extract features automatically, layer by layer: the sentence structure information formed by the entities and the words other than the entities is obtained, and loss of semantic information is effectively prevented.

In step two, the only two entities in the relation mention sentence extracted in step one are brought to the head of the sentence. A pair of start and end symbols marks the start and end positions of entity 1, and a second pair marks the start and end positions of entity 2. The marked entity pair is then copied to the start of the sentence, and the character "0" separates entity from entity and entity from sentence, so that the neural network can sense the presence and position of the entity pair in the sentence and in this way acquire the structural information of the sentence.

Let the original sentence be S. In the general case, where both entities consist of several Chinese characters, the part to the left of entity 1 (Left) exists, the part between entity 1 and entity 2 (Middle) is not empty, and the part to the right of entity 2 (Right) also exists, the original sentence S can be expressed as:

$S = (s_1, s_2, \ldots, s_i, s_{i+1}, \ldots, s_{i+k}, s_{i+k+1}, \ldots, s_j, s_{j+1}, \ldots, s_{j+t}, s_{j+t+1}, \ldots, s_n)$,

where $s_{i+1}, \ldots, s_{i+k}$ and $s_{j+1}, \ldots, s_{j+t}$ are the two entities in the original sentence. The entity marking method proposed in step two processes the sentence $S$ into $S'$:

$S' = (\ldots, s_{j+1}, \ldots, s_{j+t}, \mathrm{Mark}_{2\_End}, s_{j+t+1}, \ldots, s_n)$,

where the start and end marks of an entity, such as $\mathrm{Mark}_{2\_End}$ for the end of entity 2, represent the boundaries of the entity and may be replaced with various symbols. The entity marks used in this embodiment are: entity 1 start <*>, entity 1 end </*>, entity 2 start <#>, entity 2 end </#>.
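A minimal sketch of the marking and separation of step two follows. The function name `mark_and_front` and the use of character offsets are illustrative assumptions; the marks and the "0" separator are the ones named in this embodiment. The sketch also assumes that separators are placed only after the two entities copied to the sentence head, which is one reading of the description.

```python
def mark_and_front(sentence, e1_span, e2_span,
                   tags=("<*>", "</*>", "<#>", "</#>"), sep="0"):
    """Mark both entities in place, then copy the marked pair to the sentence head.

    e1_span and e2_span are (start, end) character offsets, end exclusive.
    Entity 1 must precede entity 2 and the spans must not overlap.
    """
    (a, b), (c, d) = e1_span, e2_span
    if not (0 <= a < b <= c < d <= len(sentence)):
        raise ValueError("expected non-overlapping spans with entity 1 first")
    e1 = tags[0] + sentence[a:b] + tags[1]
    e2 = tags[2] + sentence[c:d] + tags[3]
    # Left + marked entity 1 + Middle + marked entity 2 + Right
    marked = sentence[:a] + e1 + sentence[b:c] + e2 + sentence[d:]
    # Assumption: the separator follows each fronted entity only.
    return e1 + sep + e2 + sep + marked


s = "at least 1000 people are held in jail"
i, j = s.index("1000 people"), s.index("jail")
processed = mark_and_front(s, (i, i + len("1000 people")), (j, j + len("jail")))
```

Passing the marks and the separator as parameters reflects the statement that the marks may be replaced with various symbols.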

In step three, the structure of the constructed CNN model is shown in FIG. 2; it comprises an input layer, a hidden layer and an output layer. Word vector mapping is applied to the text, and sentences are converted into vector matrices that serve as the input of the network. The neural network built in the experiments is a CNN whose hidden layer comprises a convolution layer, a pooling layer and a fully connected layer. For the lookup table part of the model, according to the character vector features and the format required by the Chinese relation extraction task, the lookup table is either randomly initialized or loaded from pre-trained character vectors; the characters of the entity-marked text are mapped to vectors, giving the vector representation matrix X of the text.
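The lookup-table mapping can be sketched in plain Python as below. The names `build_lookup_table`, `to_matrix` and the `<unk>` fallback token are hypothetical, as are the uniform initialization range and the toy 4-dimensional vectors; a practical implementation would normally use the embedding layer of a tensor library.

```python
import random

UNK = "<unk>"  # hypothetical fallback token for symbols missing from the table


def build_lookup_table(vocab, dim, pretrained=None, seed=0, scale=0.25):
    """Map each token to a dim-sized vector: pre-trained if given, else random."""
    rng = random.Random(seed)
    pretrained = pretrained or {}
    table = {}
    for token in sorted(set(vocab) | {UNK}):
        if token in pretrained:
            table[token] = list(pretrained[token])
        else:
            # Assumption: uniform random initialization in [-scale, scale].
            table[token] = [rng.uniform(-scale, scale) for _ in range(dim)]
    return table


def to_matrix(tokens, table):
    """Stack the token vectors into the matrix X, one row per token."""
    return [table[tok] if tok in table else table[UNK] for tok in tokens]


table = build_lookup_table(["<*>", "</*>", "a", "b"], dim=4,
                           pretrained={"a": [1.0, 2.0, 3.0, 4.0]})
X = to_matrix(["<*>", "a", "</*>", "z"], table)
```

Here the entity marks are treated as ordinary tokens with their own vectors, so the boundary information survives the mapping into X.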

In step four, a convolution operation is applied to the vector matrix X obtained from the word vector lookup table; the convolution result is C, where C = conv(X). Multi-layer convolution maps the input layer by layer and as a whole forms a complex function; training learns the weights required by each local mapping and can be regarded as a function fitting process, so features can be extracted by the convolution operation. Steps one and two mark the start and end of each entity in the sentence, so the entity boundary marks are available and are mapped to vectors together with the text; the convolution operation then yields the abstract features of the sentence, and the features obtained in this way are called sentence structure features.
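As an illustration of C = conv(X), the sketch below slides each filter over windows of consecutive rows of X. The window height, the ReLU activation and the function name `conv1d` are assumptions, since the description does not fix the kernel shape or the activation function.

```python
def conv1d(X, kernels, biases):
    """Valid 1-D convolution along the token axis, followed by ReLU.

    X: n rows of d floats. kernels: filters, each h rows of d floats.
    Returns C with one row per window position and one column per filter.
    """
    h = len(kernels[0])
    C = []
    for start in range(len(X) - h + 1):
        window = X[start:start + h]
        row = []
        for W, b in zip(kernels, biases):
            # Sum of element-wise products between the filter and the window.
            z = b + sum(w * x
                        for w_row, x_row in zip(W, window)
                        for w, x in zip(w_row, x_row))
            row.append(max(0.0, z))  # ReLU is an assumed activation
        C.append(row)
    return C


X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning two tokens
C = conv1d(X_demo, [kernel], [0.0])
```

With three tokens and a filter spanning two of them there are two window positions, so C has two rows.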

In step five, abstract features are extracted by applying a max pooling operation to the convolution result. This part extracts abstract features while retaining as much of the original sentence information as possible: it reduces the size of the sentence representation, enlarges the receptive field of the convolution kernels, extracts high-level features, reduces the number of neural network parameters and helps prevent overfitting during training. It relies mainly on the marking and separation of the entity parts in the text, which make the vectorized result easier for the neural network to perceive, so that sentence structure information is obtained.
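The pooling step can be sketched as a maximum taken over all window positions for each filter. Pooling over the whole sentence (rather than over fixed-size windows) is an assumption, and `max_over_time` is a hypothetical name.

```python
def max_over_time(C):
    """Keep, for each filter column of C, the maximum over all window positions."""
    if not C:
        raise ValueError("empty feature map")
    return [max(column) for column in zip(*C)]


# Two window positions, two filters: each filter keeps its strongest response.
pooled = max_over_time([[2.0, 0.5], [1.0, 3.0]])
```

The output length equals the number of filters regardless of sentence length, which is what lets sentences of different lengths share one fully connected layer.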

The structural features of the sentence are obtained as follows. In data preprocessing, the sentence on which relation extraction is performed is entity-marked with special symbols according to the two entities that appear in it. The marked Entity 1 and Entity 2 are placed before the sentence, and an entity separator is placed between Entity 1 and Entity 2 and between the entities and the three parts Left, Middle and Right of the sentence. Because the Chinese characters that form an entity in Chinese, or the English words that form a compound entity in English, belong to a single entity, each entity and each run of words other than the entities can be treated as a whole; after separation, the influence of each part on the entity semantic relation is obtained, and the neural network is used to obtain the structural features of the sentence. The meanings of "Left", "Entity 1", "Middle", "Entity 2" and "Right" are as follows.

Left: the content of the sentence to the left of entity 1;

Entity 1: entity 1;

Middle: the content of the sentence between entity 1 and entity 2;

Entity 2: entity 2;

Right: the content of the sentence to the right of entity 2.
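The five parts defined above can be illustrated with a short sketch. The names `Parts` and `split_parts` and the character-offset interface are hypothetical; the example sentence is invented so that the Right part is not empty.

```python
from typing import NamedTuple


class Parts(NamedTuple):
    left: str
    entity1: str
    middle: str
    entity2: str
    right: str


def split_parts(sentence, e1_span, e2_span):
    """Split a sentence into Left, Entity 1, Middle, Entity 2 and Right."""
    (a, b), (c, d) = e1_span, e2_span
    return Parts(sentence[:a], sentence[a:b], sentence[b:c],
                 sentence[c:d], sentence[d:])


text = "at least 1000 people are held in jail today"
i, j = text.index("1000 people"), text.index("jail")
parts = split_parts(text, (i, i + len("1000 people")), (j, j + len("jail")))
```

Concatenating the five parts in order reproduces the original sentence, so no text is lost by the split.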

In step six, a fully connected layer and a Softmax operation are applied to the result of the preceding vectorization, convolution and pooling operations to obtain the output of the neural network.
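A minimal sketch of the classification step follows, assuming a single fully connected layer before Softmax. The weights, the function names and the second label "OTHER" are hypothetical values chosen for illustration; only "PHYS" comes from the description.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, W, b, labels):
    """Apply one fully connected layer, then Softmax; return (label, probabilities)."""
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


label, probs = classify([2.0, 3.0],
                        W=[[1.0, 0.0], [0.0, 1.0]],  # hypothetical weights
                        b=[0.0, 0.0],
                        labels=["PHYS", "OTHER"])
```

The predicted category is the label with the highest Softmax probability; in training, these probabilities would be compared with the known relation category of each relation mention sentence.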

The present invention is further described with reference to the following example:

First, step one is executed: the sentences and entities of the ACE RDC 2005 Chinese data set are obtained by tree parsing. Step two is then executed: the entities in the obtained relation mention sentences are marked, separated and brought to the head of the sentence. Step three follows: a vector lookup table maps the words of the entity-marked text to vectors, giving the vector representation matrix X of the text. Step four is then executed: the convolution operation is applied to the vectorized matrix X. Step five follows: abstract features are extracted by max pooling the convolution result. Finally, step six is executed: full connection and Softmax are applied to the pooled features, and the result is output.

For example, step one is executed and the sentence "But according to the estimate of the European security and cooperation organization, at least 1000 people are held in jail" is extracted from the data set; the two entities in the sentence are entity 1 "1000 people" and entity 2 "jail", and the semantic relation between them is "PHYS" (geographical location relation). Step two is then executed, and entity marking and separation process the sentence into "<*>1000 people</*>0<#>jail</#>0But according to the estimate of the European security and cooperation organization, at least <*>1000 people</*> are held in <#>jail</#>". This lets the neural network obtain the position of each entity and its start and end, then an overall representation of the entity, and from that the entity-centered sentence structure information. Step three is then executed: all characters in the sentence are vectorized by looking them up in a Google-News pre-trained character vector lookup table and in a character vector lookup table randomly generated within a certain range. Step four is then executed: the convolution operation is applied to the vectorized matrix. Step five follows: abstract features are extracted by max pooling the convolution result. Finally, a fully connected layer performs feature fusion and a Softmax layer predicts the result; in this way the sentence structure features are obtained through the convolutional neural network.

In conclusion, experiments show that the relation extraction-oriented sentence structure information acquisition method provided by the invention has excellent performance.

The scheme of the invention marks and separates the sentence entities before the convolutional neural network. It can thereby better capture the semantic information of each part of the content, and the influence of each part of the sentence on the semantic relation of the two entities; it obtains entity-centered sentence structure features, performs relation extraction, and can achieve better performance.

Matters not described in detail in the present invention are known to those skilled in the art. Finally, the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions are covered by the claims of the present invention.

Claims

1. A relation extraction-oriented sentence structure information acquisition method, characterized in that the method comprises the following steps: step one, extracting from a data set relation mention sentences that contain two entities and have a known entity semantic relation category; step two, marking and separating the entities in the relation mention sentences extracted in step one, using entity markers and separators; step three, mapping the text to vectors using a pre-trained word vector lookup table or a randomly generated word vector lookup table; step four, applying a convolution operation, through a neural network, to the vector matrix representing the text in order to extract sentence structure features; step five, applying a max pooling operation to the convolution result to obtain more abstract features; and step six, predicting the classification result through a fully connected layer and a Softmax layer.

2. The relation extraction-oriented sentence structure information acquisition method according to claim 1, characterized in that: in step three, the constructed CNN model comprises an input layer, a hidden layer and an output layer; word vector mapping is applied to the text, and sentences are converted into vector matrices that serve as the input of the network; according to the character vector features and the format required in natural language processing, the characters in the text are mapped to vectors using a randomly generated word vector lookup table and a loaded pre-trained word vector lookup table to obtain the vector representation matrix X of the text.

3. The relation extraction-oriented sentence structure information acquisition method according to claim 1, characterized in that: in step four, a convolution operation is applied to the vector matrix X obtained from the word vector lookup table, and the convolution result is C, where C = conv(X).

4. The relation extraction-oriented sentence structure information acquisition method according to claim 1, characterized in that: the data set is the ACE or SemEval data set.