CN110532553B

CN110532553B - Water conservancy space relation word recognition and extraction method

Info

Publication number: CN110532553B
Application number: CN201910771664.0A
Authority: CN
Inventors: 冯钧; 相颖; 夏佩佩; 陆佳民; 朱跃龙
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2019-08-21
Filing date: 2019-08-21
Publication date: 2023-08-22
Anticipated expiration: 2039-08-21
Also published as: CN110532553A

Abstract

The invention discloses a method for recognizing and extracting water conservancy spatial relation words, comprising the following steps: acquiring a spatial relation seed set based on quantitative statistical features; constructing original syntactic patterns; generalizing the syntactic patterns, that is, merging several original patterns that express similar spatial relations into one pattern, which reduces the number of patterns and raises their level of abstraction; and extracting spatial relations based on the generalized syntactic patterns. The invention addresses spatial relation extraction in the water conservancy field. Using a weakly supervised method, it automatically recognizes spatial relations, builds the spatial relation word set, acquires spatial relation syntactic patterns, and extracts spatial relation tuples, saving a great deal of manpower and time. It thereby extracts spatial-relation-oriented water conservancy data resources, converts free text in the water conservancy field into structured data, and supplements the knowledge graph with spatial relations at scale and with domain expertise, providing more accurate query services for users.

Description

Water conservancy space relation word recognition and extraction method

Technical Field

The invention relates to the technical field of water conservancy business, in particular to a method for identifying and extracting water conservancy space relation words.

Background

With the rapid development of internet technology, the water conservancy sector has accumulated massive amounts of data with spatial relations, including a large number of official documents. Because natural language text is an important source of spatial data, extracting spatial relation data from text is an important research direction in the water conservancy field.

The main purpose of information extraction is to extract specific factual information from text, that is, to convert unstructured natural language text into structured or semi-structured data and store it, so that people can acquire knowledge conveniently and quickly. The extracted data can support detailed mining and analysis, and information extraction plays an important role in other areas of natural language processing, such as knowledge graph construction and intelligent question answering (QA) systems. Relation extraction is a significant part of information extraction; it has received increasing attention from researchers in recent years and has become a research hotspot. Therefore, where spatial relations are concerned, automatic relation extraction should be used to supplement the nodes of the water conservancy data resource knowledge graph with spatial semantic information, and the supplemented spatial semantics must meet the application requirements of water conservancy services.

Traditional entity relation extraction relies mainly on rule matching, which requires assistance from many linguistic experts: effective relation features are selected according to the linguistic structure of the corpus, and rules are written by hand to match and extract relations. As the primary early approach, it achieved some success in obtaining entity relations and gives good results in specific domains or on small corpora. However, writing rules by hand is time-consuming and labor-intensive, and rewriting the rules for each new domain is expensive.

Word segmentation: English words are separated by spaces, so segmenting English text only requires splitting on spaces. Chinese, by contrast, is written in characters with no delimiter between words, while the word is the smallest meaningful linguistic unit in Chinese text, so word segmentation is the basis of and key to Chinese information processing. Chinese word segmentation techniques fall into three main categories: methods based on dictionary matching, methods based on word frequency statistics, and methods based on knowledge understanding.
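As an illustration of the dictionary-matching category only, the following is a minimal sketch of forward maximum matching, one common dictionary-based strategy. The source does not state which segmenter the invention uses; the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching over a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, invented for illustration.
toy_dictionary = {"水库", "位于", "长江", "上游"}
tokens = forward_max_match("水库位于长江上游", toy_dictionary)
print(tokens)
```

Characters that match no dictionary entry are emitted one at a time, which is the usual fallback for this strategy.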

Part-of-speech tagging: part-of-speech (POS) tagging labels each word with its part of speech, that is, it determines whether the word is a verb, noun, adjective, or another part of speech. In Chinese, the part of speech of a word varies little, so tagging is relatively simple: most words have only one part of speech, or their most frequent part of speech is far more frequent than the second most frequent one. Simply choosing the most frequent part of speech lets the accuracy of Chinese part-of-speech tagging reach 80%. More accurate tagging can be achieved with hidden Markov models (HMMs).
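The most-frequent-tag baseline mentioned above can be sketched as follows. This is a toy illustration, not the tagger used by the invention; the corpus, the tag letters, and the `default_tag` fallback for unseen words are assumptions.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default_tag="n"):
    """Tag each word with its most frequent tag; unseen words get default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


# Toy tagged corpus, invented for illustration.
toy_corpus = [
    [("水库", "n"), ("位于", "v"), ("上游", "f")],
    [("大坝", "n"), ("位于", "v"), ("下游", "f")],
    [("位于", "p"), ("上游", "f")],
]
model = train_most_frequent_tag(toy_corpus)
tagged = tag_words(["水库", "位于", "闸门"], model)
print(tagged)
```

An HMM tagger would additionally use tag-to-tag transition probabilities, which this baseline ignores.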

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for recognizing and extracting water conservancy spatial relation words. Using a weakly supervised method, the method automatically recognizes spatial relations, builds the spatial relation word set, acquires spatial relation syntactic patterns, and extracts spatial relation tuples. It thereby extracts spatial-relation-oriented water conservancy data resources, converts free text in the water conservancy field into structured data, and supplements the knowledge graph with spatial relations at scale and with domain expertise, providing more accurate query services for users.

In order to solve the technical problems, the invention provides a method for identifying and extracting water conservancy space relation words, which comprises the following steps:

(1) Acquiring a spatial relationship seed set based on quantitative statistical characteristics;

(2) Constructing an original syntactic pattern $P = \langle e_1, e_2, r, C, Pos_{e_1}, Pos_{e_2}, Pos_r \rangle$, where $P$ denotes a syntactic pattern, $r$ denotes a spatial relation word, $C$ denotes the set $\{w_1, w_2, \ldots, w_n\}$ of words in the sentence other than the entities, and $e_1$ and $e_2$ are the type labels of the two water conservancy entities;

(3) Generalizing the syntactic patterns, that is, merging several original syntactic patterns that express similar spatial relations into one pattern, which reduces the number of patterns and raises their level of abstraction;

(4) Extracting spatial relations based on the generalized syntactic patterns.
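One possible in-memory representation of the pattern tuple from step (2) is sketched below. Treating the Pos components as token indices, and the helper `build_pattern` with its arguments, are assumptions for illustration; the source does not specify how positions are encoded.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SyntacticPattern:
    e1: str                    # type label of the first entity
    e2: str                    # type label of the second entity
    r: str                     # spatial relation word
    context: Tuple[str, ...]   # C: words of the sentence other than the entities
    pos_e1: int                # assumed here to be token indices
    pos_e2: int
    pos_r: int


def build_pattern(tokens, e1_index, e2_index, r_index, e1_type, e2_type):
    """Assemble a pattern from a segmented sentence and known indices."""
    context = tuple(t for i, t in enumerate(tokens) if i not in (e1_index, e2_index))
    return SyntacticPattern(e1_type, e2_type, tokens[r_index], context,
                            e1_index, e2_index, r_index)


# Toy sentence and entity type labels, invented for illustration.
tokens = ["三峡水库", "位于", "长江", "上游"]
pattern = build_pattern(tokens, 0, 2, 1, "reservoir", "river")
print(pattern)
```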

Preferably, in the step (1), the acquisition of the spatial relationship seed set based on the quantitative statistical feature specifically includes the following steps:

(11) Data preprocessing: word segmentation and part-of-speech tagging are performed on the entity co-occurrence sentences to form a word set, and stop words such as "yes" and "handle" are filtered out;

(12) Feature selection and statistics: the distribution of spatial relation words in sentences is obtained by counting 7 features: (a) the part of speech POS; (b) the location LOC of the relation word relative to the water conservancy entities; (c) the position LCCP (to the left of, to the right of, or between the two entities) when a conjunction or preposition stands to the left of the spatial relation word; (d) the distance DIS1 from the spatial relation word to entity 1; (e) the distance DIS2 from the spatial relation word to the end of the sentence; (f) the length LEN of the spatial relation word (in words); (g) the distance DIS(e1, e2) between the two entities (in words). These statistics serve as an important basis for the subsequent calculation that extracts spatial relation words;

(13) Keyword extraction and instance seed set construction: based on the statistics obtained in step (12), the importance of each word's part of speech, position, and distance is taken into account, and the spatial relation words are obtained by calculating a relation word importance score;

(14) Relation word expansion: using the hierarchical structure of the synonym dictionary, the line containing each spatial relation word of the seed set is located and the 8th character of its semantic code is checked; if it is "=", the relation word in the seed set is taken as the unified descriptor and its synonyms and near-synonyms are taken as candidate words, thereby establishing a water conservancy spatial relation system and expanding the spatial relation words.
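The seven features of step (12) could be computed for one candidate word roughly as follows. This is a sketch under several assumptions that the source does not settle: distances are measured in token positions, LEN is taken as the character count of the token, LCCP is recorded only when the token immediately to the left is tagged as a conjunction ("c") or preposition ("p"), and all function and key names are hypothetical.

```python
def region(index, e1, e2):
    """Where a token sits relative to the two entity positions."""
    lo, hi = sorted((e1, e2))
    if index < lo:
        return "left"
    if index > hi:
        return "right"
    return "middle"


def candidate_features(tagged, k, e1, e2):
    """Features of the candidate at index k in a list of (word, tag) pairs."""
    word, pos = tagged[k]
    loc = region(k, e1, e2)
    # Assumption: tags "c" and "p" mark conjunctions and prepositions.
    left_is_conj_or_prep = k > 0 and tagged[k - 1][1] in {"c", "p"}
    return {
        "POS": pos,
        "LOC": loc,
        "LCCP": loc if left_is_conj_or_prep else None,
        "DIS1": abs(k - e1),
        "DIS2": len(tagged) - 1 - k,   # distance to the end of the sentence
        "LEN": len(word),              # assumption: counted in characters
        "DIS_E1_E2": abs(e2 - e1),
    }


# Toy tagged sentence, invented for illustration; entities at indices 0 and 2.
tagged_sentence = [("三峡水库", "n"), ("位于", "v"), ("长江", "n"), ("上游", "f")]
features = candidate_features(tagged_sentence, 1, 0, 2)
print(features)
```

Counting these dictionaries over many seed sentences gives the distributions that step (13) relies on.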

Preferably, in step (2), constructing the original syntax mode specifically includes the following steps:

(21) The seed tuple in the step (1) is used as input to obtain co-occurrence sentences in the corpus, and the sentences are preprocessed;

(22) Performing lexical analysis by using a natural language processing tool, performing syntactic analysis by using a Stanford CoreNLP tool to obtain a syntactic tree, and calculating the relative distance between two words according to the directed path length and the node depth of the syntactic structure tree;

(23) Content words in the word sequence, such as verbs, nouns, and adjectives, are retained, and words carrying little meaning, such as numerals and pronouns, are filtered out;

(24) Weights are calculated for the retained word sequence: the weight of each phrase is measured by the node distance, in the syntactic parse tree, between the phrase structure and the relation word in the sentence, and the semantic code of each word in the synonym dictionary is identified;

(25) The locations of the two entities and the relationship words in the sentence are identified and stored as syntactic patterns.
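Step (22) measures the distance between two words from path length and node depth in the parse tree. A minimal sketch of one such measure, the number of edges on the path through the lowest common ancestor, is shown below. The tree is hand-written as a child-to-parent map rather than produced by Stanford CoreNLP, and the exact distance formula used by the invention is not given in the source, so this is an assumed stand-in.

```python
def depth(node, parent):
    """Number of edges from node up to the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def tree_distance(a, b, parent):
    """Edges on the path between a and b via their lowest common ancestor."""
    da, db = depth(a, parent), depth(b, parent)
    steps = 0
    while da > db:          # lift the deeper node first
        a, da, steps = parent[a], da - 1, steps + 1
    while db > da:
        b, db, steps = parent[b], db - 1, steps + 1
    while a != b:           # then climb together until the paths meet
        a, b, steps = parent[a], parent[b], steps + 2
    return steps


# Toy constituency tree, invented for illustration (child -> parent).
parent = {
    "IP": None, "NP1": "IP", "VP": "IP", "NP2": "VP",
    "三峡水库": "NP1", "位于": "VP", "长江": "NP2", "上游": "NP2",
}
print(tree_distance("位于", "长江", parent))
print(tree_distance("三峡水库", "位于", parent))
```

A weight for step (24) could then be any decreasing function of this distance to the relation word; the source does not say which one is used.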

Preferably, in step (3), generalizing the syntactic patterns specifically includes the following steps:

(31) Syntactic pattern clustering: the similarity of two syntactic patterns is calculated only when the entities and the relation word have the same relative positions in both patterns, the entity types are the same, and the contexts share at least one effective word; otherwise the two patterns are directly regarded as dissimilar;

(32) Syntactic pattern generalization: clustering yields several clusters, each consisting of patterns with high mutual similarity; the patterns in each cluster are generalized into one abstract pattern by merging their word sequences into a single sequence and updating the values of $Pos_{e_1}$, $Pos_{e_2}$, and $Pos_r$.
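Steps (31) and (32) can be sketched as below. The gating conditions follow the text; everything else is an assumption made for illustration, since the source does not give the formulas: Jaccard overlap of context words as the similarity, a single-pass greedy clustering against each cluster's first member, keeping the first member's relation word, and floor-averaging the positions as a placeholder update rule.

```python
from typing import NamedTuple, Tuple


class Pattern(NamedTuple):
    e1: str
    e2: str
    r: str
    context: Tuple[str, ...]
    pos_e1: int
    pos_e2: int
    pos_r: int


def order(p):
    """Relative order of e1, e2 and r, e.g. ('e1', 'r', 'e2')."""
    return tuple(sorted(("e1", "e2", "r"), key=lambda n: getattr(p, "pos_" + n)))


def similarity(p, q):
    """0.0 unless the gating conditions of step (31) hold, else Jaccard overlap."""
    a, b = set(p.context), set(q.context)
    if order(p) != order(q) or (p.e1, p.e2) != (q.e1, q.e2) or not a & b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(patterns, threshold):
    """Greedy single pass: join the first cluster whose head is similar enough."""
    clusters = []
    for p in patterns:
        for members in clusters:
            if similarity(p, members[0]) >= threshold:
                members.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def generalize(members):
    """Merge word sequences (order-preserving union) and update positions."""
    merged = []
    for p in members:
        merged.extend(w for w in p.context if w not in merged)
    n = len(members)
    head = members[0]
    return Pattern(head.e1, head.e2, head.r, tuple(merged),
                   sum(p.pos_e1 for p in members) // n,
                   sum(p.pos_e2 for p in members) // n,
                   sum(p.pos_r for p in members) // n)


# Toy patterns, invented for illustration.
p1 = Pattern("reservoir", "river", "位于", ("位于", "上游"), 0, 2, 1)
p2 = Pattern("reservoir", "river", "位于", ("位于", "下游"), 0, 2, 1)
p3 = Pattern("station", "river", "位于", ("位于", "上游"), 0, 2, 1)
clusters = cluster([p1, p2, p3], threshold=0.3)
soft_pattern = generalize(clusters[0])
print(soft_pattern)
```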

Preferably, in the step (4), the extracting of the spatial relationship based on the generalized syntax mode specifically includes the following steps:

(41) Using the spatial relation word set, a set of co-occurrence sentences containing spatial relation words is acquired and preprocessed;

(42) The original syntactic patterns of the co-occurrence sentences are acquired with the syntactic pattern acquisition method described above, producing an original pattern set;

(43) Each original pattern is matched against every pattern in the generalized pattern set; when the entities and the spatial relation word appear in the same order, the entity types are the same, and the pattern similarity exceeds a threshold β, the corresponding spatial relation tuple is extracted according to the position information of the entities and the spatial relation word in the original pattern.
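The matching rule of step (43) might look like the sketch below. Patterns are plain dictionaries here, positions are assumed to be token indices, and the Jaccard overlap of context words again stands in for the unspecified similarity measure; the function names and the `beta` values are hypothetical.

```python
def word_order(p):
    """Relative order of the two entities and the relation word."""
    return tuple(sorted(("e1", "e2", "r"), key=lambda n: p["pos_" + n]))


def pattern_similarity(p, g):
    """Assumed measure: Jaccard overlap of the two context word sets."""
    a, b = set(p["context"]), set(g["context"])
    return len(a & b) / len(a | b) if a | b else 0.0


def extract_tuple(tokens, original, generalized_set, beta):
    """Return (entity 1, relation word, entity 2) if some soft pattern matches."""
    for g in generalized_set:
        if (word_order(original) == word_order(g)
                and (original["e1"], original["e2"]) == (g["e1"], g["e2"])
                and pattern_similarity(original, g) > beta):
            return (tokens[original["pos_e1"]],
                    tokens[original["pos_r"]],
                    tokens[original["pos_e2"]])
    return None


# Toy sentence and patterns, invented for illustration.
tokens = ["丹江口水库", "位于", "汉江", "上游"]
original = {"e1": "reservoir", "e2": "river", "context": ("位于", "上游"),
            "pos_e1": 0, "pos_e2": 2, "pos_r": 1}
generalized_set = [{"e1": "reservoir", "e2": "river",
                    "context": ("位于", "上游", "下游"),
                    "pos_e1": 0, "pos_e2": 2, "pos_r": 1}]
result = extract_tuple(tokens, original, generalized_set, beta=0.5)
print(result)
```

With these toy inputs the similarity is 2/3, so a stricter threshold such as 0.7 rejects the match.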

The beneficial effects of the invention are as follows: building on existing entity relation extraction technology, the invention addresses spatial relation extraction in the water conservancy field. Using a weakly supervised method, it automatically recognizes spatial relations, builds the spatial relation word set, acquires spatial relation syntactic patterns, and extracts spatial relation tuples, saving a great deal of manpower and time. It thereby extracts spatial-relation-oriented water conservancy data resources, converts free text in the water conservancy field into structured data, and supplements the knowledge graph with spatial relations at scale and with domain expertise, providing more accurate query services for users.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

Detailed Description

As shown in FIG. 1, a method for identifying and extracting water conservancy space relation words comprises the following steps:

(2) Constructing an original syntactic pattern $P = \langle e_1, e_2, r, C, Pos_{e_1}, Pos_{e_2}, Pos_r \rangle$, where $P$ denotes a syntactic pattern, $r$ denotes a spatial relation word, $C$ denotes the set $\{w_1, w_2, \ldots, w_n\}$ of words in the sentence other than the entities, and $e_1$ and $e_2$ are the type labels of the two water conservancy entities;

(4) Extracting a spatial relationship based on the generalized syntax mode;

First, the bootstrapping method is used to count the part-of-speech, position, and distance features of spatial relation words in the samples. The importance of these features is introduced into the spatial relation word extraction calculation, the word with the highest importance is taken as the spatial relation word of a water conservancy entity pair, and a seed set is thereby prepared for the subsequent syntactic-pattern-based spatial relation extraction.
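The source does not give the importance formula. As a purely illustrative stand-in, the sketch below scores a candidate by summing, over the chosen features, the relative frequency of its feature value among known relation words, and then picks the highest-scoring candidate. The function names, the equal weighting, and the sample data are all assumptions.

```python
from collections import Counter


def feature_distributions(samples, names):
    """Relative frequency of each feature value among known relation words."""
    dists = {}
    for name in names:
        counts = Counter(sample[name] for sample in samples)
        total = sum(counts.values())
        dists[name] = {value: count / total for value, count in counts.items()}
    return dists


def importance(features, dists):
    """Assumed score: sum of the observed frequencies of the feature values."""
    return sum(dists[name].get(features[name], 0.0) for name in dists)


def pick_relation_word(candidates, dists):
    """Return the candidate word with the highest importance score."""
    return max(candidates, key=lambda c: importance(c["features"], dists))["word"]


# Toy statistics and candidates, invented for illustration.
samples = [
    {"POS": "v", "LOC": "middle"},
    {"POS": "v", "LOC": "middle"},
    {"POS": "f", "LOC": "right"},
    {"POS": "v", "LOC": "right"},
]
dists = feature_distributions(samples, ["POS", "LOC"])
candidates = [
    {"word": "位于", "features": {"POS": "v", "LOC": "middle"}},
    {"word": "上游", "features": {"POS": "f", "LOC": "right"}},
]
best = pick_relation_word(candidates, dists)
print(best)
```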

Second, the spatial relation words in the seed set are expanded: their synonyms and near-synonyms are obtained from the synonym dictionary to build a synonym library, in which the spatial relation words of the seed set serve as unified descriptors and the remaining synonyms as candidate words. This facilitates subsequent spatial relation extraction and gives spatial relation queries a way to handle one meaning expressed by several words.
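The expansion step can be sketched as a scan over the synonym dictionary. The sketch assumes each dictionary line consists of an 8-character semantic code followed by space-separated words, with "=" as the 8th character marking a group of synonyms; the codes and word groups below are invented for illustration and are not real dictionary entries.

```python
def expand_relation_words(seed_words, dictionary_lines):
    """Map each seed word to the other words on its '=' (synonym) lines."""
    expansion = {}
    for line in dictionary_lines:
        code, *words = line.split()
        # Keep only lines whose 8th code character marks synonyms.
        if len(code) != 8 or code[7] != "=":
            continue
        for seed in seed_words:
            if seed in words:
                candidates = expansion.setdefault(seed, set())
                candidates.update(w for w in words if w != seed)
    return expansion


# Hypothetical dictionary lines; the codes are placeholders.
dictionary_lines = [
    "Xx01A01= 位于 坐落 处于",
    "Xx01A02# 上游 下游",
    "Xx01A03@ 汇入",
]
synonym_library = expand_relation_words(["位于", "上游"], dictionary_lines)
print(synonym_library)
```

The seed word acts as the unified descriptor (the dictionary key) and the set holds its candidate words; "上游" gets no entry because its line is not marked "=".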

Finally, taking the seed set as input, the seed co-occurrence sentences are preprocessed, the original syntactic patterns of the spatial relations are acquired, and the patterns are clustered and generalized to obtain soft patterns with a high level of abstraction. The candidate words of the spatial relation words in the synonym library are then used to retrieve co-occurrence sentences that may contain the relation; these sentences are preprocessed to obtain their syntactic patterns, which are compared with the soft patterns in the pattern library, and the corresponding spatial relation is extracted if the similarity conditions are met.

Claims

1. The method for identifying and extracting the water conservancy space relation words is characterized by comprising the following steps:

(2) Constructing an original syntactic pattern $P = \langle e_1, e_2, r, C, Pos_{e_1}, Pos_{e_2}, Pos_r \rangle$, where $P$ denotes a syntactic pattern, $r$ denotes a spatial relation word, $C$ denotes the set $\{w_1, w_2, \ldots, w_n\}$ of words in the sentence other than the entities, and $e_1$ and $e_2$ are the type labels of the two water conservancy entities;

(3) Generalizing the syntax modes, namely generalizing a plurality of original syntax modes expressing similar spatial relations into one mode, reducing the number of modes and improving the abstraction degree; the method specifically comprises the following steps:

(32) Syntactic pattern generalization: clustering yields a plurality of clusters, each consisting of patterns with high mutual similarity; the plurality of patterns in each cluster are generalized into one abstract pattern, the word sequences of the patterns are merged into a single sequence, and the values of $Pos_{e_1}$, $Pos_{e_2}$, and $Pos_r$ are updated;

2. The method for identifying and extracting water conservancy space relation words according to claim 1, wherein in the step (1), the acquisition of the space relation seed set based on quantitative statistical characteristics specifically comprises the following steps:

(11) Data preprocessing: performing word segmentation and part-of-speech tagging on the entity co-occurrence sentences to form a word set, and filtering out stop words;

(12) Feature selection and statistics: the distribution of spatial relation words in sentences is obtained by counting 7 features: (a) the part of speech POS; (b) the location LOC of the relation word relative to the water conservancy entities; (c) the position LCCP when a conjunction or preposition stands to the left of the spatial relation word; (d) the distance DIS1 from the spatial relation word to entity 1; (e) the distance DIS2 from the spatial relation word to the end of the sentence; (f) the length LEN of the spatial relation word; (g) the distance DIS(e1, e2) between the two entities; these statistics serve as an important basis for the subsequent calculation that extracts spatial relation words;

3. The method for identifying and extracting a water conservancy space relation word as claimed in claim 1, wherein in the step (2), the construction of the original syntax pattern comprises the following steps:

(22) Performing lexical analysis by using a natural language processing tool, performing syntactic analysis by using a Stanford CoreNLP tool to obtain a syntactic tree, and calculating the relative distance between two words according to the directed path length and the node depth of the syntactic tree;

(23) Retaining the effective words in the word sequence and filtering out words carrying little meaning;

(24) Weights are calculated for the retained word sequence: the weight of each phrase is measured by the node distance, in the syntax tree, between the phrase structure and the relation word in the sentence, and the semantic code of each word in the synonym dictionary is identified;

4. The method for identifying and extracting spatial relationship words according to claim 1, wherein in the step (4), the extraction of the spatial relationship based on the generalized syntax pattern specifically comprises the following steps: