CN111079436B

CN111079436B - Geological named entity extraction method and device

Info

Publication number: CN111079436B
Application number: CN201911322290.0A
Authority: CN
Inventors: 邓吉秋; 路馥毓; 刘文毅; 李晨菡; 何美香
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2021-09-21
Anticipated expiration: 2039-12-20
Also published as: CN111079436A

Abstract

The invention relates to a geological named entity extraction method comprising the following steps: acquiring a target text consisting of a plurality of characters or character strings; acquiring a first regular expression based on the target text and a preset first rule character, extracting a first character string from the target text, and replacing the first character string with a preset eighth rule character to obtain a second target text; judging, based on the second target text and a preset third rule character, whether the second target text contains the third rule character; if so, acquiring a second regular expression from the preset fourth rule character, second rule character, fifth rule character, sixth rule character and third rule character corresponding to the third rule character, and acquiring a second character string in the second target text; and acquiring the length of the second character string, and obtaining the geological named entity in the target text according to that length and the preset minimum length value corresponding to the third rule character.

Description

Geological named entity extraction method and device

Technical Field

The invention relates to the field of natural language processing, in particular to a method and a device for extracting a geological named entity.

Background

The current state of named entity recognition is as follows: good results are achieved only for limited text types (mainly news corpora) and limited entity categories (mainly names of people, places and organizations); compared with other information retrieval fields, the evaluation corpora for named entities are small, so overfitting occurs easily; named entity recognition tends to focus on high recall, whereas in information retrieval high precision is more important; and general systems that identify multiple types of named entities perform poorly.

General named entity extraction methods usually require a large corpus, but when a specific document is analyzed it is difficult to find a relevant background corpus of sufficient size. When rules are used to extract geological named entities, simple rules generally give poor results, probably because they cannot effectively account for the different levels of Chinese rules, the different ways in which basic words combine, and similar factors.

Disclosure of Invention

(I) Technical problem to be solved

In order to solve the problems that the named entity extraction in the prior art needs to depend on a large number of corpora and is low in extraction precision, the invention provides a method and a device for extracting a geological named entity.

(II) Technical scheme

In order to achieve the above object, the present invention provides a method for extracting a geological named entity, comprising:

A1, obtaining a target text consisting of a plurality of characters or character strings;

A2, acquiring, based on the target text and a preset first rule character, a first regular expression corresponding to the first rule character, and extracting from the target text a first character string that matches the first regular expression, to obtain a second target text; the second target text is the target text that no longer contains the first character string;

the first rule character is a word that appears before multiple classes of geological named entities but does not belong to the geological named entity;

A3, judging, based on the second target text and a preset third rule character, whether the second target text contains the third rule character;

wherein the third rule character is an ending word in a geological named entity;

A4, if so, acquiring a second regular expression from the preset fourth rule character, second rule character, fifth rule character, sixth rule character and third rule character that correspond to the third rule character, and using the second regular expression to acquire a second character string in the second target text that matches it;

the second rule character is: a word that precedes the ending word in all categories of geological named entities but does not belong to the geological named entity;

the fourth rule character is: a word that appears at any position before the ending word of a geological named entity but does not belong to the geological named entity;

the fifth rule character is: a word that is adjacent to the first word of a geological named entity but does not belong to the geological named entity;

the sixth rule character is: a word that is adjacent to the ending word of a geological named entity but does not belong to the geological named entity;

the seventh rule character is the type code of the geological named entity corresponding to the ending word;

A5, acquiring length information of the second character string, and obtaining the geological named entity in the target text according to the length information and the preset minimum length value corresponding to the third rule character.

Preferably, step A2 includes:

A2-1, obtaining the first regular expression from the preset first rule character;

A2-2, using the first regular expression to extract from the target text a first character string that matches it;

A2-3, replacing the first character string in the target text with a character string of the same length composed of eighth rule characters, to obtain the second target text;

the eighth rule character is a space.
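Steps A2-1 to A2-3 can be illustrated with a minimal Python sketch. It assumes Python's standard `re` module as the regular expression engine, which the patent does not specify; the function name `mask_matches` and the sample text and pattern are hypothetical. Replacing each match with spaces of the same length keeps every remaining character at its original offset.

```python
import re


def mask_matches(text: str, pattern: str) -> str:
    """Replace every match of `pattern` with the same number of spaces.

    Hypothetical helper: the space plays the role of the eighth rule
    character, and the result plays the role of the second target text.
    """
    return re.sub(pattern, lambda m: " " * len(m.group(0)), text)


# Hypothetical sample: "explain" stands in for a first rule character.
target_text = "explain F1 fault in the area"
second_target_text = mask_matches(target_text, r"explain")
```

Because the length of the text is unchanged, positions reported by later matching steps still refer to the original target text.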

Preferably, step A5 includes:

A5-1, obtaining the length value of the second character string;

A5-2, judging whether the length value of the second character string meets the preset minimum length value corresponding to the third rule character of the second character string;

if it does, obtaining a geological named entity in the target text, having an entity text character string and an entity type code, based on the second character string and the preset seventh rule character corresponding to the third rule character of the second character string;

the entity text character string of the geological named entity is the second character string;

and the type code of the geological named entity is the preset seventh rule character corresponding to the third rule character of the second character string.

Preferably, the second regular expression characters corresponding to the fourth rule character, the second rule character, the third rule character, the fifth rule character and the sixth rule character include: a second regular expression character having a first label character, and a second regular expression character having a second label character;

wherein the first label character indicates that the second regular expression corresponding to the second regular expression character carrying it takes the first form;

the first form of the second regular expression consists of the following, arranged in order: a fourth rule character, a second rule character, a fifth rule character, a third rule character and a sixth rule character;

wherein the second label character indicates that the second regular expression corresponding to the second regular expression character carrying it takes the second form;

the second form of the second regular expression is a preset regular expression form different from the first form.

The invention also provides a geological named entity extraction device that stores a first instruction;

the first instruction causes the geological named entity extraction device to perform the geological named entity extraction method described in any one of the above.

(III) Advantageous effects

The invention has the following beneficial effects: the geological named entity is extracted using the first regular expression, built from the first rule character, and the second regular expression, without requiring a large corpus, so that high-precision geological named entity extraction can be achieved and the dependence on a corpus of geological professional terms is reduced or eliminated.

Drawings

FIG. 1 is a flow chart of the geological named entity extraction method according to the present invention;

FIG. 2 is a schematic diagram of the geological named entity extraction method corresponding to FIG. 1 in an embodiment of the present invention.

Detailed Description

For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.

(1) Regarding the rule characters in the present embodiment

In this embodiment, the first rule character is set as a general forward boundary word, the second rule character is set as a general prefix boundary word, the third rule character is set as a tail word, the fourth rule character is set as a specific forward boundary word, the fifth rule character is set as a specific prefix boundary word, the sixth rule character is set as a specific suffix boundary word, the eighth rule character is set as a space, and the seventh rule character is set as a type code of a geological named entity corresponding to the tail word.

In this embodiment, the tail word, used as the third rule character, is an ending word common to geological named entities of the same kind. For example, the "set" in "Yuenu mountain set" appears at the end of the geological named entity as a stratigraphic division unit. The basic suffixes are defined in Table 1:

TABLE 1 basic suffix definitions

In Table 1, the entity category is a geological named entity category common in geological documents; the category code is a user-defined word segmentation part-of-speech tag corresponding to the geological named entity; and the tail words listed for a category are the tail words used to extract geological named entities of that category by multiple regular expression matching.

In this embodiment, the general forward boundary word, used as the first rule character, is a word that frequently occurs before multiple classes of geological named entities but is not part of them, such as "explain" in "explain Yuenu mountain group …" and "explain F1 fault …".

In this embodiment, the general prefix boundary word, used as the second rule character, is a word that can precede any tail word but is not part of the geological named entity, such as "see" in "in-zone see fault …".

The general boundary words are defined in Table 2:

TABLE 2 General boundary word definitions

The general boundary word definition table has the following characteristics:

A. The character strings with serial numbers 1 to 8 are general forward boundary word combinations, and the character string with serial number 9 is the general prefix boundary word combination.

B. The first character of the character strings with serial numbers 1 to 6 is ^, which marks the string as a combination of regular expressions: several character strings that each satisfy regular expression syntax are joined with commas, and ^ is added before the first of them. These strings delimit geological named entity boundaries by phrases.

C. The character strings with serial numbers 7 to 9 delimit geological named entity boundaries by single characters. The strings with serial numbers 7 and 8 satisfy regular expression syntax directly, and the string with serial number 9 satisfies it after its first character $ is removed.

D. The general boundary word definition table is stored as a database table named general_words; all general boundary words are predefined in Table 2 according to these rules.

In this embodiment, the specific forward boundary word, used as the fourth rule character, is a word that may appear anywhere before the tail word of a particular category but does not belong to the geological named entity. For example, the first "group" in "Yunyuan ruyang group Yunmengshan group" is not part of "Yunmengshan group", so it is a specific forward boundary word for the tail word "group".

In this embodiment, the specific prefix boundary word, used as the fifth rule character, is a word that appears in the position before the tail word of a particular category but does not belong to geological named entities of that category. For example, "county" in "Ji county group" generally appears only in geological named entities of the "group" and "group" categories and not in other stratigraphic unit entities, so "county" is a specific prefix boundary word for the other stratigraphic unit entities.

In this embodiment, the specific suffix boundary word, used as the sixth rule character, is a word that appears in the position after the tail word of a particular category but does not belong to geological named entities of that category. For example, "face" in "fault face with calcite filling" cannot be extracted together with "fault".

The geological named entity categories and their specific boundary words are defined in Table 3, which has the following characteristics:

A. ID is the geological named entity type number, Text is the tail word, and Class is the type code of the tail word.

B. When the first character of Rules is [, Rules is a specific forward boundary word, and the regular expression for extracting the geological named entity has the form: specific forward boundary word + general prefix boundary word + specific prefix boundary word + tail word + specific suffix boundary word. The specific forward boundary word is the Rules string in Table 3, the general prefix boundary word is the string with serial number 9 in Table 2, the specific prefix boundary word is the part of the Reserve string in Table 3 before the tail word, and the specific suffix boundary word is the part of the Reserve string after the tail word. For example, the Reserve of the geological named entity type with ID 102 is "the world of the border of China with province, city, county, bottom, south, west and north (? [ ^ to) facial line ])".

TABLE 3 Geological named entity categories and specific boundary word definitions

C. When the first character of Rules is $, the remainder of Rules is itself the regular expression for extracting the geological named entity.

D. Mini is the minimum length required of a geological named entity; an extracted character string shorter than Mini is not treated as a geological named entity.

The geological named entity category and specific boundary word definition table is stored as a database table named word_types; all geological named entity types and specific boundary words are predefined in Table 3 according to these rules.

(2) The steps of extracting the geological named entities in the present embodiment are shown in fig. 1 and fig. 2.

A1, see fig. 1 and 2, a target text composed of a plurality of characters or character strings is obtained.

For example, in the specific application of the present embodiment, step A1 may include the following steps (2-1), (2-1-1), (2-1-2), (2-1-3) and (2-2):

(2-1) Enter system initialization and define a text and rule matching function re_text. Its input parameters are the text text and the regular expression rule, and its output is the list of character strings in text that match rule. The function is implemented by steps 2-1-1) to 2-1-3); then go to 2-2).

(2-1-1) Obtain the input parameters text and rule, initialize the output parameter re_words as an empty list, and go to 2-1-2).

(2-1-2) Judge whether text contains a character string that matches the regular expression rule. If so, obtain each matching character string and its starting position in text; each matching character string S and its starting position L form a tuple [S, L], and each tuple is added to re_words; then go to 2-1-3). If no character string matches rule, go to 2-1-3).

(2-1-3) Output re_words as the return value of the function.
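The function described in steps (2-1) to (2-1-3) could look like the following minimal sketch. It assumes Python's standard `re` module and 0-based starting positions; the patent names the function `re_text` and the list `re_words` but does not specify the language, the regex engine, or the indexing convention.

```python
import re


def re_text(text: str, rule: str) -> list:
    """Return [S, L] pairs for every substring of `text` matching `rule`.

    S is the matched string and L is its starting position in `text`
    (0-based here, which is an assumption of this sketch).
    """
    re_words = []  # (2-1-1): the output starts as an empty list
    for match in re.finditer(rule, text):  # (2-1-2): collect every match
        re_words.append([match.group(0), match.start()])
    return re_words  # (2-1-3): return the collected pairs
```

With a hypothetical input, `re_text("F1 fault and F2 fault", r"F\d")` yields one pair per match, and an input with no match yields an empty list, which is the case the later steps test for.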

(2-2) Acquire the target text, which in this embodiment is the geological text character string geo_text, and initialize the geological named entity list entry_list as an empty list.

A2, referring to fig. 1 and fig. 2: based on the target text, the preset first rule character and the preset second rule character, a first regular expression corresponding to the first rule character is obtained, and a first character string that matches the first regular expression is extracted from the target text to obtain a second target text; the second target text is the target text that no longer contains the first character string.

In this embodiment, the first rule character is a general forward boundary word, and the second rule character is a general prefix boundary word.

For example, in the specific application of this embodiment, step A2 includes the following steps (2-3), (2-3-1), (2-3-2), (2-3-3) and (2-3-4).

(2-3) Initialize the general forward boundary word list pre_words and the general prefix boundary word string prefix_words as empty strings, take the first record of the general boundary word definition table general_words, and process it with steps 2-3-1) to 2-3-4) until all records in general_words have been processed.

(2-3-1) Read the word field of the current record, assign it to the current general forward boundary word string g_words, and go to 2-3-2).

(2-3-2) Obtain the first character of g_words. If it is [ or (, go to 2-3-3). If it is $, delete the first character from g_words, append the remaining string to the general prefix boundary word string prefix_words, and go to 2-3-4). If it is ^, delete the leading ^ from g_words, replace each comma in g_words with )|(, insert a left parenthesis ( before the first character of g_words and a right parenthesis ) after its last character, and go to 2-3-3).

(2-3-3) Call the text and rule matching function re_text to obtain the output value re_words. If re_words is not an empty list, take its first element as the current element and go to 2-3-3-1); if re_words is an empty list, go to 2-3-4).

(2-3-3-1) Obtain the current element value [S, L], compute the length len of the string S (the first character string), replace all characters of geo_text from position L to position L + len, counted from the left, with spaces, and go to 2-3-3-2).

(2-3-3-2) If the current element is not the last element of re_words, read the next element of re_words as the current element and go to 2-3-3-1); if it is the last element of re_words, go to 2-3-4).

(2-3-4) If g_words is not the last record of general_words, read the next record of general_words as the current record and go to 2-3-1); if g_words is the last record of general_words, go to 2-4).
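The loop in steps (2-3) to (2-3-4) can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: it uses Python's `re` module, takes the records of general_words as a plain list of strings instead of a database table, and uses hypothetical sample records because the contents of Table 2 are not reproduced here. The function name `strip_general_forward_words` is also hypothetical.

```python
import re


def strip_general_forward_words(geo_text: str, general_words: list):
    """Mask general forward boundary words and collect prefix boundary words.

    Returns (second target text, prefix_words).
    """
    prefix_words = ""
    for g_words in general_words:
        first = g_words[:1]
        if first == "$":
            # (2-3-2): a "$" record is a general prefix boundary word string
            prefix_words += g_words[1:]
            continue
        if first == "^":
            # (2-3-2): "^a,b" becomes the alternation "(a)|(b)"
            g_words = "(" + g_words[1:].replace(",", ")|(") + ")"
        # Assumption: any other record (e.g. starting with [ or () is
        # already a valid regular expression and is used as-is.
        for match in list(re.finditer(g_words, geo_text)):
            s, l = match.group(0), match.start()
            # (2-3-3-1): overwrite the match with spaces of equal length,
            # so the offsets of the remaining matches stay valid
            geo_text = geo_text[:l] + " " * len(s) + geo_text[l + len(s):]
    return geo_text, prefix_words


# Hypothetical records: one "^" phrase list and one "$" prefix string.
sample = "explain F1 fault, describe F2 fault"
masked, prefix = strip_general_forward_words(sample, ["^explain,describe", "$[see]"])
```

Collecting the matches before modifying the text is safe here only because every replacement has the same length as the string it replaces.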

A3, referring to fig. 1 and fig. 2: judging, based on the second target text and the preset third rule character, whether the second target text contains the third rule character;

wherein the third rule character is a tail word appearing in a geological named entity.

For example, in this embodiment, step A3 may include the following steps (2-4), (2-4-1) and (2-4-2):

(2-4) Acquire the geological named entity category and specific boundary word definition table word_types, take the first record of word_types as the current record w_type, and go to 2-4-1).

(2-4-1) Read the field values of the current record w_type (see Table 3), assign them to the variables ID, text, class, rule, reserve and mini respectively, and go to 2-4-2).

(2-4-2) Judge whether the current geo_text, which at this point is the second target text, contains the tail word text. If it does, go to 2-4-3); if it does not, go to 2-4-5).

A4, referring to fig. 1 and fig. 2: in this embodiment, when the second target text contains the third rule character, a second regular expression is obtained from the preset fourth rule character, second rule character, fifth rule character, sixth rule character and third rule character corresponding to the third rule character, and the second regular expression is used to acquire a second character string in the second target text that matches it.

In this embodiment, the fourth rule character is a specific forward boundary word, the fifth rule character is a specific prefix boundary word, the sixth rule character is a specific suffix boundary word, the eighth rule character is a space, and the seventh rule character is the type code of the geological named entity corresponding to the tail word.

For example, in a specific application of this embodiment, step A4 may include the following steps (2-4-3) and (2-4-4):

(2-4-3) Initialize the geological named entity extraction regular expression entry_rule as an empty string and read the first character of rule. If the first character of rule is $, delete it and assign the remainder of rule to entry_rule; if the first character of rule is not $, append the strings rule, prefix_words and reserve to entry_rule in that order. Go to 2-4-4).

(2-4-4) Call the text and rule matching function re_text to obtain the output value re_words. If re_words is not an empty list, take its first element as the current element and go to 2-4-4-1); if re_words is an empty list, go to 2-4-5).
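Step (2-4-3) can be sketched as below, assuming Python and its `re` module. The function name `build_entry_rule` and all sample values are hypothetical, since Table 3 is not reproduced here. The sample also assumes that rule opens a negated character class, prefix_words adds characters to it, and reserve closes it before the tail word and a trailing lookahead. That reading is consistent with Rules starting with [ in the first form, but it is an interpretation, not something the text states outright.

```python
import re


def build_entry_rule(rule: str, prefix_words: str, reserve: str) -> str:
    """Build the entity extraction regex as described in step (2-4-3)."""
    if rule.startswith("$"):
        # Second form: the rule, minus its "$" marker, is the whole regex
        return rule[1:]
    # First form: concatenate rule, prefix_words and reserve in order
    return rule + prefix_words + reserve


# Hypothetical values: excluded characters are space, comma and period;
# "fault" is the tail word; "(?!ed)" is the specific suffix boundary.
entry_rule = build_entry_rule("[^ ", ",", ".]+fault(?!ed)")
matches = [m.group(0) for m in re.finditer(entry_rule, "see F1fault here")]
```

In this toy case the assembled expression is `[^ ,.]+fault(?!ed)`, so the match cannot extend across any excluded boundary character and is rejected if the tail word is followed by the excluded suffix.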

A5, obtaining length information of the second character string, and obtaining the geological named entity in the target text according to the length information and the preset minimum length value corresponding to the third rule character.

For example, in a specific application of this embodiment, step A5 may include the following steps:

(2-4-4-1) Obtain the current element value [S, L] and compute the length len of the string S (the second character string). If len is greater than or equal to mini, go to 2-4-4-2); if len is less than mini, go to 2-4-4-3).

(2-4-4-2) Append class to the end of the current element, add the resulting element [S, L, class] to entry_list, and go to 2-4-4-3).

(2-4-4-3) If the current element is not the last element of re_words, read the next element of re_words as the current element and go to 2-4-4-1); if it is the last element of re_words, go to 2-4-5).

(2-4-5) If w_type is not the last record of word_types, read the next record as the current record w_type and go to 2-4-1); if w_type is the last record of word_types, go to 2-5).

(2-5) Output the geological named entity list entry_list.
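The loop over word_types in steps (2-4) to (2-5) can be summarized by the following sketch. It assumes Python, a list of in-memory records instead of a database table, and an extraction regex that has already been assembled for each record. The `WordType` class, its field names, the type codes and the sample records are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class WordType:
    text: str        # tail word (Text)
    cls: str         # type code (Class)
    entry_rule: str  # extraction regex, assumed to be already assembled
    mini: int        # minimum entity length (Mini)


def extract_entities(geo_text: str, word_types: list) -> list:
    """Return [S, L, class] for every match that passes the length check."""
    entry_list = []
    for w_type in word_types:
        if w_type.text not in geo_text:  # A3: tail word must be present
            continue
        for match in re.finditer(w_type.entry_rule, geo_text):  # A4
            s, l = match.group(0), match.start()
            if len(s) >= w_type.mini:  # A5: drop strings shorter than mini
                entry_list.append([s, l, w_type.cls])
    return entry_list


# Hypothetical records and text, not taken from Table 3.
types = [
    WordType("fault", "geo_fault", r"[^ ,.]*fault", 6),
    WordType("group", "geo_strat", r"[^ ,.]+group", 3),
]
entities = extract_entities("   F1fault and fault", types)
```

In this sample the bare word "fault" matches the expression but is shorter than the minimum length, so only "F1fault" is kept, and the second record is skipped because its tail word does not occur in the text.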

According to the method, the geological named entity can be extracted with high precision without a large amount of corpora, and the dependency on a geological professional term corpus is reduced or avoided.

The technical principles of the present invention have been described above in connection with specific embodiments, which are intended to explain the principles of the present invention and should not be construed as limiting the scope of the present invention in any way. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive efforts, which shall fall within the scope of the present invention.

Claims

1. A method for extracting a geological named entity is characterized by comprising the following steps:

A2, acquiring, based on the target text and a preset first rule character, a first regular expression corresponding to the first rule character, and extracting from the target text a first character string that matches the first regular expression, to obtain a second target text; the second target text is the target text that no longer contains the first character string;

A4, if so, acquiring a second regular expression from the preset fourth rule character, second rule character, fifth rule character, sixth rule character, seventh rule character and third rule character that correspond to the third rule character, and using the second regular expression to acquire a second character string in the second target text that matches it;

A5, obtaining length information of the second character string, and obtaining a geological named entity in the target text according to the length information and a preset minimum length value corresponding to the third rule character;

the geological named entity in the target text has an entity text character string and an entity type code;

the step A2 includes:

A2-2, using the first regular expression corresponding to the first rule character to extract from the target text a first character string that matches it;

the eighth rule character is a space;

the step A5 includes:

A5-1, obtaining the length value of the second character string;

A5-2, judging whether the length value of the second character string meets the preset minimum length value corresponding to the third rule character of the second character string;

and the type code of the geological named entity is a preset seventh rule character corresponding to a third rule character corresponding to the second character string.

2. The method of claim 1, wherein the second regular expression character corresponding to a fourth rule character, a second rule character, a third rule character, a fifth rule character, a sixth rule character, and a seventh rule character comprises: a second regular expression character having a first label character and a second regular expression character having a second label character;

the first form of the second regular expression is arranged in order: a fourth rule character, a second rule character, a fifth rule character, a third rule character, a sixth rule character and a seventh rule character;

3. A geological named entity extraction device, wherein the geological named entity extraction device stores a first instruction;

the first instructions cause a named entity extraction apparatus to perform the named entity extraction method of any of claims 1-2.