CN111160030B

CN111160030B - Information extraction method, device and storage medium

Info

Publication number: CN111160030B
Application number: CN201911262829.8A
Authority: CN
Inventors: 付骁弈; 张�杰; 吴信东
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2023-09-19
Anticipated expiration: 2039-12-11
Also published as: CN111160030A

Abstract

An information extraction method comprises: performing word segmentation on a target text; performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text; and extracting triples of entity relationships in the target text according to the dependency tree. With the application, entity extraction can be carried out using general-purpose part-of-speech tagging and dependency relationship identification techniques, which saves annotation cost and enhances the robustness of the system model.

Description

Information extraction method, device and storage medium

Technical Field

The present application relates to computer technology, and in particular, to a method and apparatus for extracting information, and a storage medium.

Background

Information Extraction (IE) is the process of automatically converting unstructured information embedded in text into structured data. Information extraction is widely applied in internet products and enterprise services. For example, when building a search or recommendation engine, information must first be extracted from the text content of web pages or recommended items. The extraction results can be used to de-duplicate documents and to build search indexes and recommendation features more accurately, which saves storage cost and improves search and recommendation quality.

Existing methods model the entity extraction stage with named entity recognition, which requires corpora in the specific domain to be annotated in advance; this stage is time-consuming and labor-intensive and limits the range of application of information extraction technology. In addition, the extraction results of such methods cannot restore concepts that are not mentioned in the original text and lack the guidance of domain knowledge, so the extracted triples are sparse and insufficiently consistent in semantics.

Disclosure of Invention

The application provides an information extraction method, an information extraction device and a storage medium, which can achieve the purposes of saving marking cost and enhancing the robustness of a system model.

The application provides an information extraction method, which comprises: performing word segmentation on a target text; performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text; and extracting triples of entity relationships in the target text according to the dependency tree, which comprises performing the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet.

In an exemplary embodiment, before the word segmentation is performed on the target text, the method further includes: removing special characters from the target text.

In an exemplary embodiment, before extracting the triples of entity relationships in the target text according to the dependency tree, the method further includes: performing chunk merging on the noun word segments in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree. Performing chunk merging using the predetermined rules includes at least one of: merging two or more consecutive proper nouns into a chunk; merging a proper noun with a non-proper noun that immediately follows it; and merging noun word segments that are separated by one punctuation mark or conjunction.

In an exemplary embodiment, performing chunk merging on the noun word segments in the obtained dependency tree using the predetermined rules to obtain the chunk-merged dependency tree further includes: treating each word obtained by chunk merging as a noun word segment and repeating the chunk merging according to the predetermined rules until no mergeable noun word segments remain, thereby obtaining the final chunk-merged dependency tree.

In an exemplary embodiment, before extracting the triples of entity relationships in the target text according to the dependency tree, the method further includes: using a coreference resolution model, replacing words of a specified first type in the chunk-merged dependency tree with the nouns to which the coreference resolution model resolves them, and updating the dependency tree. The specified first type of word includes at least one of: pronouns, names, and abbreviations.

In an exemplary embodiment, extracting the triples of entity relationships in the target text according to the dependency tree includes: when, for a verb, the subject or the object of the triplet having that verb as its predicate cannot be extracted from the dependency tree, searching a predetermined domain knowledge base using the noun word segments associated with the verb in the dependency tree to determine the subject or the object.

In an exemplary embodiment, extracting the triples of entity relationships according to the dependency tree further includes: removing words of a specified second type from the extracted relation triples and outputting the triples. The words of the specified second type include one or more of the following: stop words and definite articles.

The application also provides a device for directing and delivering the content, which comprises a processor and a memory, wherein the memory stores a program for directing and delivering the content; the processor is configured to read the program for directing delivery of content, and perform the method of any of the embodiments.

The application also provides an information extraction device, which comprises: a word segmentation and analysis module, configured to perform word segmentation on a target text, perform part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment, and perform dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text; and an extraction module, configured to extract triples of entity relationships in the target text according to the dependency tree, which means that the extraction module performs the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet.

The application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the embodiments.

Compared with the related art, the application segments the target text, tags the part of speech of each word segment, and extracts triples based on the tagging results, so that entity extraction can be performed using general-purpose part-of-speech tagging and dependency relationship identification techniques. This saves annotation cost and enhances the robustness of the system model.

When the subject or the object of a triplet having a verb as its predicate cannot be extracted from the dependency tree, the application searches a predetermined domain knowledge base using the noun word segments associated with the verb in the dependency tree to determine the subject or the object, which makes the relation extraction more accurate.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. Other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide an understanding of the principles of the application, and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain, without limitation, the principles of the application.

FIG. 1 is a flow chart of an information extraction method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a target text word segmentation and labeling result according to a first embodiment of the present application;

FIG. 3 is a block diagram of an information extraction device according to an embodiment of the present application.

Detailed Description

The present application has been described in terms of several embodiments, but the description is illustrative and not restrictive, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the described embodiments. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or in place of any other feature or element of any other embodiment unless specifically limited.

The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The disclosed embodiments, features and elements of the present application may also be combined with any conventional features or elements to form a unique inventive arrangement as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive arrangements to form another unique inventive arrangement as defined in the claims. It is therefore to be understood that any of the features shown and/or discussed in the present application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Further, various modifications and changes may be made within the scope of the appended claims.

Furthermore, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.

The technical scheme of the application will be described in more detail below with reference to the accompanying drawings and examples.

As shown in FIG. 1, an information extraction method according to an embodiment of the present application includes the following steps:

S101, performing word segmentation on a target text, and performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment;

S102, performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text;

S103, extracting triples of entity relationships in the target text according to the dependency tree, which includes performing the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet.

Illustratively, word segmentation, part-of-speech analysis and dependency parsing are performed on the target text using natural language processing software, including but not limited to Stanford CoreNLP and the Harbin Institute of Technology Language Technology Platform (LTP).

In one exemplary embodiment, as shown in FIG. 2, the target text "Bob hit Mary, she was then sent to the hospital." is taken as an example. Word segmentation with the Stanford CoreNLP annotation system yields 10 word segments, including "Bob", "hit", "Mary", "she", "then", "sent" and "hospital" as well as the punctuation marks, as shown in the third column of FIG. 2. Part-of-speech analysis is then performed, and the result is shown in the fifth column of FIG. 2, where the part-of-speech tags of the word segments correspond respectively to 'NR', 'VV', 'NR', 'PU', 'PN', 'AD', 'SB', 'NN', 'PU'. Here NR represents a proper noun; NN represents other nouns; VV represents a verb; and PU represents punctuation. These tags are abbreviations commonly used in the field and are not described in detail here.

In one exemplary embodiment, as shown in FIG. 2, the verbs in the sentence "Bob hit Mary, she was then sent to the hospital." include "hit" and "sent". The subject that has a dependency relationship with the verb "hit" is "Bob" and the object is "Mary", so the first extracted triplet is (Bob, hit, Mary); similarly, the second extracted triplet is (Mary, sent to, hospital).
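The per-verb extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Token` class, the `extract_triples` function, the English stand-in tokens, and the hand-written parse are all assumptions made for the example, and only the direct dependents of each verb are inspected rather than a full subtree traversal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    text: str
    pos: str     # part-of-speech tag, e.g. NR, VV, NN
    head: int    # index of the governing token (0 for the root)
    deprel: str  # dependency label relative to the head


# Assumption: pronouns (PN) are treated as noun-like until coreference is resolved.
NOUN_TAGS = {"NR", "NN", "PN"}

Triple = Tuple[Optional[str], str, Optional[str]]


def extract_triples(tokens: List[Token]) -> List[Triple]:
    """Build one (subject, predicate, object) triple per verb."""
    triples = []
    for verb in tokens:
        if verb.pos != "VV":
            continue
        subject = obj = None
        # Inspect the noun-like tokens attached directly to this verb.
        for child in tokens:
            if child.head != verb.index or child.pos not in NOUN_TAGS:
                continue
            if child.deprel == "nsubj" and subject is None:
                subject = child.text
            elif child.deprel == "dobj" and obj is None:
                obj = child.text
        triples.append((subject, verb.text, obj))
    return triples


# Hand-written parse for illustration; a real parser may label it differently.
sentence = [
    Token(1, "Bob", "NR", 2, "nsubj"),
    Token(2, "hit", "VV", 0, "ROOT"),
    Token(3, "Mary", "NR", 2, "dobj"),
    Token(4, ",", "PU", 2, "punct"),
    Token(5, "she", "PN", 8, "nsubj"),
    Token(6, "then", "AD", 8, "advmod"),
    Token(7, "was", "SB", 8, "auxpass"),
    Token(8, "sent", "VV", 2, "conj"),
    Token(9, "hospital", "NN", 8, "dobj"),
    Token(10, ".", "PU", 2, "punct"),
]

triples = extract_triples(sentence)
print(triples)  # [('Bob', 'hit', 'Mary'), ('she', 'sent', 'hospital')]
```

In this sketch the second triple still carries the pronoun "she"; replacing it with "Mary" is the job of the coreference resolution step described later.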

In an exemplary embodiment, before the word segmentation of the target text, the method further includes step S104: removing special characters from the target text.

For example, a special-character library may be created, and any character in the target text that matches an entry in the library, such as @ or #, is removed.
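A minimal sketch of such a cleaning step follows. The contents of `SPECIAL_CHARACTERS` and the function name are hypothetical; the source names only "@" and "#" as examples.

```python
import re

# Hypothetical special-character library; the source names only "@" and "#".
SPECIAL_CHARACTERS = {"@", "#"}

# Build one character class from the library, escaping regex metacharacters.
_PATTERN = re.compile("[" + re.escape("".join(sorted(SPECIAL_CHARACTERS))) + "]")


def remove_special_characters(text: str) -> str:
    """Delete every character that appears in the special-character library."""
    return _PATTERN.sub("", text)


cleaned = remove_special_characters("Bob hit @Mary, #she was then sent to the hospital.")
print(cleaned)  # Bob hit Mary, she was then sent to the hospital.
```

Keeping the library as a plain set makes it easy to extend for a given domain without touching the matching logic.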

In an exemplary embodiment, before extracting the triples of the entity relationships in the target text according to the dependency tree, the method further includes:

Step S105, performing chunk merging on the noun word segments in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree;

In an exemplary embodiment, in step S105, performing chunk merging using the predetermined rules includes at least one of the following:

A. merging two or more consecutive proper nouns into a chunk;

B. merging a proper noun with a non-proper noun that immediately follows it;

C. merging noun word segments that are separated by one punctuation mark or conjunction.

In an exemplary embodiment, in step S105, performing chunk merging on the noun word segments in the obtained dependency tree using the predetermined rules to obtain the chunk-merged dependency tree further includes:

treating each word obtained by chunk merging as a noun word segment and repeating the chunk merging according to the predetermined rules until no mergeable noun word segments remain, thereby obtaining the final chunk-merged dependency tree.

For example, the phrase "artificial intelligence, big data and internet of things technology" is composed of three proper nouns ("artificial intelligence", "big data" and "internet of things"), one punctuation mark, one conjunction, and one other noun ("technology"). This step forms the final chunk by iterative merging: "internet of things technology" -> "big data and internet of things technology" -> "artificial intelligence, big data and internet of things technology". Taking the sentence "Bob hit Mary, she was then sent to the hospital." as an example again, the analysis result is shown in the sixth column of FIG. 2, where nsubj represents a nominal subject; ROOT represents the root of the sentence to be processed; dobj represents a direct object; advmod represents an adverbial modifier; auxpass represents a passive auxiliary; conj represents a conjunct linking two coordinated words; and punct represents punctuation.
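The iterative merging can be sketched as a fixed-point loop over rules A, B and C. The sketch below is a simplification under stated assumptions: it works on a flat list of (text, tag) pairs rather than on a dependency tree, it assumes the tag "CC" marks a conjunction, and the function names are hypothetical. Scanning from right to left reproduces the merge order of the example above.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)

NOUNS = {"NR", "NN"}       # NR: proper noun, NN: other noun
SEPARATORS = {"PU", "CC"}  # assumption: CC tags a conjunction


def merge_once(tokens: List[Token]) -> List[Token]:
    """Apply the first matching rule, scanning from the right; else return tokens."""
    for i in reversed(range(len(tokens) - 1)):
        (a, a_pos), (b, b_pos) = tokens[i], tokens[i + 1]
        # Rule A: consecutive proper nouns. Rule B: proper noun + other noun.
        if a_pos == "NR" and b_pos in NOUNS:
            return tokens[:i] + [(f"{a} {b}", "NR")] + tokens[i + 2:]
        # Rule C: two nouns separated by one punctuation mark or conjunction.
        if i + 2 < len(tokens):
            c, c_pos = tokens[i + 2]
            if a_pos in NOUNS and b_pos in SEPARATORS and c_pos in NOUNS:
                joiner = "" if b_pos == "PU" else " "
                tag = "NR" if "NR" in (a_pos, c_pos) else "NN"
                return tokens[:i] + [(f"{a}{joiner}{b} {c}", tag)] + tokens[i + 3:]
    return tokens


def merge_chunks(tokens: List[Token]) -> List[Token]:
    """Repeat merging until no rule applies any more."""
    while True:
        merged = merge_once(tokens)
        if merged == tokens:
            return tokens
        tokens = merged


phrase = [
    ("artificial intelligence", "NR"), (",", "PU"), ("big data", "NR"),
    ("and", "CC"), ("internet of things", "NR"), ("technology", "NN"),
]
chunks = merge_chunks(phrase)
print(chunks)
```

Each pass shortens the token list by at least one element, so the loop is guaranteed to terminate.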

Step S106, replacing words of a specified first type in the chunk-merged dependency tree with the nouns to which a coreference resolution model resolves them, and updating the dependency tree;

In an exemplary embodiment, the specified first type of word includes at least one of: pronouns, names, and abbreviations.

In one exemplary embodiment, a natural language processing software package, including but not limited to Stanford CoreNLP, is invoked, and the pronouns in the text to be analyzed are replaced with the analysis results of the coreference resolution model. For example, in "Bob hit Mary, she was then sent to the hospital.", "she" needs to be replaced with "Mary".
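The replacement step itself is simple once a coreference model has produced its links. The sketch below assumes the model output is available as a mapping from mention position to antecedent position; that mapping format, `coref_links`, and `resolve_mentions` are hypothetical and do not reflect the output format of any particular toolkit.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)


def resolve_mentions(tokens: List[Token], antecedents: Dict[int, int]) -> List[Token]:
    """Return a copy in which each mention is replaced by its antecedent noun."""
    resolved = list(tokens)
    for mention, antecedent in antecedents.items():
        resolved[mention] = tokens[antecedent]
    return resolved


tokens = [
    ("Bob", "NR"), ("hit", "VV"), ("Mary", "NR"), (",", "PU"), ("she", "PN"),
    ("then", "AD"), ("was", "SB"), ("sent", "VV"), ("hospital", "NN"), (".", "PU"),
]

# Hypothetical model output: token 4 ("she") refers to token 2 ("Mary").
coref_links = {4: 2}

resolved = resolve_mentions(tokens, coref_links)
print([text for text, _ in resolved])
```

Because the replacement copies both the text and the tag, the resolved token is treated as a proper noun by any later extraction step.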

In an exemplary embodiment, the extracting the triples of entity relationships in the target text according to the dependency tree includes:

when, for a verb, the subject or the object of the triplet having that verb as its predicate cannot be extracted from the dependency tree, a predetermined domain knowledge base is searched using the noun word segments associated with the verb in the dependency tree to determine the subject or the object.
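The source does not specify how the domain knowledge base is structured or matched, so the following is only one possible sketch. It assumes the knowledge base is a list of known (subject, predicate, object) facts and that a fact matches when its predicate equals the verb and it mentions at least one noun associated with the verb; `KNOWLEDGE_BASE`, its entries, and `complete_triple` are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[Optional[str], str, Optional[str]]

# Hypothetical domain knowledge base of known (subject, predicate, object) facts.
KNOWLEDGE_BASE: List[Tuple[str, str, str]] = [
    ("ambulance", "sent", "hospital"),
    ("doctor", "treated", "patient"),
]


def complete_triple(triple: Triple, related_nouns: Iterable[str]) -> Triple:
    """Fill a missing subject or object from the first matching fact."""
    subject, predicate, obj = triple
    if subject is not None and obj is not None:
        return triple  # nothing is missing
    nouns = set(related_nouns)
    for kb_subject, kb_predicate, kb_object in KNOWLEDGE_BASE:
        # Assumed matching rule: same predicate and at least one shared noun.
        if kb_predicate == predicate and nouns & {kb_subject, kb_object}:
            return (subject or kb_subject, predicate, obj or kb_object)
    return triple  # no match: leave the gap in place


completed = complete_triple((None, "treated", "patient"), ["patient"])
print(completed)  # ('doctor', 'treated', 'patient')
```

A triple with no matching fact is returned unchanged, so the fallback never invents an argument that the knowledge base does not support.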

In an exemplary embodiment, extracting the triples of entity relationships according to the dependency tree further includes: step S107, removing words of a specified second type from the extracted relation triples and outputting the triples. The words of the specified second type include one or more of the following: stop words and definite articles. This makes the extracted relation triples more accurate.
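A minimal sketch of this filtering step is shown below. The `STOP_WORDS` list is hypothetical, and splitting on whitespace is an English stand-in; for Chinese text the word segments produced earlier would be filtered instead.

```python
from typing import Optional, Tuple

Triple = Tuple[Optional[str], str, Optional[str]]

# Hypothetical list of stop words and articles to strip from triples.
STOP_WORDS = {"the", "a", "an"}


def strip_stop_words(phrase: Optional[str]) -> Optional[str]:
    """Drop stop words from a phrase, keeping it unchanged if nothing would remain."""
    if phrase is None:
        return None
    kept = [word for word in phrase.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept) if kept else phrase


def clean_triple(triple: Triple) -> Triple:
    subject, predicate, obj = triple
    return (strip_stop_words(subject), strip_stop_words(predicate) or predicate,
            strip_stop_words(obj))


cleaned_triple = clean_triple(("the doctor", "treated", "a patient"))
print(cleaned_triple)  # ('doctor', 'treated', 'patient')
```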

As shown in FIG. 3, an information extraction device according to an embodiment of the present application includes the following modules:

the word segmentation and analysis module 10 is configured to perform word segmentation on the target text; perform part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; and perform dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text;

the extraction module 20 is configured to extract triples of entity relationships in the target text according to the dependency tree, which means that the extraction module 20 performs the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet.

In an exemplary embodiment, the apparatus further comprises:

the chunk merging module 30 is configured to perform chunk merging on the noun word segments in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree;

In one exemplary embodiment, the chunk merging performed by the chunk merging module 30 using the predetermined rules includes at least one of:

A. the chunk merging module 30 merges two or more consecutive proper nouns into a chunk;

B. the chunk merging module 30 merges a proper noun with a non-proper noun that immediately follows it;

C. the chunk merging module 30 merges noun word segments that are separated by one punctuation mark or conjunction.

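In an exemplary embodiment, the chunk merging module 30 is configured to perform chunk merging on the noun word segments in the obtained dependency tree using the predetermined rules to obtain the chunk-merged dependency tree, and is further configured to: treat each word obtained by chunk merging as a noun word segment and repeat the chunk merging according to the predetermined rules until no mergeable noun word segments remain, thereby obtaining the final chunk-merged dependency tree.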

In an exemplary embodiment, the extraction module 20 is further configured to, before extracting the triples of entity relationships in the target text according to the dependency tree:

use a coreference resolution model to replace words of a specified first type in the chunk-merged dependency tree with the nouns to which the coreference resolution model resolves them, and update the dependency tree.

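In an exemplary embodiment, the extraction module 20 being configured to extract the triples of entity relationships in the target text according to the dependency tree means: when, for a verb, the subject or the object of the triplet having that verb as its predicate cannot be extracted from the dependency tree, the extraction module 20 searches a predetermined domain knowledge base using the noun word segments associated with the verb in the dependency tree to determine the subject or the object.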

In an exemplary embodiment, the extraction module 20 is further configured to, when extracting the triples of entity relationships according to the dependency tree, remove words of the specified second type from the extracted relation triples and output the triples. The words of the specified second type include one or more of the following: stop words and definite articles. This makes the extracted relation triples more accurate.

Embodiments of the present application provide a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above.

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims

1. An information extraction method, comprising:

performing word segmentation on a target text; performing part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text;

extracting triples of entity relationships in the target text according to the dependency tree, which comprises performing the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet;

when, for a verb, the subject or the object of the triplet having that verb as its predicate cannot be extracted from the dependency tree, searching a predetermined domain knowledge base using the noun word segments associated with the verb in the dependency tree to determine the subject or the object.

2. The method of claim 1, wherein before the word segmentation of the target text, the method further comprises: removing special characters from the target text.

3. The method of claim 1, wherein before extracting the triples of entity relationships in the target text according to the dependency tree, the method further comprises:

performing chunk merging on the noun word segments in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree; the noun word segments comprise proper nouns and non-proper nouns; performing chunk merging using the predetermined rules includes at least one of:

merging two or more consecutive proper nouns into a chunk;

merging a proper noun with a non-proper noun that immediately follows it;

merging noun word segments that are separated by one punctuation mark or conjunction.

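4. The method of claim 3, wherein performing chunk merging on the noun word segments in the obtained dependency tree using the predetermined rules to obtain the chunk-merged dependency tree further comprises: treating each word obtained by chunk merging as a noun word segment and repeating the chunk merging according to the predetermined rules until no mergeable noun word segments remain, thereby obtaining the final chunk-merged dependency tree.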

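5. The method of claim 3, wherein before extracting the triples of entity relationships in the target text according to the dependency tree, the method further comprises:

using a coreference resolution model, replacing words of a specified first type in the chunk-merged dependency tree with the nouns to which the coreference resolution model resolves them, and updating the dependency tree; the specified first type of word includes at least one of: pronouns, names, and abbreviations.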

6. The method of claim 1, wherein extracting the triples of entity relationships according to the dependency tree further comprises:

removing words of a specified second type from the extracted relation triples and outputting the triples; the words of the specified second type include one or more of the following: stop words and definite articles.

7. An apparatus for directing delivery of content, comprising a processor and a memory, wherein the memory stores a program for directing delivery of content; the processor is configured to read the program for targeting content and perform the method of any of claims 1-6.

8. An information extraction apparatus, comprising:

a word segmentation and analysis module, configured to perform word segmentation on a target text; perform part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; and perform dependency parsing according to the part-of-speech tagging results to obtain a dependency tree over all word segments of the target text;

an extraction module, configured to extract triples of entity relationships in the target text according to the dependency tree, which comprises performing the following operations for each verb among the word segments: determining the verb as the predicate of a triplet, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triplet, and determining the entity whose dependency relation to the verb is object as the object of the triplet;

9. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the method according to any of claims 1-6.