CN111160030B - Information extraction method, device and storage medium - Google Patents

Info

Publication number
CN111160030B
Authority
CN
China
Prior art keywords
dependency
word
tree
verb
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911262829.8A
Other languages
Chinese (zh)
Other versions
CN111160030A (en
Inventor
付骁弈
张�杰
吴信东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN201911262829.8A
Publication of CN111160030A
Application granted
Publication of CN111160030B
Legal status: Active
Anticipated expiration

Abstract

An information extraction method comprises: performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text; and extracting triples of entity relationships in the target text according to the dependency tree. Because entity extraction can be carried out with general-purpose part-of-speech tagging and dependency parsing techniques, the application saves annotation cost and enhances the robustness of the system model.

Description

Information extraction method, device and storage medium
Technical Field
The present application relates to computer technology, and in particular, to a method and apparatus for extracting information, and a storage medium.
Background
Information Extraction (IE) is the process of automatically converting unstructured information embedded in text into structured data. Information extraction is widely applied in internet products and enterprise services. For example, when building a search or recommendation engine, information must first be extracted from the text content of web pages or recommended items. The extraction results can be used to de-duplicate documents and to construct search indexes and recommendation features more accurately, which saves storage cost and improves search and recommendation quality.
Existing methods model the entity extraction step with named entity recognition, which requires a domain-specific corpus to be annotated in advance. This step is time-consuming and labor-intensive, and it limits the range of application of information extraction. Moreover, the results extracted by such methods cannot restore concepts that are not mentioned in the original text and lack the guidance of domain knowledge, so the extracted triples are sparse and semantically inconsistent.
Disclosure of Invention
The application provides an information extraction method, an information extraction device and a storage medium, which can save annotation cost and enhance the robustness of the system model.
The application provides an information extraction method, comprising: performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text; and extracting triples of entity relationships in the target text according to the dependency tree, which comprises performing the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triple, and determining the entity whose dependency relation to the verb is object as the object of the triple.
In an exemplary embodiment, before the word segmentation is performed on the target text, the method further includes: removing special characters from the target text.
In an exemplary embodiment, before the triples of entity relationships in the target text are extracted according to the dependency tree, the method further includes: performing chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree. The chunk merging using predetermined rules includes at least one of: merging two or more consecutive proper nouns into a chunk; merging a proper noun and a non-proper noun that immediately follows it into a chunk; and merging noun-tagged words that are separated by a single punctuation mark or conjunction into a chunk.
In an exemplary embodiment, performing chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree further includes: treating each word obtained by chunk merging as a noun-tagged word and repeating the chunk merging according to the predetermined rules until no noun-tagged words that can be merged remain, thereby obtaining the final chunk-merged dependency tree.
In an exemplary embodiment, before the triples of entity relationships in the target text are extracted according to the dependency tree, the method further includes: using a coreference resolution model, replacing words of a specified first type in the chunk-merged dependency tree with the nouns resolved for them by the coreference resolution model, and updating the dependency tree. The specified first type of word includes at least one of: pronouns, names, and abbreviations.
In an exemplary embodiment, extracting the triples of entity relationships in the target text according to the dependency tree includes: when, for a verb, the subject or the object of the triple having that verb as its predicate cannot be extracted from the dependency tree, searching a predetermined domain knowledge base using the noun-tagged words associated with the verb in the dependency tree to determine the subject or the object.
In an exemplary embodiment, extracting the triples of entity relationships according to the dependency tree further includes: removing words of a specified second type from the extracted relation triples before outputting them. The specified second type of word includes one or more of: stop words and definite articles.
The application also provides a device for directing and delivering the content, which comprises a processor and a memory, wherein the memory stores a program for directing and delivering the content; the processor is configured to read the program for directing delivery of content, and perform the method of any of the embodiments.
The application also provides an information extraction device, comprising: a word segmentation and analysis module, configured to perform word segmentation on a target text, perform part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word, and perform dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text; and an extraction module, configured to extract triples of entity relationships in the target text according to the dependency tree, which means that the extraction module performs the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triple, and determining the entity whose dependency relation to the verb is object as the object of the triple.
The application also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the embodiments.
Compared with the related art, the application segments the target text, performs part-of-speech tagging and dependency parsing, and extracts triples from the parsing result, so that entity extraction can be performed with general-purpose part-of-speech tagging and dependency parsing techniques. This saves annotation cost and enhances the robustness of the system model.
When the subject or the object of a triple having a verb as its predicate cannot be extracted from the dependency tree, the application searches a predetermined domain knowledge base using the noun-tagged words associated with the verb in the dependency tree to determine the subject or the object, which makes relation extraction more accurate.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. Other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide an understanding of the principles of the application, and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain, without limitation, the principles of the application.
FIG. 1 is a flow chart of an information extraction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a target text word segmentation and labeling result according to a first embodiment of the present application;
FIG. 3 is a block diagram of an information extraction device according to an embodiment of the application.
Detailed Description
The present application has been described in terms of several embodiments, but the description is illustrative and not restrictive, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the described embodiments. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or in place of any other feature or element of any other embodiment unless specifically limited.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The disclosed embodiments, features and elements of the present application may also be combined with any conventional features or elements to form a unique inventive arrangement as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive arrangements to form another unique inventive arrangement as defined in the claims. It is therefore to be understood that any of the features shown and/or discussed in the present application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Further, various modifications and changes may be made within the scope of the appended claims.
Furthermore, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
The technical scheme of the application will be described in more detail below with reference to the accompanying drawings and examples.
As shown in FIG. 1, an information extraction method according to an embodiment of the present application includes the following steps:
S101, performing word segmentation on a target text, and performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word;
S102, performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text;
S103, extracting triples of entity relationships in the target text according to the dependency tree, which includes performing the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triple, and determining the entity whose dependency relation to the verb is object as the object of the triple.
Illustratively, word segmentation, part-of-speech analysis and dependency parsing are performed on the target text using natural language processing software, including but not limited to Stanford CoreNLP and the Harbin Institute of Technology Language Technology Platform (LTP).
In one exemplary embodiment, as shown in FIG. 2, take the target text "Bob hit Mary, she was then sent to the hospital." as an example. Word segmentation with the Stanford CoreNLP annotation system yields 10 tokens, including "Bob", "hit", "Mary", "she", "then", "sent to", "hospital", "," and ".", as shown in the third column of FIG. 2. Part-of-speech analysis is then performed; the result is shown in the fifth column of FIG. 2, where the tags assigned to the tokens are 'NR', 'VV', 'NR', 'PU', 'PN', 'AD', 'SB', 'NN', 'PU'. Here NR denotes a proper noun, NN denotes other nouns, VV denotes a verb, and PU denotes punctuation. These tags are abbreviations commonly used in the field and are not described further here.
In one exemplary embodiment, as shown in FIG. 2, the verbs in the sentence "Bob hit Mary, she was then sent to the hospital." are "hit" and "sent to". The subject that has a dependency relation with the verb "hit" is "Bob" and the object is "Mary", so the first extracted triple is (Bob, hit, Mary); similarly, the second extracted triple is (Mary, sent to, hospital).
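The per-verb extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Token` structure, the head indices, the label sets `SUBJECT_RELS` and `OBJECT_RELS`, and the placeholder text "bei" for the passive marker are assumptions introduced for the example. The sketch also assumes the pronoun "she" has already been replaced by "Mary", and it inspects only the direct dependents of each verb instead of traversing the whole subtree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    text: str     # surface form
    pos: str      # part-of-speech tag (NR, NN, VV, ...)
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency label (nsubj, dobj, ...)


# Hypothetical label sets; a real parser may use other label names.
SUBJECT_RELS = {"nsubj", "nsubjpass"}
OBJECT_RELS = {"dobj"}

Triple = Tuple[Optional[str], str, Optional[str]]


def extract_triples(tokens: List[Token]) -> List[Triple]:
    """Build one (subject, predicate, object) triple per verb."""
    triples: List[Triple] = []
    for verb in tokens:
        if verb.pos != "VV":
            continue
        subj: Optional[str] = None
        obj: Optional[str] = None
        # Look at the tokens governed directly by this verb.
        for tok in tokens:
            if tok.head != verb.idx:
                continue
            if tok.deprel in SUBJECT_RELS and subj is None:
                subj = tok.text
            elif tok.deprel in OBJECT_RELS and obj is None:
                obj = tok.text
        triples.append((subj, verb.text, obj))
    return triples


# Example sentence; head indices are assumed for illustration.
SENTENCE = [
    Token(1, "Bob", "NR", 2, "nsubj"),
    Token(2, "hit", "VV", 0, "ROOT"),
    Token(3, "Mary", "NR", 2, "dobj"),
    Token(4, ",", "PU", 2, "punct"),
    Token(5, "Mary", "NR", 8, "nsubj"),    # "she" after coreference replacement
    Token(6, "then", "AD", 8, "advmod"),
    Token(7, "bei", "SB", 8, "auxpass"),   # passive marker (placeholder text)
    Token(8, "sent to", "VV", 2, "conj"),
    Token(9, "hospital", "NN", 8, "dobj"),
    Token(10, ".", "PU", 2, "punct"),
]

print(extract_triples(SENTENCE))
```

A verb with no subject or object among its dependents yields `None` in that slot, which is the case the knowledge-base lookup described later is meant to handle.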
In an exemplary embodiment, before the word segmentation of the target text, the method further includes step S104: removing special characters from the target text.
For example, a special character library may be created, and any character in the target text that matches an entry in the library, such as @ or #, is removed.
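A minimal sketch of this preprocessing step is shown below. The contents of `SPECIAL_CHARS` are an assumption; the source names only @ and # as examples, and a real library would be chosen per application.

```python
# Hypothetical special character library; only "@" and "#" come from the text.
SPECIAL_CHARS = frozenset("@#$%^&*~")


def remove_special_chars(text: str, special: frozenset = SPECIAL_CHARS) -> str:
    """Drop every character of the text that appears in the library."""
    return "".join(ch for ch in text if ch not in special)


print(remove_special_chars("Bob hit @Mary #news"))
```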
In an exemplary embodiment, before the triples of entity relationships in the target text are extracted according to the dependency tree, the method further includes:
step S105, performing chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree.
In an exemplary embodiment, in step S105, the chunk merging using predetermined rules includes at least one of the following:
A. merging two or more consecutive proper nouns into a chunk;
B. merging a proper noun and a non-proper noun that immediately follows it into a chunk;
C. merging noun-tagged words that are separated by a single punctuation mark or conjunction into a chunk.
In an exemplary embodiment, in step S105, performing chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree further includes:
treating each word obtained by chunk merging as a noun-tagged word and repeating the chunk merging according to the predetermined rules until no noun-tagged words that can be merged remain, thereby obtaining the final chunk-merged dependency tree.
For example, the phrase "artificial intelligence, big data and Internet of Things technology" consists of three proper nouns ("artificial intelligence", "big data" and "Internet of Things"), one punctuation mark, one conjunction, and one other noun ("technology"). This step builds the final chunk by iterative merging: "Internet of Things technology" -> "big data and Internet of Things technology" -> "artificial intelligence, big data and Internet of Things technology". Taking the sentence "Bob hit Mary, she was then sent to the hospital." in FIG. 2 as an example again, the result of the chunk analysis is shown in the sixth column of FIG. 2, where nsubj denotes a nominal subject; ROOT denotes the root of the sentence to be processed; dobj denotes a direct object; advmod denotes an adverbial modifier; auxpass denotes a passive auxiliary; conj denotes a conjunct linking two coordinated words; and punct denotes punctuation.
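The iterative merging under rules A, B and C can be sketched as below. This is a simplified illustration that works on a flat list of (text, tag) pairs instead of a dependency tree. The `CC` tag for conjunctions, the tag given to a merged chunk, and the order in which the rules are tried are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)

NOUN_TAGS = {"NR", "NN"}        # proper noun, other noun
SEPARATOR_TAGS = {"PU", "CC"}   # punctuation, conjunction (CC is an assumed tag)


def _merged_tag(p1: str, p2: str) -> str:
    # Assumption: a chunk containing a proper noun is treated as a proper noun.
    return "NR" if "NR" in (p1, p2) else "NN"


def merge_once(tokens: List[Token]) -> Optional[List[Token]]:
    """Apply a single merge, or return None when no rule applies."""
    # Rules A and B: a proper noun followed directly by any noun.
    for i in range(len(tokens) - 1):
        (t1, p1), (t2, p2) = tokens[i], tokens[i + 1]
        if p1 == "NR" and p2 in NOUN_TAGS:
            return tokens[:i] + [(f"{t1} {t2}", "NR")] + tokens[i + 2:]
    # Rule C: two nouns separated by one punctuation mark or conjunction,
    # scanned from the right so the rightmost group merges first.
    for i in range(len(tokens) - 3, -1, -1):
        (t1, p1), (sep, ps), (t2, p2) = tokens[i:i + 3]
        if p1 in NOUN_TAGS and ps in SEPARATOR_TAGS and p2 in NOUN_TAGS:
            gap = "" if ps == "PU" else " "   # no space before punctuation
            merged = (f"{t1}{gap}{sep} {t2}", _merged_tag(p1, p2))
            return tokens[:i] + [merged] + tokens[i + 3:]
    return None


def merge_chunks(tokens: List[Token]) -> List[Token]:
    """Repeat merging until no mergeable noun-tagged words remain."""
    while True:
        merged = merge_once(tokens)
        if merged is None:
            return tokens
        tokens = merged


PHRASE = [
    ("artificial intelligence", "NR"),
    (",", "PU"),
    ("big data", "NR"),
    ("and", "CC"),
    ("Internet of Things", "NR"),
    ("technology", "NN"),
]

print(merge_chunks(PHRASE))
```

With this rule order the intermediate chunks follow the sequence given in the example: "Internet of Things technology", then "big data and Internet of Things technology", then the full phrase.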
In an exemplary embodiment, before the triples of entity relationships in the target text are extracted according to the dependency tree, the method further includes:
S106, using a coreference resolution model, replacing words of a specified first type in the chunk-merged dependency tree with the nouns resolved for them by the coreference resolution model, and updating the dependency tree.
In an exemplary embodiment, the specified first type of word includes at least one of: pronouns, names, and abbreviations.
In one exemplary embodiment, a natural language processing software package, including but not limited to Stanford CoreNLP, is invoked to replace the pronouns in the text to be analyzed with the results of the coreference resolution model. For example, in "Bob hit Mary, she was then sent to the hospital.", "she" needs to be replaced with "Mary".
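The replacement step can be sketched as follows, assuming the coreference resolution model has already produced a mapping from token positions to antecedents. The `antecedents` dictionary, the `REPLACEABLE_TAGS` set and the tag assigned to the replaced token are hypothetical; the sketch does not call any real coreference library.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)

# Assumption: only pronoun-tagged tokens (PN) are replaced in this sketch.
REPLACEABLE_TAGS = {"PN"}


def apply_coreference(tokens: List[Token], antecedents: Dict[int, str]) -> List[Token]:
    """Replace resolvable tokens with the noun they refer to.

    `antecedents` maps a 0-based token position to the antecedent text and
    stands in for the output of a coreference resolution model.
    """
    resolved: List[Token] = []
    for i, (text, pos) in enumerate(tokens):
        if pos in REPLACEABLE_TAGS and i in antecedents:
            # Assumption: the replacement is tagged as a proper noun.
            resolved.append((antecedents[i], "NR"))
        else:
            resolved.append((text, pos))
    return resolved


TOKENS = [("Bob", "NR"), ("hit", "VV"), ("Mary", "NR"), (",", "PU"),
          ("she", "PN"), ("then", "AD"), ("sent to", "VV"), ("hospital", "NN")]

print(apply_coreference(TOKENS, {4: "Mary"}))
```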
In an exemplary embodiment, extracting the triples of entity relationships in the target text according to the dependency tree includes:
when, for a verb, the subject or the object of the triple having that verb as its predicate cannot be extracted from the dependency tree, searching a predetermined domain knowledge base using the noun-tagged words associated with the verb in the dependency tree to determine the subject or the object.
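The source does not specify how the domain knowledge base is organized or queried. The sketch below assumes, purely for illustration, that it is a list of known triples and that a missing slot is filled from the first entry sharing the predicate and the known noun. `KNOWLEDGE_BASE` and its contents are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[Optional[str], str, Optional[str]]

# Hypothetical domain knowledge base stored as complete triples.
KNOWLEDGE_BASE: List[Triple] = [
    ("doctor", "treats", "patient"),
    ("hospital", "admits", "patient"),
]


def complete_triple(triple: Triple, kb: List[Triple]) -> Triple:
    """Fill a missing subject or object from the knowledge base if possible."""
    subj, pred, obj = triple
    if subj is not None and obj is not None:
        return triple  # nothing is missing
    for kb_subj, kb_pred, kb_obj in kb:
        if kb_pred != pred:
            continue
        # Subject missing: match on the known object.
        if subj is None and obj is not None and obj == kb_obj:
            return (kb_subj, pred, obj)
        # Object missing: match on the known subject.
        if obj is None and subj is not None and subj == kb_subj:
            return (subj, pred, kb_obj)
    return triple  # no matching entry; leave the triple incomplete


print(complete_triple((None, "treats", "patient"), KNOWLEDGE_BASE))
print(complete_triple(("hospital", "admits", None), KNOWLEDGE_BASE))
```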
In an exemplary embodiment, extracting the triples of entity relationships according to the dependency tree further includes step S107: removing words of a specified second type from the extracted relation triples before outputting them. The specified second type of word includes one or more of: stop words and definite articles. This makes the extracted relation triples more accurate.
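A minimal sketch of this clean-up step, assuming whitespace-separated English phrases and a small hypothetical `STOP_WORDS` set (the source does not list the words to remove):

```python
from typing import Optional, Set, Tuple

Triple = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical stop word list, including the definite article.
STOP_WORDS: Set[str] = {"the", "a", "an"}


def clean_phrase(phrase: Optional[str], stop_words: Set[str]) -> Optional[str]:
    """Remove stop words from one element of a triple."""
    if phrase is None:
        return None
    kept = [w for w in phrase.split() if w.lower() not in stop_words]
    return " ".join(kept)


def clean_triple(triple: Triple, stop_words: Set[str] = STOP_WORDS) -> Triple:
    """Apply stop word removal to subject, predicate and object."""
    subj, pred, obj = triple
    return (clean_phrase(subj, stop_words),
            clean_phrase(pred, stop_words),
            clean_phrase(obj, stop_words))


print(clean_triple(("The doctor", "treated", "the patient")))
```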
As shown in FIG. 3, an information extraction device according to an embodiment of the present application includes the following modules:
the word segmentation and analysis module 10, configured to perform word segmentation on a target text, perform part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word, and perform dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text;
the extraction module 20, configured to extract triples of entity relationships in the target text according to the dependency tree, which means that the extraction module 20 performs the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triple, and determining the entity whose dependency relation to the verb is object as the object of the triple.
In an exemplary embodiment, the device further comprises:
the chunk merging module 30, configured to perform chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree.
In one exemplary embodiment, the chunk merging performed by the chunk merging module 30 using predetermined rules includes at least one of the following:
A. the chunk merging module 30 merges two or more consecutive proper nouns into a chunk;
B. the chunk merging module 30 merges a proper noun and a non-proper noun that immediately follows it into a chunk;
C. the chunk merging module 30 merges noun-tagged words that are separated by a single punctuation mark or conjunction into a chunk.
In an exemplary embodiment, the chunk merging module 30, which is configured to perform chunk merging on the noun-tagged words in the obtained dependency tree using predetermined rules to obtain a chunk-merged dependency tree, is further configured to:
treat each word obtained by chunk merging as a noun-tagged word and repeat the chunk merging according to the predetermined rules until no noun-tagged words that can be merged remain, thereby obtaining the final chunk-merged dependency tree.
In an exemplary embodiment, the extraction module 20 is further configured to, before extracting the triples of entity relationships in the target text according to the dependency tree:
use a coreference resolution model to replace words of a specified first type in the chunk-merged dependency tree with the nouns resolved for them by the coreference resolution model, and update the dependency tree.
In an exemplary embodiment, the specified first type of word includes at least one of: pronouns, names, and abbreviations.
In an exemplary embodiment, the extraction module 20 being configured to extract the triples of entity relationships in the target text according to the dependency tree means:
when, for a verb, the subject or the object of the triple having that verb as its predicate cannot be extracted from the dependency tree, searching a predetermined domain knowledge base using the noun-tagged words associated with the verb in the dependency tree to determine the subject or the object.
In an exemplary embodiment, the extraction module 20 is further configured to, after extracting the triples of entity relationships according to the dependency tree, remove words of a specified second type from the extracted relation triples before outputting them. The specified second type of word includes one or more of: stop words and definite articles. This makes the extracted relation triples more accurate.
The application also provides a device for directing and delivering the content, which comprises a processor and a memory, wherein the memory stores a program for directing and delivering the content; the processor is configured to read the program for directing delivery of content, and perform the method of any of the embodiments.
Embodiments of the present application provide a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims (9)

1. An information extraction method, comprising:
performing word segmentation on a target text; performing part-of-speech tagging on each segmented word to obtain a part-of-speech tagging result for each word; performing dependency parsing according to the part-of-speech tagging results to obtain a dependency tree of all the segmented words of the target text;
extracting triples of entity relationships in the target text according to the dependency tree, comprising performing the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple, traversing the nouns related to the verb in the dependency tree with the verb as the root node, determining the entity whose dependency relation to the verb is subject as the subject of the triple, and determining the entity whose dependency relation to the verb is object as the object of the triple;
wherein, when, for a verb, the subject or the object of the triple having that verb as its predicate cannot be extracted from the dependency tree, a predetermined domain knowledge base is searched using the noun-tagged words associated with the verb in the dependency tree to determine the subject or the object.
2. The method of claim 1, further comprising, before performing word segmentation on the target text: removing special characters from the target text.
3. The method of claim 1, further comprising, before extracting the triples of entity relationships in the target text according to the dependency tree:
performing chunk merging on the noun-part-of-speech word segments in the obtained dependency tree using a predetermined rule, to obtain a chunk-merged dependency tree; wherein the noun-part-of-speech word segments comprise proper nouns and non-proper nouns, and performing chunk merging using the predetermined rule comprises at least one of:
merging two or more consecutive proper nouns into one chunk;
merging a proper noun with the non-proper noun that immediately follows it;
merging noun-part-of-speech word segments that are separated by a single punctuation mark or conjunction.
4. The method of claim 3, wherein performing chunk merging on the noun-part-of-speech word segments in the obtained dependency tree using the predetermined rule to obtain the chunk-merged dependency tree further comprises:
treating each word obtained by chunk merging as a noun-part-of-speech word segment and repeating the chunk merging according to the predetermined rule until no noun-part-of-speech word segments that can be merged remain, thereby obtaining the final chunk-merged dependency tree.
5. The method of claim 3, further comprising, before extracting the triples of entity relationships in the target text according to the dependency tree:
using a coreference resolution model, replacing words of a specified first type in the chunk-merged dependency tree with the nouns to which the coreference resolution model resolves them, and updating the dependency tree;
the specified first type of word includes at least one of: pronouns, names, and abbreviations.
6. The method of claim 1, wherein extracting the triples of entity relationships according to the dependency tree further comprises:
removing words of a specified second type from the extracted relation triples before outputting the triples; the words of the specified second type include one or more of the following: stop words and definite articles.
7. An apparatus for directing delivery of content, comprising a processor and a memory, wherein the memory stores a program for directing delivery of content; the processor is configured to read the program for directing delivery of content and perform the method of any of claims 1-6.
8. An information extraction apparatus, comprising:
a word segmentation and analysis module, configured to perform word segmentation on a target text; perform part-of-speech tagging on each word segment to obtain a part-of-speech tagging result for each word segment; and perform dependency parsing according to the part-of-speech tagging result of each word segment to obtain a dependency tree of all the word segments of the target text;
an extraction module, configured to extract triples of entity relationships in the target text according to the dependency tree, including performing the following operations for each verb obtained by the word segmentation: determining the verb as the predicate of a triple; traversing the nouns related to the verb in the dependency tree with the verb as the root node; determining an entity whose dependency relation to the verb is that of subject as the subject of the triple; and determining an entity whose dependency relation to the verb is that of object as the object of the triple;
and, when for a verb the subject or the object of the triple having that verb as its predicate is not extracted from the dependency tree, searching a predetermined domain knowledge base, using a noun-part-of-speech word segment associated with the verb in the dependency tree, to determine the subject or the object.
9. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the method according to any of claims 1-6.
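
For illustration only, the verb-centred triple extraction recited in claims 1 and 8 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Token` structure, the dependency labels (`nsubj`, `obj`, `dobj`), the part-of-speech tags, and the shape of the knowledge base (a mapping from a noun and a role to an entity) are all hypothetical, since the claims do not fix any of them. Only nouns directly attached to the verb are examined, which simplifies the traversal described in the claims, and the chunk merging, coreference resolution and stop-word removal of claims 3-6 are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN", "PROPN", "VERB"
    head: int  # index of the governing token, -1 for the sentence root
    dep: str   # dependency label relative to the head, e.g. "nsubj", "obj"


# Assumed label inventories; a real tagger or parser defines its own.
SUBJECT_DEPS = {"nsubj"}
OBJECT_DEPS = {"obj", "dobj"}
NOUN_POS = {"NOUN", "PROPN"}

# Hypothetical knowledge base: (noun text, missing role) -> entity.
KnowledgeBase = Dict[Tuple[str, str], str]
Triple = Tuple[Optional[str], str, Optional[str]]


def lookup(nouns: List[Token], role: str, kb: KnowledgeBase) -> Optional[str]:
    """Return the first knowledge-base entity found for the given role."""
    for noun in nouns:
        entity = kb.get((noun.text, role))
        if entity is not None:
            return entity
    return None


def extract_triples(tokens: List[Token], kb: KnowledgeBase) -> List[Triple]:
    """Build one (subject, predicate, object) triple per verb."""
    triples: List[Triple] = []
    for i, verb in enumerate(tokens):
        if verb.pos != "VERB":
            continue
        # Nouns that depend directly on this verb (simplified traversal).
        nouns = [t for t in tokens if t.head == i and t.pos in NOUN_POS]
        subject = next((t.text for t in nouns if t.dep in SUBJECT_DEPS), None)
        obj = next((t.text for t in nouns if t.dep in OBJECT_DEPS), None)
        # Fall back to the knowledge base for a slot the tree did not fill.
        if subject is None:
            subject = lookup(nouns, "subject", kb)
        if obj is None:
            obj = lookup(nouns, "object", kb)
        triples.append((subject, verb.text, obj))
    return triples


# Both the subject and the object are found in the tree.
sentence = [
    Token("Acme", "PROPN", 1, "nsubj"),
    Token("acquired", "VERB", -1, "root"),
    Token("Beta", "PROPN", 1, "obj"),
]
print(extract_triples(sentence, {}))
# [('Acme', 'acquired', 'Beta')]

# The subject is missing, so the knowledge base is consulted.
fragment = [
    Token("released", "VERB", -1, "root"),
    Token("report", "NOUN", 0, "obj"),
]
kb = {("report", "subject"): "Finance Department"}
print(extract_triples(fragment, kb))
# [('Finance Department', 'released', 'report')]
```

In this sketch a slot that neither the tree nor the knowledge base can fill is left as `None`; whether such incomplete triples are kept or discarded is a design choice the claims leave open.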
CN201911262829.8A 2019-12-11 2019-12-11 Information extraction method, device and storage medium Active CN111160030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911262829.8A CN111160030B (en) 2019-12-11 2019-12-11 Information extraction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911262829.8A CN111160030B (en) 2019-12-11 2019-12-11 Information extraction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111160030A CN111160030A (en) 2020-05-15
CN111160030B true CN111160030B (en) 2023-09-19

Family

ID=70556890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911262829.8A Active CN111160030B (en) 2019-12-11 2019-12-11 Information extraction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111160030B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230140938A1 (en) * 2020-04-10 2023-05-11 Nippon Telegraph And Telephone Corporation Sentence data analysis information generation device using ontology, sentence data analysis information generation method, and sentence data analysis information generation program

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814466A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Information extraction method based on machine reading understanding and related equipment thereof
CN112948536A (en) * 2020-11-09 2021-06-11 袭明科技(广东)有限公司 Information extraction method and device for web resume page
CN112269884B (en) * 2020-11-13 2024-03-05 北京百度网讯科技有限公司 Information extraction method, device, equipment and storage medium
CN113468878A (en) * 2021-07-13 2021-10-01 腾讯科技(深圳)有限公司 Part-of-speech tagging method and device, electronic equipment and storage medium
CN114186552B (en) * 2021-12-13 2023-04-07 北京百度网讯科技有限公司 Text analysis method, device and equipment and computer storage medium
CN116484870B (en) * 2022-09-09 2024-01-05 北京百度网讯科技有限公司 Method, device, equipment and medium for extracting text information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010050675A2 (en) * 2008-10-29 2010-05-06 한국과학기술원 Method for automatically extracting relation triplets through a dependency grammar parse tree
CN104573028A (en) * 2015-01-14 2015-04-29 百度在线网络技术(北京)有限公司 Intelligent question-answer implementing method and system
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 Chinese unsupervised open entity relation extraction method based on dependency semantics
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on sentence semantic structure model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
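Research on a Chunk-Based Chinese Automatic Summarization System; Suo Hongguang et al.; Computer Systems & Applications (《计算机系统应用》); 20070331 (No. 03); pp. 97-100 *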

Also Published As

Publication number Publication date
CN111160030A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111160030B (en) Information extraction method, device and storage medium
Nothman et al. Learning multilingual named entity recognition from Wikipedia
JP6749110B2 (en) Language identification in social media
US10956662B2 (en) List manipulation in natural language processing
CN107247707B (en) Enterprise association relation information extraction method and device based on completion strategy
CN110874531A (en) Topic analysis method and device and storage medium
CN109582799B (en) Method and device for determining knowledge sample data set and electronic equipment
US10013404B2 (en) Targeted story summarization using natural language processing
CN111178079B (en) Triplet extraction method and device
CN111737499B (en) Data searching method based on natural language processing and related equipment
CN109145110B (en) Label query method and device
US20100161655A1 (en) System for string matching based on segmentation method and method thereof
US10592236B2 (en) Documentation for version history
GB2555207A (en) System and method for identifying passages in electronic documents
US20190179888A1 (en) Data standardization rules generation
WO2020020287A1 (en) Text similarity acquisition method, apparatus, device, and readable storage medium
US9779363B1 (en) Disambiguating personal names
Rehman et al. Morpheme matching based text tokenization for a scarce resourced language
CN104281716A (en) Parallel corpus alignment method and device
CN105446986A (en) Web page processing method and device
US11182545B1 (en) Machine learning on mixed data documents
CN108875743B (en) Text recognition method and device
CN111133429A (en) Extracting expressions for natural language processing
CN111046627A (en) Chinese character display method and system
US9946765B2 (en) Building a domain knowledge and term identity using crowd sourcing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant