CN112613321A - Method and system for extracting entity attribute information in text

Info

Publication number: CN112613321A
Application number: CN202011493868.1A
Authority: CN
Inventors: 张子彪; 张御昊; 邵子豪
Original assignee: Nanjing Digital Information Technology Co ltd
Current assignee: Nanjing Digital Information Technology Co ltd
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2021-04-06

Abstract

The invention discloses a method and a system for extracting entity attribute information in a text. The method comprises the following steps: acquiring a target text and segmenting it into words; confirming whether the segmented target text contains an entity keyword and, if so, determining the part of speech of a first target segmentation preceding the entity keyword; confirming whether the part of speech of the first target segmentation meets a preset entity attribute rule and, if so, adding the first target segmentation to first target entity attribute information; performing entity extraction again in the target text using the first target entity attribute information as a parameter to obtain an extraction result; and determining second target entity attribute information of the target text according to the extraction result. In this way the extracted first target entity attribute information is verified, and a secondary entity extraction can be performed on the target text to ensure the completeness of the final entity attribute information, so that missed entity extractions are avoided.

Description

Method and system for extracting entity attribute information in text

Technical Field

The invention relates to the technical field of text processing, in particular to a method and a system for extracting entity attribute information in a text.

Background

An online encyclopedia, also called a network encyclopedia, is an encyclopedia that is published on the internet and consulted by internet users. Network encyclopedias are either open or non-open. Users can conveniently and promptly look up all kinds of information in an online encyclopedia. Moreover, because internet users take part in building open encyclopedias, the information in them is more open, transparent, rich, and complete. Well-known open network encyclopedias include Wikipedia, popular encyclopedia, interactive encyclopedia, and the like.

An online encyclopedia describes various entities for users to look up. An entity is an objective thing in the real world, that is, anything that can be distinguished and identified. An entity may be not only a tangible object but also an abstract event. Entity attributes are the basic characteristics of an entity. They help people understand the entity comprehensively and objectively: the more attributes are available, the more detailed the description of the entity. Extracting entity attributes is therefore important. In a knowledge base, to help machines understand knowledge, the relationships and attributes of entities (collectively referred to as entity relationships) are usually mapped to relationships predefined by a schema, and the entity attribute information has to be extracted from text. Existing methods mainly extract it by applying entity extraction rules, which has the following problems: the entity attribute information extracted with the rules may be incomplete, and the prior art neither verifies the extracted information nor performs a secondary extraction. The final entity attribute information may therefore be incomplete, which seriously affects the user experience.

Disclosure of Invention

In view of the above problems, the invention provides a method for extracting entity attribute information in a text. It addresses the problems noted in the background: entity attribute information extracted with entity rules may be incomplete, and because the prior art neither verifies the extracted information nor extracts it a second time, the final entity attribute information may be incomplete, which seriously affects the user experience.

A method for extracting entity attribute information in a text comprises the following steps:

acquiring a target text, and segmenting words of the target text;

confirming whether the segmented target text contains entity keywords or not, if so, determining the part of speech of a first target segmented word in front of the entity keywords;

confirming whether the part of speech of the first target participle accords with a preset entity attribute rule, if so, adding the first target participle into first target entity attribute information;

entity extraction is carried out in the target text again by using the first target entity attribute information as a parameter, and an extraction result is obtained;

and determining second target entity attribute information of the target text according to the extraction result.

Preferably, the obtaining the target text and performing word segmentation on the target text includes:

checking the integrity of the target text;

after the check is completed, segmenting the target text sentence by sentence to obtain a first word segmentation set for each sentence that makes up the target text;

performing uniqueness check on the word segmentation result in each first word segmentation set, and deleting second target words which appear repeatedly to obtain a plurality of second word segmentation sets;

and sequencing and displaying the plurality of second word segmentation sets according to the sequence of the composition sentences to obtain a target word segmentation list of the target text.
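The segmentation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `segment_text` is hypothetical, the integrity check is reduced to an emptiness test, and tokens are found with a simple regular expression, which assumes whitespace-delimited text (Chinese text would need a dedicated word segmenter).

```python
import re

def segment_text(text):
    """Split text into sentences, tokenize each, and drop repeated tokens."""
    # Stand-in for the integrity check: reject empty input (assumption).
    if not text or not text.strip():
        raise ValueError("target text is empty")
    # Split on Latin and CJK sentence-ending punctuation, keeping sentence order.
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    token_list = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())  # first word segmentation set
        unique = list(dict.fromkeys(tokens))           # duplicates removed, order kept
        token_list.append(unique)                      # second word segmentation set
    return token_list

result = segment_text("The red red car stopped. The driver left the car.")
print(result)
```

Using `dict.fromkeys` removes repeated words while preserving the order of first occurrence, so the resulting list still follows the order of the sentences and of the words within each sentence.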

Preferably, the step of determining whether the segmented target text contains an entity keyword, and if yes, determining a part of speech of a first target segment ahead of the entity keyword includes:

analyzing the text content of the target text;

extracting a target semantic relation in the text content, and selecting a plurality of first candidate keywords which accord with the target semantic relation weight in the target text;

vectorizing the target text by using a preset vector space model to construct a vector matrix;

inputting the plurality of first candidate keywords into the vector matrix to determine a target first candidate keyword with the maximum relevance with the matrix parameters in the vector matrix as the entity keyword;

confirming the target position of the target first candidate keyword in the target word segmentation list, and acquiring N first target word segmentations before the target first candidate keyword based on the target position;

and calling a preset word bank to analyze the part of speech of each first target word segmentation.
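A rough sketch of the keyword selection and part-of-speech lookup described above is given below. The source does not specify the vector space model or the relevance measure, so this example assumes a plain sentence-by-term count matrix and scores each candidate by its mean count across rows. `LEXICON`, `build_vector_matrix`, `pick_entity_keyword`, and `tag_preceding` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical part-of-speech lexicon standing in for the "preset word bank".
LEXICON = {"the": "DET", "fast": "ADJ", "red": "ADJ", "car": "NOUN", "stopped": "VERB"}

def build_vector_matrix(sentences):
    """Build a sentence-by-term count matrix (one row per sentence)."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    rows = []
    for sent in sentences:
        counts = Counter(sent)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def pick_entity_keyword(sentences, candidates):
    """Return the candidate with the highest mean count over all rows (assumed relevance)."""
    vocab, rows = build_vector_matrix(sentences)
    column = {term: i for i, term in enumerate(vocab)}

    def relevance(term):
        if term not in column:
            return 0.0
        return sum(row[column[term]] for row in rows) / len(rows)

    return max(candidates, key=relevance)

def tag_preceding(tokens, keyword, n):
    """Take up to n tokens before the first occurrence of keyword and look up their POS."""
    pos = tokens.index(keyword)
    before = tokens[max(0, pos - n):pos]
    return [(tok, LEXICON.get(tok, "UNK")) for tok in before]

sentences = [["the", "fast", "red", "car"], ["the", "car", "stopped"]]
keyword = pick_entity_keyword(sentences, ["car", "stopped"])
tokens = [tok for sent in sentences for tok in sent]
tagged = tag_preceding(tokens, keyword, 2)
print(keyword, tagged)
```

Any other relevance measure (for example a cosine similarity against a weighted centroid) could be substituted in `relevance` without changing the rest of the flow.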

Preferably, the determining whether the part of speech of the first target participle meets a preset entity attribute rule, if yes, adding the first target participle to first target entity attribute information includes:

determining a target language class of the target text;

constructing a target language class vector model, and judging the feature vector of the part of speech of the first target participle by using the target language class vector model;

performing feature extraction on the part-of-speech feature vector of the first target word segmentation to obtain target features of the part-of-speech feature vector of the first target word segmentation;

and confirming whether the composition type and the characteristic parameters of the target characteristics accord with the preset entity attribute rule, if so, adding the first target word segmentation into first target entity attribute information of a target text.

Preferably, the extracting the entity again in the target text by using the first target entity attribute information as a parameter to obtain an extraction result includes:

counting the target number of first target word segmentations in the first target entity attribute information;

sorting the target number of first target word segmentations by structure length to obtain a sorting result;

detecting the text content relevance in the target text by using the sequencing result to obtain a target relevance detection result, confirming that entity extraction is not required to be carried out on the target text again when the target relevance detection result is that the relevance is greater than or equal to a first preset threshold value, and otherwise, confirming that the entity extraction is required to be carried out on the target text again;

and performing secondary entity extraction on the target text by taking the sequencing result as an extraction parameter to obtain an extraction result.
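The decision whether a second extraction pass is needed can be illustrated as below. The source does not define how the text content relevance is computed, so this sketch assumes it is the fraction of sentences that contain at least one already-extracted attribute. The names `relevance` and `plan_second_pass` and the default threshold of 0.8 are hypothetical.

```python
def relevance(sentences, ordered_attrs):
    """Assumed relevance: share of sentences containing at least one extracted attribute."""
    if not sentences:
        return 1.0
    covered = sum(1 for sent in sentences if any(a in sent for a in ordered_attrs))
    return covered / len(sentences)

def plan_second_pass(sentences, attrs, threshold=0.8):
    """Sort attributes by length (longest first) and decide whether to extract again."""
    ordered = sorted(set(attrs), key=lambda a: (-len(a), a))
    score = relevance(sentences, ordered)
    # Relevance at or above the threshold means no second extraction is required.
    return ordered, score, score < threshold

sentences = [["fast", "red", "car"], ["blue", "car"], ["car", "stopped"]]
ordered, score, needs_second_pass = plan_second_pass(sentences, ["red", "fast"])
print(ordered, round(score, 2), needs_second_pass)
```

When `needs_second_pass` is true, the sorted list `ordered` would be handed to the second extraction as its parameter.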

Preferably, the determining the second target entity attribute information of the target text according to the extraction result includes:

when the extraction result is that no new second target word segmentation is extracted, confirming that the first target entity attribute information is the final target entity attribute information of the target text;

and when the extraction result is that a new second target word segmentation exists, adding the second target word segmentation into the first target entity attribute information to obtain second target entity attribute information, and confirming the second target entity attribute information as final target entity attribute information of a target text.
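The two cases above amount to a simple merge, sketched here under the assumption that attribute information is held as an ordered list of words. The function name `finalize_attributes` is hypothetical.

```python
def finalize_attributes(first_info, second_pass_result):
    """Return the final attribute list after the second extraction pass."""
    # Keep only words that the first pass did not already find, without repeats.
    new_words = [w for w in dict.fromkeys(second_pass_result) if w not in first_info]
    if not new_words:
        # Nothing new was extracted: the first result is already final.
        return list(first_info)
    # Otherwise the second target entity attribute information is the union.
    return list(first_info) + new_words

print(finalize_attributes(["red"], ["red"]))
print(finalize_attributes(["red"], ["blue", "red", "blue"]))
```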

Preferably, the method further comprises:

step A1, carrying out syntactic analysis on the target text, and determining a complete syntactic tree of the target text;

step A2, storing the final target entity attribute information of the target text into the complete syntax tree, and extracting a path enveloping function tree between every two first target participles and/or second target participles;

step A3, determining a shortest first target path enveloping function tree in a plurality of path enveloping function trees, and acquiring semantic information of two corresponding target first target participles or two corresponding target second target participles or a target first target participle and a target second target participle based on the shortest first target path enveloping function tree;

step A4, storing semantic information of the two target first target participles or the two target second target participles or the target first target participles and the target second target participles under a first node of a shortest first target path enveloping function tree;

step A5, determining the second-shortest target path enveloping function tree among the plurality of path enveloping function trees, and repeating steps A3 to A4 until the semantic information of the first target participle and/or the second target participle corresponding to each target path enveloping function tree is stored under the first node of that tree.
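Steps A1 to A5 can be pictured with the following sketch. It rests on several assumptions that the source does not confirm: the syntax tree is given as a child-to-parent mapping, the "path enveloping function tree" of two words is taken to be the smallest subtree connecting them, its "first node" is taken to be their lowest common ancestor, and path length is the number of edges between the two words. All function names are hypothetical.

```python
from itertools import combinations

def ancestors(node, parent):
    """Return the chain from node up to the root, node included."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def path_tree(a, b, parent):
    """Return (root, length) of the smallest subtree connecting a and b."""
    up_a = ancestors(a, parent)
    up_b = ancestors(b, parent)
    seen = set(up_a)
    root = next(n for n in up_b if n in seen)  # lowest common ancestor
    return root, up_a.index(root) + up_b.index(root)

def annotate(parent, targets, semantics):
    """Process word pairs from shortest to longest path and store semantics at the root."""
    pairs = []
    for a, b in combinations(targets, 2):
        root, length = path_tree(a, b, parent)
        pairs.append((length, a, b, root))
    pairs.sort()  # shortest path first, then second shortest, and so on
    store = {}
    for length, a, b, root in pairs:
        store.setdefault(root, []).append((a, b, semantics[a], semantics[b]))
    return pairs, store

# Toy tree: S -> NP, VP; NP -> red, car; VP -> fast
parent = {"NP": "S", "VP": "S", "red": "NP", "car": "NP", "fast": "VP"}
semantics = {"red": "color", "car": "vehicle", "fast": "speed"}
pairs, store = annotate(parent, ["red", "car", "fast"], semantics)
print(pairs[0], sorted(store))
```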

Preferably, the determining whether the segmented target text contains an entity keyword, if yes, determining a part of speech of a first target segment ahead of the entity keyword, includes:

analyzing the segmented target text into a label tree structure;

carrying out paragraph division on the target text to divide the target text into a plurality of paragraph blocks;

mapping the plurality of paragraph blocks to a plurality of second nodes of the label tree structure, and determining the target number of target second nodes corresponding to each paragraph block;

obtaining the label attribute of each paragraph block according to the node attributes of the target number of target second nodes corresponding to that paragraph block;

determining a target keyword related to the label attribute in each paragraph block according to the label attribute of each paragraph block;

counting the number of all target keywords to obtain M target keywords;

performing document search by using each target keyword in the M target keywords to obtain a target document searched by each target keyword;

analyzing each target document to obtain parameter information of each target document;

filling the target texts according to the parameter information of each target document to obtain M filled target texts;

calculating value indexes generated by each filled target text relative to the target text to obtain M value indexes;

selecting the largest target value index from the M value indexes, determining a first target keyword of the target value index, and confirming the first target keyword as the entity keyword;

determining a plurality of second target keywords preceding the first target keyword;

confirming the second target keyword as the first target word segmentation;

analyzing the target part of speech of each first target word segmentation by using a preset part of speech analysis mode;

and after the analysis is finished, associating each first target participle with the corresponding target part-of-speech.
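The selection of the entity keyword by value index can be sketched as below. The source does not say how the value index is computed, so this example assumes it is the number of new distinct words that a retrieved document adds to the target text. `search` is a hypothetical callable standing in for the document search, and `select_entity_keyword` is an illustrative name.

```python
def select_entity_keyword(text_tokens, keywords, search):
    """Pick the keyword whose retrieved document adds the most new words (assumed value index)."""
    base = set(text_tokens)
    scores = {}
    for kw in keywords:
        doc_tokens = search(kw)               # hypothetical document search
        filled = base | set(doc_tokens)       # the "filled" target text
        scores[kw] = len(filled) - len(base)  # assumed value index
    best = max(scores, key=scores.get)
    return best, scores

# Toy corpus used in place of a real search backend.
corpus = {"car": ["car", "engine", "wheel"], "red": ["red", "color"]}
best, scores = select_entity_keyword(
    ["the", "red", "car"], ["car", "red"], lambda kw: corpus.get(kw, [])
)
print(best, scores)
```

The words preceding `best` in the word segmentation list would then be taken as the first target word segmentations and passed to the part-of-speech analysis.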

Preferably, the determining, according to the tag attribute of each paragraph block, a target keyword in each paragraph block, which is related to the tag attribute, includes:

determining the number of second candidate keywords of each paragraph block;

acquiring a keyword parameter of each second candidate keyword;

calculating, according to the keyword parameter of each second candidate keyword, the target association degree between each second candidate keyword and the target label attribute of the paragraph block to which it belongs:

wherein k_ij denotes the target association degree between the j-th second candidate keyword in the i-th paragraph block and the target label attribute of the i-th paragraph block; S denotes the value of the gentle parameter, taken as 0.6; Q denotes the number of divided paragraph blocks, with S less than Q; B_i denotes the content complexity of the i-th paragraph block, with values in [0.5, 0.9]; G_ij denotes the parameter value of the keyword parameter of the j-th second candidate keyword in the i-th paragraph block; and f_i denotes the parameter value of the target label attribute of the i-th paragraph block;

screening out the first target second candidate keywords whose target association degree is greater than or equal to a preset association degree, to obtain the current number of first target second candidate keywords corresponding to each paragraph block;

analyzing the characteristic factor of each first target second candidate keyword;

calculating the dependency of each paragraph block on the feature factor of each first target second candidate keyword corresponding to each paragraph block by using the feature factor of each first target second candidate keyword corresponding to each paragraph block:

wherein T_iq denotes the dependency of the i-th paragraph block on the feature factor of the q-th first target second candidate keyword in the i-th paragraph block; Y_iq denotes the fuzzy similarity between the feature factor of the q-th first target second candidate keyword in the i-th paragraph block and the parameter of the target label attribute of the i-th paragraph block; k_iq denotes the target association degree between the q-th first target second candidate keyword in the i-th paragraph block and the target label attribute of the i-th paragraph block; R_iq denotes the knowledge expansion coefficient of the feature factor of the q-th first target second candidate keyword in the i-th paragraph block; F() denotes the paragraph block content scoring function; x_i1 denotes the richness of the text content in the i-th paragraph block whose importance is greater than or equal to a second preset threshold; and x_i2 denotes the richness of the text content in the i-th paragraph block whose importance is less than the second preset threshold;

screening out second target second candidate keywords of which the dependency degrees are greater than or equal to a third preset threshold value from the dependency degrees of the feature factors of each paragraph block for each corresponding first target second candidate keyword;

and determining the second target second candidate keyword corresponding to each paragraph block as the target keyword related to the label attribute in each paragraph block.
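The two-stage screening above can be sketched without committing to any particular formula. In this illustration the association degree and the dependency are passed in as callables, because their exact expressions are not reproduced here. The function name `filter_keywords`, the parameter names, and the toy score tables are all hypothetical.

```python
def filter_keywords(blocks, association, dependency, min_assoc, min_dep):
    """Keep, per paragraph block, the candidates that pass both thresholds."""
    result = {}
    for block_id, candidates in blocks.items():
        # Stage 1: association degree against the preset association degree.
        first = [k for k in candidates if association(block_id, k) >= min_assoc]
        # Stage 2: dependency against the third preset threshold.
        second = [k for k in first if dependency(block_id, k) >= min_dep]
        result[block_id] = second
    return result

# Toy scores standing in for the computed association degrees and dependencies.
assoc = {(0, "car"): 0.9, (0, "red"): 0.7, (0, "the"): 0.1,
         (1, "engine"): 0.8, (1, "a"): 0.2}
dep = {(0, "car"): 0.8, (0, "red"): 0.3, (1, "engine"): 0.6}
blocks = {0: ["car", "red", "the"], 1: ["engine", "a"]}

selected = filter_keywords(
    blocks,
    association=lambda b, k: assoc[(b, k)],
    dependency=lambda b, k: dep[(b, k)],
    min_assoc=0.5,
    min_dep=0.5,
)
print(selected)
```

Because the dependency is evaluated only for candidates that survive the first stage, the more expensive second score is computed for fewer keywords.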

A system for extracting entity attribute information from text, the system comprising:

the word segmentation module is used for acquiring a target text and segmenting words of the target text;

the first determining module is used for determining whether the segmented target text contains entity keywords or not, and if so, determining the part of speech of a first target segmented word in front of the entity keywords;

the adding module is used for confirming whether the part of speech of the first target word segmentation meets a preset entity attribute rule, and if so, adding the first target word segmentation into first target entity attribute information;

the acquisition module is used for extracting the entity again in the target text by using the first target entity attribute information as a parameter to acquire an extraction result;

and the second determining module is used for determining second target entity attribute information of the target text according to the extraction result.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flowchart of a method for extracting entity attribute information from a text according to the present invention;

FIG. 2 is another flowchart of the method for extracting entity attribute information from a text according to the present invention;

FIG. 3 is a flowchart of the method for extracting entity attribute information from a text according to another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a system for extracting entity attribute information from a text according to the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

As described in the Background, existing methods mainly extract entity attribute information from text by applying entity extraction rules. The information extracted in this way may be incomplete, and the prior art neither verifies it nor performs a secondary extraction, so the final entity attribute information may be incomplete, which seriously affects the user experience. In order to solve the above problem, this embodiment discloses a method for extracting entity attribute information in a text.

A method for extracting entity attribute information in a text, as shown in fig. 1, includes the following steps:

step S101, acquiring a target text, and segmenting words of the target text;

step S102, confirming whether the segmented target text contains entity keywords or not, and if so, determining the part of speech of a first target segmented word in front of the entity keywords;

step S103, confirming whether the part of speech of the first target word segmentation accords with a preset entity attribute rule, and if so, adding the first target word segmentation into first target entity attribute information;

step S104, entity extraction is carried out again in the target text by using the first target entity attribute information as a parameter to obtain an extraction result;

and step S105, determining second target entity attribute information of the target text according to the extraction result.

The working principle of the above technical scheme is as follows: a target text is acquired and segmented into words. If the segmented target text contains an entity keyword, the part of speech of the first target segmentation preceding the entity keyword is determined. If that part of speech meets the preset entity attribute rule, the first target segmentation is added to the first target entity attribute information. Entity extraction is then performed again in the target text, using the first target entity attribute information as a parameter, to obtain an extraction result, and the second target entity attribute information of the target text is determined according to the extraction result.
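The flow of steps S101 to S105 can be sketched end to end as follows. This is only an illustration under stated assumptions: the text is whitespace-delimited, the preset entity attribute rule is reduced to "the part of speech is an adjective", and the second pass reuses the attributes found in the first pass as anchors and looks at the word before each of them. `LEXICON`, `ATTRIBUTE_POS`, `extract_before`, and `extract_attributes` are hypothetical names.

```python
# Hypothetical lexicon and rule; a real system would use a POS tagger and richer rules.
LEXICON = {"the": "DET", "a": "DET", "fast": "ADJ", "red": "ADJ",
           "blue": "ADJ", "car": "NOUN"}
ATTRIBUTE_POS = {"ADJ"}

def extract_before(tokens, anchors, found):
    """Collect words that directly precede an anchor and satisfy the attribute rule."""
    new = []
    for i, tok in enumerate(tokens):
        if tok in anchors and i > 0:
            prev = tokens[i - 1]
            if (LEXICON.get(prev) in ATTRIBUTE_POS
                    and prev not in found and prev not in new):
                new.append(prev)
    return new

def extract_attributes(text, entity_keyword):
    tokens = text.lower().split()                         # S101: word segmentation
    if entity_keyword not in tokens:                      # S102: keyword present?
        return []
    first = extract_before(tokens, {entity_keyword}, [])  # S102-S103: first attributes
    second = extract_before(tokens, set(first), first)    # S104: extract again
    return first + second                                 # S105: final attributes

print(extract_attributes("the fast red car", "car"))
```

In this toy run the first pass finds only "red", the word directly before the keyword, and the second pass recovers "fast", which the first pass missed.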

The beneficial effects of the above technical scheme are: using the first target entity attribute information as a parameter makes it possible to verify the extracted first target entity attribute information and, at the same time, to perform a secondary entity extraction on the target text. This ensures the completeness of the final entity attribute information and avoids missed entity extractions. It effectively solves the prior-art problem that the extracted entity attribute information is neither verified nor secondarily extracted, which leaves the final entity attribute information incomplete, and it improves the user experience.

In an embodiment, as shown in fig. 2, the obtaining a target text and performing word segmentation on the target text includes:

step S201, checking the integrity of the target text;

step S202, after the check is completed, segmenting the target text sentence by sentence to obtain a first word segmentation set for each sentence that makes up the target text;

step S203, carrying out uniqueness check on the word segmentation result in each first word segmentation set, and deleting second target words which repeatedly appear to obtain a plurality of second word segmentation sets;

and step S204, sorting and displaying the plurality of second word segmentation sets in the order of the sentences they come from to obtain a target word segmentation list of the target text.

The beneficial effects of the above technical scheme are: the integrity of the target text is checked to ensure that the entity attribute information can be accurately and completely extracted, and the integrity of the extracted entity attribute information is also ensured.

In one embodiment, determining whether the segmented target text contains an entity keyword, and if so, determining a part of speech of a first target segment preceding the entity keyword, including:

analyzing the text content of the target text;

The beneficial effects of the above technical scheme are: determining the entity keyword from the first candidate keywords by means of the vector matrix allows the entity keyword to be identified intuitively from the matrix elements, with higher accuracy. Furthermore, analyzing the part of speech of the first target participles with the preset word bank removes the need to analyze the part of speech of each first target participle manually, so part-of-speech analysis is automated and labor is saved.

In one embodiment, as shown in fig. 3, the determining whether the part of speech of the first target participle meets a preset entity attribute rule, and if yes, adding the first target participle to first target entity attribute information includes:

step S301, determining a target language class of the target text;

step S302, a target language class vector model is constructed, and the feature vector of the part of speech of the first target participle is judged by using the target language class vector model;

step S303, extracting the characteristic vector of the part of speech of the first target word segmentation to obtain the target characteristic of the part of speech characteristic vector of the first target word segmentation;

step S304, whether the composition type and the characteristic parameters of the target characteristics accord with the preset entity attribute rule or not is confirmed, and if yes, the first target word segmentation is added into first target entity attribute information of a target text.

The beneficial effects of the above technical scheme are: constructing the target language class vector model after determining the target language class of the target text allows different language class vector models to be built in real time for different languages. This avoids the situation in which a pre-built model cannot judge feature vectors for the language class of the current target text, and so improves practicability. Furthermore, confirming whether the composition type and the feature parameters of the target feature meet the preset entity attribute rule makes it possible to decide accurately, on the basis of the features, whether the first target participle can be added to the first target entity attribute information of the target text. Compared with the prior art, which directly judges whether a participle meets the rule, this is more fine-grained and accurate.

In an embodiment, the extracting the entity again in the target text by using the first target entity attribute information as a parameter to obtain an extraction result includes:

The beneficial effects of the above technical scheme are: by utilizing the text content association degree between the first target entity attribute information and the target text, whether secondary entity extraction needs to be carried out on the target text can be determined, the verification step is further realized, and the accuracy of data and the integrity of final entity attribute information are ensured.

In one embodiment, the determining second target entity attribute information of the target text according to the extraction result includes:

The beneficial effects of the above technical scheme are: the final target entity attribute information of the target text is made more complete.

In one embodiment, the method further comprises:

The beneficial effects of the above technical scheme are: associating the first node of each enveloping tree with its two target participles links every pair of target participles. This ensures that no target participle is overlooked when entity attribute information is searched and that the complete target text can be presented in the search, which further improves the user experience.

In one embodiment, the determining whether the segmented target text contains an entity keyword, and if so, determining a part of speech of a first target segment preceding the entity keyword includes:

analyzing the segmented target text into a label tree structure;

counting the number of all target keywords to obtain M target keywords;

confirming the second target keyword as the first target word segmentation;

The beneficial effects of the above technical scheme are: determining the target keywords in each paragraph block from the label attributes of the paragraph blocks breaks the target text down from a single whole into several parts, so that the target keywords of each paragraph block are obtained accurately and the accuracy of determining the entity keyword of the target text is improved. Furthermore, the value index obtained for each target keyword, by searching documents with that keyword and filling the target text, identifies the target keyword that matters most to the target text. That keyword can be determined as the entity keyword, and the value indexes provide data support for this decision.

In one embodiment, the determining a target keyword in each paragraph block related to the tag attribute according to the tag attribute of each paragraph block includes:

determining the number of second candidate keywords of each paragraph block;

acquiring a keyword parameter of each second candidate keyword;

wherein k_ij denotes the target association degree between the j-th second candidate keyword in the i-th paragraph block and the target label attribute of the i-th paragraph block; S denotes the value of the gentle parameter, taken as 0.6; Q denotes the number of divided paragraph blocks, with S less than Q; B_i denotes the content complexity of the i-th paragraph block, with values in [0.5, 0.9]; G_ij denotes the parameter value of the keyword parameter of the j-th second candidate keyword in the i-th paragraph block; and f_i denotes the parameter value of the target label attribute of the i-th paragraph block;

The beneficial effects of the above technical scheme are: calculating the dependency of each paragraph block on the feature factor of each of its first target second candidate keywords identifies the second candidate keywords that matter most to each paragraph block. This further narrows the search range and improves the efficiency of the subsequent selection of entity keywords.

The embodiment also discloses a system for extracting entity attribute information in a text, as shown in fig. 4, the system includes:

a word segmentation module 401, configured to obtain a target text and segment words of the target text;

a first determining module 402, configured to determine whether the segmented target text includes an entity keyword, and if so, determine the part of speech of a first target word segment preceding the entity keyword;

an adding module 403, configured to determine whether the part of speech of the first target word segment meets a preset entity attribute rule, and if so, add the first target word segment to first target entity attribute information;

an obtaining module 404, configured to perform entity extraction again in the target text by using the first target entity attribute information as a parameter, and obtain an extraction result;

a second determining module 405, configured to determine second target entity attribute information of the target text according to the extraction result.
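The cooperation of modules 401 to 405 can be sketched in Python as follows. This is only a toy illustration under stated assumptions, not the implementation of the invention: word segmentation is approximated by splitting on whitespace, the part-of-speech lexicon POS_LEXICON and the rule set ALLOWED_ATTRIBUTE_POS are hypothetical stand-ins for a real tagger and for the preset entity attribute rule, and "extraction again with the first attribute information as a parameter" is interpreted here as looking up each first-pass attribute in the text and checking which word it precedes.

```python
# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
POS_LEXICON = {
    "the": "det", "a": "det", "red": "adj", "large": "adj",
    "car": "noun", "is": "verb", "parked": "verb", "outside": "adv",
}
# Assumed preset entity attribute rule: only adjectives count as attributes.
ALLOWED_ATTRIBUTE_POS = {"adj"}


def segment(text):
    """Module 401: naive word segmentation (whitespace split, for illustration)."""
    for mark in ".,;!?":
        text = text.replace(mark, " ")
    return text.lower().split()


def first_pass(tokens, entity_keyword):
    """Modules 402 and 403: check the word segment preceding each entity keyword."""
    attributes = []
    for idx, token in enumerate(tokens):
        if token == entity_keyword and idx > 0:
            previous = tokens[idx - 1]
            pos = POS_LEXICON.get(previous)
            if pos in ALLOWED_ATTRIBUTE_POS and previous not in attributes:
                attributes.append(previous)
    return attributes


def extract_again(tokens, first_attributes):
    """Module 404: re-scan the text using each first-pass attribute as a parameter."""
    result = {attr: [] for attr in first_attributes}
    for idx, token in enumerate(tokens[:-1]):
        if token in result:
            # Record the word that directly follows the attribute.
            result[token].append(tokens[idx + 1])
    return result


def extract_entity_attributes(text, entity_keyword):
    """Module 405: keep the attributes confirmed by the second extraction."""
    tokens = segment(text)
    first_attributes = first_pass(tokens, entity_keyword)
    extraction_result = extract_again(tokens, first_attributes)
    return [a for a in first_attributes if entity_keyword in extraction_result[a]]


print(extract_entity_attributes(
    "The red car is parked outside. A large car passed quickly.", "car"))
```

In this sketch the second pass only verifies the first-pass attributes. A fuller version could also collect attributes that the first pass missed, which is the completeness benefit the description attributes to the secondary extraction.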

The working principle and the beneficial effects of the above technical solution have been explained in the description of the method above and are not repeated here.

It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely refer to different stages of application.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for extracting entity attribute information in a text is characterized by comprising the following steps:

acquiring a target text, and segmenting words of the target text;

2. The method for extracting entity attribute information from a text according to claim 1, wherein the obtaining a target text and performing word segmentation on the target text comprises:

checking the integrity of the target text;

3. The method of claim 2, wherein determining whether the segmented target text contains an entity keyword, and if so, determining a part of speech of a first target segment preceding the entity keyword comprises:

analyzing the text content of the target text;

4. The method for extracting entity attribute information in a text according to claim 1, wherein the determining whether the part of speech of the first target word segment meets a preset entity attribute rule, and if so, adding the first target word segment to the first target entity attribute information comprises:

determining a target language class of the target text;

5. The method for extracting entity attribute information in a text according to claim 1, wherein the extracting entities in the target text again by using the first target entity attribute information as a parameter to obtain an extraction result comprises:

6. The method of claim 1, wherein the determining second target entity attribute information of the target text according to the extraction result comprises:

7. The method for extracting entity attribute information from text according to claim 6, wherein the method further comprises:

8. The method of claim 1, wherein the determining whether the segmented target text contains an entity keyword, and if so, determining a part of speech of a first target segment preceding the entity keyword comprises:

analyzing the segmented target text into a label tree structure;

counting the number of all target keywords to obtain M target keywords;

confirming the second target keyword as the first target word segment;

9. The method of claim 8, wherein the determining the target keyword related to the tag attribute in each paragraph block according to the tag attribute of each paragraph block comprises:

determining the number of second candidate keywords of each paragraph block;

acquiring a keyword parameter of each second candidate keyword;

wherein k_ij denotes the target relevance between the jth second candidate keyword in the ith paragraph block and the target label attribute of the ith paragraph block; S denotes the parameter value of the gentle (smoothing) parameter and takes the value 0.6; Q denotes the number of divided paragraph blocks, with S less than Q; B_i denotes the content complexity of the ith paragraph block, with a value in the range [0.5, 0.9]; G_ij denotes the parameter value of the keyword parameter of the jth second candidate keyword in the ith paragraph block; and f_i denotes the parameter value of the target label attribute of the ith paragraph block;

10. A system for extracting entity attribute information from text, the system comprising: