CN109783651B - Method and device for extracting entity related information, electronic equipment and storage medium

Info

Publication number: CN109783651B
Application number: CN201910087401.8A
Authority: CN
Inventors: 贺薇; 李双婕; 史亚冰; 梁海金; 张扬; 朱勇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2022-03-04
Anticipated expiration: 2039-01-29
Also published as: CN109783651A

Abstract

Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a computer-readable storage medium for extracting entity-related information. In the method, a computing device obtains a plurality of candidate texts associated with a predetermined entity and a predetermined attribute. Further, the computing device determines at least one target text from the plurality of candidate texts based on semantics of an entity-attribute pair formed by the predetermined entity and the predetermined attribute. Further, the computing device determines an attribute value of a predetermined attribute of the predetermined entity based on the at least one target text. Embodiments of the present disclosure may improve timeliness and reduce labor costs when extracting entity-related information.

Description

Method and device for extracting entity related information, electronic equipment and storage medium

Technical Field

Embodiments of the present disclosure relate generally to the field of information processing technology, and more particularly, to a method, an apparatus, an electronic device, and a computer-readable storage medium for extracting entity-related information.

Background

Conventionally, there are two ways to extract entity-related information. One approach is purely open extraction, which consists primarily of open extraction from free text and semi-structured web pages. That is, semantic relationships between entities are openly mined from free text and semi-structured web pages on the internet, where a semi-structured web page is a web page with a certain structure expressed in Hypertext Markup Language (HTML). For example, from the text "Yao Ming, born on September 12, 1980 in Xuhui District, Shanghai", triples such as (Yao Ming, date of birth, September 12, 1980) and (Yao Ming, place of birth, Xuhui District, Shanghai) are mined directly. The other approach is structured extraction, which mainly refers to extracting entity-related information through manually configured mapping relationships. That is, for fixed websites of a fixed vertical category, a plurality of mapping templates are manually configured for each website, for example by manually defining regular-expression templates for web pages, XML Path Language (XPath) expressions, and the like, so as to directionally extract data of a fixed structure from the web pages.

However, these conventional schemes for extracting entity-related information have various problems and disadvantages, and the performance requirements for extracting entity-related information cannot be met in many occasions, thereby resulting in poor user experience in applications such as entity recommendation.

Disclosure of Invention

Embodiments of the present disclosure relate to a method, an apparatus, an electronic device, and a computer-readable storage medium for extracting entity-related information.

In a first aspect of the disclosure, a method of extracting entity-related information is provided. The method comprises the following steps: a plurality of candidate texts associated with a predetermined entity and a predetermined attribute are obtained. The method further comprises the following steps: at least one target text is determined from the plurality of candidate texts based on semantics of an entity-attribute pair formed by a predetermined entity and a predetermined attribute. The method further comprises the following steps: based on the at least one target text, an attribute value of a predetermined attribute of the predetermined entity is determined.

In a second aspect of the present disclosure, an apparatus for extracting entity-related information is provided. The apparatus includes: a candidate text obtaining module configured to obtain a plurality of candidate texts associated with a predetermined entity and a predetermined attribute. The apparatus also includes: a target text determination module configured to determine at least one target text from the plurality of candidate texts based on semantics of an entity-attribute pair formed by the predetermined entity and the predetermined attribute. The apparatus further includes: an attribute value determination module configured to determine an attribute value of the predetermined attribute of the predetermined entity based on the at least one target text.

In a third aspect of the disclosure, an electronic device is provided. The electronic device includes one or more processors; and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.

In a fourth aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, implements the method of the first aspect.

It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other objects, features and advantages of the embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 illustrates a schematic diagram of an example environment in which some embodiments of the present disclosure can be implemented;

FIG. 2 shows a schematic flow diagram of a method of extracting entity-related information according to an embodiment of the present disclosure;

FIG. 3 shows a schematic block diagram of an apparatus for extracting entity-related information according to an embodiment of the present disclosure;

FIG. 4 shows a schematic block diagram of a general technical framework for extracting attribute values of entity attributes, according to an embodiment of the present disclosure; and

FIG. 5 shows a schematic block diagram of a device that may be used to implement embodiments of the present disclosure.

Throughout the drawings, the same or similar reference numerals are used to designate the same or similar components.

Detailed Description

The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments shown in the drawings. It is understood that these specific embodiments are described merely to enable those skilled in the art to better understand and implement the present disclosure, and are not intended to limit the scope of the present disclosure in any way.

As mentioned above, conventional entity relationship extraction methods mainly include a purely open extraction method and a structured extraction method. However, both of these conventional extraction methods have problems and disadvantages. For example, the purely open extraction method is mainly used for batch extraction of knowledge, but its extraction delay for new entities and newly added knowledge is large and its update time is long, so the need for timely knowledge updates cannot be met. On the other hand, the main disadvantage of the structured extraction method is its high labor cost: extraction templates must be manually configured according to the web page structure, and only a limited degree of directional extraction can be achieved. By configuring a template for a target category, targeting at category granularity can be achieved, but targeting at "entity + attribute" granularity cannot.

In view of the above-mentioned problems and potentially other problems with conventional approaches, embodiments of the present disclosure propose a method, apparatus, electronic device, and computer-readable storage medium for extracting entity-related information to improve timeliness and reduce labor costs when extracting entity-related information. Specifically, the embodiment of the present disclosure provides a directed knowledge extraction technique, which is mainly used for extracting corresponding attribute values in a targeted manner under the condition of a given "entity-attribute" binary group. The proposed directional extraction technology aims to directionally extract high-confidence entity relation data from a text library (such as massive internet texts) through an information extraction technology.

From the perspective of knowledge graph construction, the proposed directional extraction technique can extract relationship attribute values that are missing for an entity, can be used to improve the connectivity of the knowledge graph, and can efficiently improve the richness and completeness of the knowledge in the knowledge graph. From the perspective of product application, the supplemented entity relationship data can directly meet users' needs for entity association and can effectively improve the efficiency with which people search for and browse entities, thereby improving the user experience. Typical applications may include entity question answering, entity recommendation, and the like.

Compared with conventional entity information extraction schemes, embodiments of the present disclosure address the timeliness problem on the one hand. When a new entity or a trending entity appears within a short time, the embodiments, owing to their short update time, can quickly extract the missing attribute values of that entity, supplement the entity's attributes, and improve the knowledge graph's coverage of time-sensitive entity-attribute values. On the other hand, embodiments of the present disclosure reduce labor costs by uniformly modeling all "entity-attribute value" relationships, for example using deep learning models, which eliminates the need for a deep understanding of domain knowledge and for designing complex high-level features, and makes the system easier to maintain and extend. Several embodiments of the present disclosure are described below in conjunction with the figures.

Fig. 1 illustrates a schematic diagram of an example environment (or system) 100 in which some embodiments of the present disclosure can be implemented. As shown in FIG. 1, in the example environment 100, the predetermined entity 105 and the predetermined attribute 110 may be input into the computing device 120 to obtain, by the computing device 120, an attribute value 160 of the predetermined attribute 110 of the predetermined entity 105, for example, from text in a text library (not shown). In some embodiments, the text library may include a collection of texts obtained from the internet. In other embodiments, the text corpus may include any suitable collection of text describing any attribute of any entity, including but not limited to collections of text for various purposes and sources.

In the context of the present disclosure, the term "entity" refers to something that is distinguishable and exists independently, such as a person, a city, a plant, a commodity, and so forth. Everything in the world is made up of specific things, all of which can be referred to as entities, for example "China", "USA", "Japan", etc. The term "attribute" refers to a property of an entity or a relationship between an entity and another entity. For example, an attribute may refer to a person's height, gender, place of birth, and so forth. An attribute may also refer to a relationship of an entity to another entity, such as husband, father, friend, etc. The term "attribute value" refers to the specific content of an entity's attribute, or to another entity that has some relationship to the entity. For example, the attribute value of the attribute "gender" of a person may be "male". As another example, the attribute value of a relationship attribute (e.g., wife) of a certain entity (e.g., Yao Ming) may be another entity (e.g., Ye Li). It should be understood that the above definitions of various terms are merely exemplary to aid in understanding the present disclosure and are not intended to limit the scope of the present disclosure in any way. In other embodiments, various terms used herein will conform to technical meanings commonly understood by those skilled in the art.

With continued reference to FIG. 1, computing device 120 may obtain a plurality of candidate texts 140-1 through 140-N (which may be collectively referred to hereinafter as a plurality of candidate texts 140) associated with predetermined entity 105 and predetermined attribute 110 from a text repository based on the input predetermined entity 105 and predetermined attribute 110. Because the plurality of candidate texts 140 are related to the predetermined entity 105 and the predetermined attribute 110, the computing device 120 has the potential to extract the attribute value 160 from the plurality of candidate texts 140. Further, to improve the performance and robustness of the system 100, the computing device 120 may filter the plurality of candidate texts 140. To this end, computing device 120 may determine at least one target text 150-1 to 150-M (which may be collectively referred to hereinafter as a plurality of target texts 150) from a plurality of candidate texts 140 for extracting an attribute value 160 based on semantics of an entity-attribute pair consisting of predetermined entity 105 and predetermined attribute 110, where M and N are both positive integers and M may be less than or equal to N. Computing device 120 may then determine an attribute value 160 of predetermined attribute 110 of predetermined entity 105 based on the determined at least one target text 150.

It will be appreciated that the computing device 120 may be any type of mobile terminal, fixed terminal, or portable terminal including a mobile phone, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including accessories and peripherals of these devices, or any combination thereof. It is also contemplated that computing device 120 can support any type of interface to the user (such as "wearable" circuitry, etc.). Example operations for extracting entity-related information in accordance with embodiments of the present disclosure are described below in conjunction with fig. 2.

FIG. 2 shows a schematic flow diagram of a method 200 of extracting entity-related information according to an embodiment of the present disclosure. In some embodiments, the method 200 may be implemented by the computing device 120 of FIG. 1, for example, by a processor or processing unit of the computing device 120. In other embodiments, all or part of the method 200 may also be implemented by a computing device separate from the computing device 120, or by other units in the example environment 100. For ease of discussion, the method 200 will be described in conjunction with FIG. 1.

At 210, computing device 120 obtains a plurality of candidate texts 140 associated with the predetermined entity 105 and the predetermined attribute 110. It should be appreciated that computing device 120 may obtain the plurality of candidate texts 140 using any suitable approach, as long as the plurality of candidate texts 140 are associated with the predetermined entity 105 and the predetermined attribute 110; embodiments of the present disclosure are not limited in this respect. For example, for a particular attribute of a particular entity, there may already be a collection of texts that introduce or describe that attribute of that entity. In this case, computing device 120 may obtain the plurality of candidate texts 140 by importing that collection of texts.

More generally, in some embodiments, the computing device 120 may obtain the plurality of candidate texts 140 by searching in a text library. For example, computing device 120 may determine an entity term corresponding to predetermined entity 105 and an attribute term corresponding to predetermined attribute 110. Computing device 120 may then retrieve a plurality of candidate texts 140 from the text corpus using the determined entity terms and attribute terms. In this manner, the computing device 120 may find text in the text repository that is related to the predetermined entity 105 and the predetermined attribute 110. As noted above, the text corpus used for retrieval may include a collection of texts obtained from the internet. Additionally or alternatively, the corpus of text for retrieval may include any suitable collection of text describing any attribute of any entity, including but not limited to collections of text for various purposes and sources.

In some embodiments, the entity terms used by the computing device 120 may include the name of the predetermined entity 105, its aliases, other keywords that may refer to the predetermined entity 105, and the like, as well as any combination thereof. Similarly, the attribute terms used by the computing device 120 may include the name of the predetermined attribute 110, its aliases, its lead words, other keywords related to the predetermined attribute 110, and the like, as well as any combination thereof. As used herein, a lead word of an attribute is a word that introduces a certain attribute of an entity. For example, the lead word "marriage" may introduce the attribute "spouse" of an entity. In this manner, the computing device 120 can avoid missing text related to the predetermined entity 105 and the predetermined attribute 110 in the retrieval.
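By way of illustration only, the term-based retrieval described above may be sketched as follows. The sketch assumes that the text library is a small in-memory list of strings and uses plain substring matching as a stand-in for a real retrieval system; the function name `retrieve_candidates` and the sample terms and texts are hypothetical and are not part of the disclosure.

```python
def retrieve_candidates(text_library, entity_terms, attribute_terms):
    """Return texts that mention at least one entity term and one attribute term."""
    candidates = []
    for text in text_library:
        has_entity = any(term in text for term in entity_terms)
        has_attribute = any(term in text for term in attribute_terms)
        if has_entity and has_attribute:
            candidates.append(text)
    return candidates


# Hypothetical terms: a name and an alias for the entity; a name, an alias
# and a lead word ("born") for the attribute.
entity_terms = ["Yao Ming", "Yao"]
attribute_terms = ["place of birth", "birthplace", "born"]

# Hypothetical in-memory text library standing in for internet-scale text.
text_library = [
    "Yao Ming, born on September 12, 1980 in Xuhui District, Shanghai",
    "Yao Ming is a former basketball player.",
    "Many people were born in Shanghai.",
]

candidates = retrieve_candidates(text_library, entity_terms, attribute_terms)
```

In this sketch only the first text is retained: the second mentions the entity but no attribute term, and the third contains the lead word "born" but no entity term.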

In some embodiments, to extract relevant information for trending entities or new entities in a targeted manner, the computing device 120 may determine a newly appearing entity or an entity whose search frequency is above a threshold as the predetermined entity 105. As an example of a trending entity, assume that a person currently attracting considerable public attention (such as a star) has a high search frequency on a search platform, which indicates that the person has become a highly popular entity within a short time. In this case, the computing device 120 may treat the person as the predetermined entity 105. To this end, the computing device 120 may determine whether an entity has a high search frequency by comparing the search frequency of the entity to a predetermined threshold. It will be appreciated that the threshold here may be reasonably selected depending on the particular system environment and design requirements. Additionally, as an example of a new entity, a newly built casino that is about to open to the public is a newly emerging entity. In this case, computing device 120 may treat the casino as the predetermined entity 105.

After determining the predetermined entity 105, the computing device 120 may determine the predetermined attribute 110 based on the predetermined entity 105. For example, where a star is determined as the predetermined entity 105, the computing device 120 may accordingly determine the predetermined attribute 110 as an attribute related to the star, such as height, weight, place of birth, alma mater, boyfriend, or the like. As another example, where a new casino is determined as the predetermined entity 105, the computing device 120 may accordingly determine the predetermined attribute 110 as an attribute related to the casino, such as a specific address, floor space, hours of business, attractions, and so forth.
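As a minimal sketch of how predetermined entities and attributes might be chosen in this way, the following assumes that search frequencies are available as a dictionary and that each entity has a known type. All names, counts, the threshold, and the per-type attribute lists are hypothetical values chosen for illustration.

```python
# Hypothetical attribute lists per entity type, loosely following the examples above.
ATTRIBUTES_BY_TYPE = {
    "star": ["height", "weight", "place of birth"],
    "casino": ["address", "floor space", "hours of business"],
}


def select_predetermined_entities(search_frequency, known_entities, threshold):
    """Pick entities that are new or whose search frequency exceeds the threshold."""
    return [
        entity
        for entity, frequency in search_frequency.items()
        if entity not in known_entities or frequency > threshold
    ]


def entity_attribute_pairs(entities, entity_types):
    """Expand each selected entity into (entity, attribute) pairs based on its type."""
    return [
        (entity, attribute)
        for entity in entities
        for attribute in ATTRIBUTES_BY_TYPE.get(entity_types[entity], [])
    ]


# Hypothetical inputs: star_a is trending, person_b is neither new nor trending,
# and casino_c is new (absent from the set of known entities).
search_frequency = {"star_a": 12000, "person_b": 40, "casino_c": 15}
known_entities = {"star_a", "person_b"}
entity_types = {"star_a": "star", "person_b": "star", "casino_c": "casino"}

selected = select_predetermined_entities(search_frequency, known_entities, threshold=1000)
pairs = entity_attribute_pairs(selected, entity_types)
```

Each resulting pair can then serve as the "entity-attribute" input for the retrieval and extraction steps.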

At 220, computing device 120 determines at least one target text 150 from the plurality of candidate texts 140 based on semantics of an entity-attribute pair formed by predetermined entity 105 and predetermined attribute 110. It will be understood that although the plurality of candidate texts 140 are associated with the predetermined entity 105 and the predetermined attribute 110, this does not mean that the plurality of candidate texts 140 are necessarily semantically related to the entity-attribute pair consisting of the predetermined entity 105 and the predetermined attribute 110. For example, a certain text might include the entity "Yao Ming" and the attribute "height", but the semantics of the text do not necessarily relate to "Yao Ming's height"; the text might merely mention Yao Ming and describe another person's height. Thus, by selecting at least one target text 150 based on the semantics of the entity-attribute pair of predetermined entity 105 and predetermined attribute 110, computing device 120 may filter the candidate texts 140 obtained, thereby reducing the amount of text used to extract the attribute value 160 and retaining only texts that are semantically relevant to the entity-attribute pair and from which the attribute value 160 can be extracted, which improves the performance and robustness of system 100.

In some embodiments, for a given candidate text 140-1 of the plurality of candidate texts 140, the computing device 120 may process the candidate text 140-1 to determine the semantics of the candidate text 140-1. For example, computing device 120 may obtain the segmentation and part-of-speech recognition results of candidate text 140-1 via a part-of-speech recognition tool, obtain the dependency recognition results of sentences of candidate text 140-1 via a dependency analysis tool, and obtain the entity recognition and superordinate concept recognition results of candidate text 140-1 via a subgraph association tool. It should be appreciated that computing device 120 may also determine the semantics of candidate text 140-1 by any other semantic analysis method.

Next, computing device 120 may determine a similarity between the semantics of candidate text 140-1 and the semantics of the entity-attribute pair of predetermined entity 105 and predetermined attribute 110. For example, computing device 120 may invoke a semantically-related text validity classification model (or operator) to perform the semantic relevance calculation, and invoke a classification algorithm to determine whether the semantics of candidate text 140-1 are related to the semantics of the entity-attribute pair consisting of predetermined entity 105 and predetermined attribute 110, thereby filtering out text from candidate text 140 that is not related to the semantics. It should be appreciated that the computing device 120 may also determine the semantic relatedness by any other method of determining semantic similarity. Then, if the determined semantic similarity is above a threshold, computing device 120 may select candidate text 140-1 as one of the at least one target text 150. It will be appreciated that the threshold values herein may be reasonably selected depending on the particular system environment and design requirements.
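The threshold-based selection described above may be sketched as follows. The disclosure contemplates a semantic relevance classification model; this sketch swaps in a crude character-distance heuristic (`toy_similarity`) purely so that the example runs, and that heuristic, the function names, the sample texts, and the threshold of 0.4 are all assumptions made for illustration.

```python
def toy_similarity(text, entity, attribute):
    """Crude stand-in for a semantic relevance model.

    Scores a text higher when the entity and attribute mentions are close
    together; a real system would use a trained classifier instead.
    """
    lowered = text.lower()
    entity_pos = lowered.find(entity.lower())
    attribute_pos = lowered.find(attribute.lower())
    if entity_pos < 0 or attribute_pos < 0:
        return 0.0
    gap = abs(attribute_pos - entity_pos)
    return 1.0 / (1.0 + gap / 10.0)


def select_target_texts(candidates, entity, attribute, similarity, threshold):
    """Keep candidate texts whose similarity to the entity-attribute pair exceeds the threshold."""
    return [
        text
        for text in candidates
        if similarity(text, entity, attribute) > threshold
    ]


# Hypothetical candidate texts: both mention the entity and the attribute,
# but only the first is about the entity's own attribute.
candidates = [
    "Yao Ming's height is listed on his profile.",
    "Yao Ming attended the event, where the host joked about the height of another guest.",
]

targets = select_target_texts(candidates, "Yao Ming", "height", toy_similarity, threshold=0.4)
```

Passing the scoring function as a parameter keeps the selection logic independent of whichever semantic model is actually used.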

Further, in some embodiments, prior to determining the semantics of the plurality of candidate texts 140, the computing device 120 may also perform a preliminary filtering of the plurality of candidate texts 140 to filter out candidate texts 140 that are not related to the semantics of the entity-attribute pair of the predetermined entity 105 and the predetermined attribute 110. For example, the computing device 120 may perform the preliminary filtering by checking features such as whether a candidate text 140 contains a name of the predetermined entity 105 (including the name and aliases of the entity, etc.), whether it contains a name of the predetermined attribute 110 (including the name, aliases, lead words, etc.), whether the length of the text is within a predefined length interval, and the proportion of Chinese characters in the text, thereby excluding candidate texts 140 that are clearly unrelated to the semantics of the entity-attribute pair of the predetermined entity 105 and the predetermined attribute 110.
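A minimal sketch of such a preliminary filter is given below. It assumes that "Chinese characters" means code points in the CJK Unified Ideographs range U+4E00 to U+9FFF, and the length interval and minimum ratio are hypothetical defaults; the disclosure does not specify these values.

```python
def chinese_char_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs range U+4E00..U+9FFF."""
    if not text:
        return 0.0
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return chinese / len(text)


def passes_prefilter(text, entity_terms, attribute_terms,
                     min_len=5, max_len=200, min_chinese_ratio=0.5):
    """Cheap surface checks applied before any semantic analysis."""
    if not (min_len <= len(text) <= max_len):
        return False
    if chinese_char_ratio(text) < min_chinese_ratio:
        return False
    if not any(term in text for term in entity_terms):
        return False
    return any(term in text for term in attribute_terms)


# Hypothetical terms: "姚明" (Yao Ming); "出生地" (place of birth) and the lead word "出生" (born).
entity_terms = ["姚明"]
attribute_terms = ["出生地", "出生"]

texts = [
    "姚明出生于上海市徐汇区。",                  # "Yao Ming was born in Xuhui District, Shanghai."
    "姚明 出生 1980 http://example.com/abc",  # mostly non-Chinese characters
    "上海是一座城市。",                          # "Shanghai is a city." (no entity term)
]

results = [passes_prefilter(t, entity_terms, attribute_terms) for t in texts]
```

Because these checks are inexpensive, running them first reduces the number of texts passed to the costlier semantic relevance step.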

At 230, the computing device 120 determines an attribute value 160 of the predetermined attribute 110 of the predetermined entity 105 based on the at least one target text 150. It should be understood that computing device 120 may extract attribute values 160 from at least one target text 150 using any existing extraction method or a future developed extraction method, embodiments of the present disclosure are not limited in this respect. For example, the computing device 120 may extract the attribute values 160 from the at least one target text 150 using a deep learning based extraction model. Additionally or alternatively, computing device 120 may also extract attribute values 160 from at least one target text 150 using other types of extraction models.

In some embodiments, to improve the accuracy of the extraction of attribute values 160, computing device 120 may extract a plurality of candidate attribute values from at least one target text 150 based on predetermined entities 105 and predetermined attributes 110 using a plurality of different extraction models having different model structures. It will be appreciated that the plurality of different extraction models may include any model capable of extracting attribute values from a given text according to predetermined entities and predetermined attributes, for example, a plurality of neural network-based extraction models having different neural network structures.

By way of example, the computing device 120 may use three different extraction models. The first extraction model may be a slot filling model, which is a deep learning model built on a deep learning computation framework (e.g., the PaddlePaddle platform) and performs attribute value extraction as a slot filling task (given a known entity and attribute, extract the attribute value). The other two extraction models may be reading comprehension models with two different structures, which perform attribute value extraction as a reading comprehension task. The two reading comprehension models may convert the entity and the attribute into a query and take the query and a text as model input, so as to mark the start position and the end position of the attribute value in the text. It will be understood that the specific models and number of models given herein are exemplary only and are not intended to limit the scope of the present disclosure in any way. In other embodiments, computing device 120 may use any number of different models to extract the attribute value 160.
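The input and output convention of the reading comprehension models may be illustrated with the following sketch. No trained model is involved: the start and end positions that a model would predict are hard-coded here, and the query template, the function names, and the use of inclusive character positions are assumptions made for illustration.

```python
def build_query(entity, attribute):
    """Turn an entity-attribute pair into a natural-language query (hypothetical template)."""
    return f"What is the {attribute} of {entity}?"


def decode_span(text, start, end):
    """Return the substring marked by inclusive start and end character positions."""
    return text[start:end + 1]


text = "Yao Ming was born in Shanghai."
query = build_query("Yao Ming", "place of birth")

# A trained reading comprehension model would predict these positions from
# (query, text); they are derived by string search here only to keep the
# sketch runnable.
start = text.index("Shanghai")
end = start + len("Shanghai") - 1

answer = decode_span(text, start, end)
```

Here the span marked in the text decodes to the candidate attribute value "Shanghai".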

After extracting the plurality of candidate attribute values using the extraction models having different model structures, the computing device 120 may determine respective confidences of the plurality of candidate attribute values. As an example, assuming that the predetermined entity 105 is "Yao Ming" and the predetermined attribute 110 is "place of birth," the plurality of candidate attribute values extracted from the at least one target text 150 by the plurality of different models may be China, the United States, Beijing, and Shanghai. In this case, the computing device 120 may determine the confidence of each of the four candidate attribute values, i.e., the probability that each is the correct place of birth of Yao Ming.

It will be appreciated that the computing device 120 may determine the confidence of a candidate attribute value in any suitable manner, including but not limited to obtaining it from an attribute value extraction model, verifying it against other repositories, determining its relevance to other attributes of the predetermined entity, and so forth. For example, in the example above regarding Yao Ming's place of birth, the computing device 120 may determine that the confidences of China, the United States, Beijing, and Shanghai are 0.7, 0.3, 0.5, and 0.8, respectively.

Computing device 120 may then select, from the plurality of candidate attribute values, an attribute value whose confidence is above a threshold. As an example, the threshold here may be set to 0.7, so the computing device 120 may select "Shanghai" (confidence 0.8) as the attribute value of the predetermined attribute "place of birth" of the predetermined entity "Yao Ming". It should be understood that the specific values and location names given herein are exemplary only and are not intended to limit the scope of the present disclosure in any way. In addition, the threshold may be chosen appropriately according to the specific system environment and design requirements. As an alternative to selecting an attribute value whose confidence is above a threshold, computing device 120 may also select the attribute value with the highest confidence from the plurality of candidate attribute values.
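The two selection strategies described above may be sketched using the confidences from the example. The function names are hypothetical, and "above the threshold" is read here as a strict comparison, which is an interpretation rather than something the text states.

```python
def select_above_threshold(confidences, threshold):
    """Return the candidate values whose confidence is strictly above the threshold."""
    return [value for value, conf in confidences.items() if conf > threshold]


def select_highest(confidences):
    """Return the single candidate value with the highest confidence."""
    return max(confidences, key=confidences.get)


# Confidences taken from the example in the text.
confidences = {"China": 0.7, "United States": 0.3, "Beijing": 0.5, "Shanghai": 0.8}

above = select_above_threshold(confidences, threshold=0.7)
best = select_highest(confidences)
```

With a threshold of 0.7 and a strict comparison, "China" (0.7) is excluded and only "Shanghai" (0.8) remains, which matches the value selected by the highest-confidence strategy.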

In some embodiments, the at least one target text 150 may include a plurality of target texts 150-1 through 150-M. In this case, different extraction models may extract the same candidate attribute value from different target texts. Thus, to determine a respective confidence for each of the plurality of candidate attribute values, the computing device 120 may determine, for a given candidate attribute value of the plurality of candidate attribute values, a plurality of pairs, each consisting of an extraction model and a target text from which that extraction model extracted the given candidate attribute value.

Continuing with the example used above, without loss of generality, assume that the candidate attribute value "Shanghai" is extracted from the first target text 150-1 and the second target text 150-2 by the first extraction model, from the second target text 150-2 and the fourth target text 150-4 by the second extraction model, and from the third target text 150-3 by the third extraction model. In this case, for the candidate attribute value "Shanghai," the computing device 120 may determine the following pairs from which the attribute value "Shanghai" is extracted: the first extraction model and the first target text 150-1, the first extraction model and the second target text 150-2, the second extraction model and the second target text 150-2, the second extraction model and the fourth target text 150-4, and the third extraction model and the third target text 150-3.

Next, the computing device 120 may obtain a plurality of confidence scores for the candidate attribute value, the plurality of confidence scores being respectively associated with the plurality of pairs. For example, continuing the example above, for the candidate attribute value "Shanghai," the first extraction model gives a confidence score of 0.6 with respect to the first target text 150-1, the first extraction model gives a confidence score of 0.5 with respect to the second target text 150-2, the second extraction model gives a confidence score of 0.8 with respect to the second target text 150-2, the second extraction model gives a confidence score of 0.7 with respect to the fourth target text 150-4, and the third extraction model gives a confidence score of 0.6 with respect to the third target text 150-3. In this case, the computing device 120 may obtain the confidence scores 0.6, 0.5, 0.8, 0.7, and 0.6 for the candidate attribute value "Shanghai".

The computing device 120 may then add the plurality of confidence scores for the given candidate attribute value to arrive at a confidence for that candidate attribute value. In the above example, the computing device 120 may add the confidence scores 0.6, 0.5, 0.8, 0.7, and 0.6 for the attribute value "Shanghai" to determine that the confidence of the candidate attribute value "Shanghai" is 3.2. In this manner, the computing device 120 may comprehensively evaluate the confidence of a given candidate attribute value in a quantitative manner. Similarly, the computing device 120 may calculate the confidence of the other candidate attribute values (such as China, USA, and Beijing), and finally select an attribute value whose confidence is above a threshold.
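As a purely illustrative, non-limiting sketch, the summation described above may be expressed as follows. The "Shanghai" scores are those of the example; the "Beijing" record, the identifiers `model_1` through `model_3`, the function names, and the threshold of 1.0 are assumptions introduced only for illustration.

```python
from collections import defaultdict

# Each record: (extraction model, target text, candidate value, confidence score).
# The "Shanghai" rows reproduce the example above; the "Beijing" row and the
# threshold used below are hypothetical values for illustration only.
extractions = [
    ("model_1", "150-1", "Shanghai", 0.6),
    ("model_1", "150-2", "Shanghai", 0.5),
    ("model_2", "150-2", "Shanghai", 0.8),
    ("model_2", "150-4", "Shanghai", 0.7),
    ("model_3", "150-3", "Shanghai", 0.6),
    ("model_3", "150-1", "Beijing", 0.4),
]


def aggregate_confidence(records):
    """Sum the per-(model, text) scores for each candidate attribute value."""
    totals = defaultdict(float)
    for _model, _text, value, score in records:
        totals[value] += score
    return dict(totals)


def select_values(totals, threshold):
    """Keep candidate values whose summed confidence is above the threshold."""
    return [value for value, total in totals.items() if total > threshold]


totals = aggregate_confidence(extractions)
selected = select_values(totals, threshold=1.0)
```

Because the scores are summed rather than averaged, a value supported by many (model, text) pairs accumulates a higher confidence than a value supported by a single pair with the same individual score.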

Fig. 3 shows a schematic block diagram of an apparatus 300 for extracting entity-related information according to an embodiment of the present disclosure. In some embodiments, the apparatus 300 may be included in the computing device 120 of fig. 1 or implemented as the computing device 120.

As shown in fig. 3, the apparatus 300 includes a candidate text obtaining module 310, a target text determining module 320, and an attribute value determining module 330. The candidate text obtaining module 310 is configured to obtain a plurality of candidate texts associated with the predetermined entity and the predetermined attribute. The target text determination module 320 is configured to determine at least one target text from the plurality of candidate texts based on semantics of an entity-attribute pair formed by a predetermined entity and a predetermined attribute. The attribute value determination module 330 is configured to determine an attribute value of a predetermined attribute of the predetermined entity based on the at least one target text.

In some embodiments, the candidate text obtaining module 310 includes: a search term determination module configured to determine an entity search term corresponding to the predetermined entity and an attribute search term corresponding to the predetermined attribute; and a retrieval module configured to retrieve the plurality of candidate texts from a text library by using the entity search term and the attribute search term.

In some embodiments, the entity search term includes at least one of a name and an alias of the predetermined entity, and the attribute search term includes at least one of a name, an alias, and a lead term of the predetermined attribute, the lead term being a term used to introduce the predetermined attribute of the predetermined entity.
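By way of a minimal sketch, the search term determination and retrieval described above might look as follows. The class names, the in-memory list standing in for the text library, the substring matching, and the requirement that a text match both an entity search term and an attribute search term are all assumptions made for illustration; the disclosure does not prescribe a particular retrieval mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    aliases: list = field(default_factory=list)


@dataclass
class Attribute:
    name: str
    aliases: list = field(default_factory=list)
    lead_terms: list = field(default_factory=list)


def entity_search_terms(entity):
    """Entity search terms: the name plus any aliases."""
    return [entity.name, *entity.aliases]


def attribute_search_terms(attribute):
    """Attribute search terms: the name, aliases, and lead terms."""
    return [attribute.name, *attribute.aliases, *attribute.lead_terms]


def retrieve_candidates(text_library, entity, attribute):
    """Return texts mentioning at least one entity term and one attribute term."""
    e_terms = entity_search_terms(entity)
    a_terms = attribute_search_terms(attribute)
    return [
        text
        for text in text_library
        if any(t in text for t in e_terms) and any(t in text for t in a_terms)
    ]


# Hypothetical text library and search terms.
library = [
    "Yao Ming was born in Shanghai.",
    "Yao Ming played basketball.",
    "The poet was born in a small town.",
]
entity = Entity(name="Yao Ming")
attribute = Attribute(
    name="place of birth", aliases=["birthplace"], lead_terms=["born in"]
)
candidates = retrieve_candidates(library, entity, attribute)
```

In this sketch the lead term "born in" is what allows the first text to be retrieved even though it contains neither the attribute name nor its alias.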

In some embodiments, the apparatus 300 further comprises: a predetermined entity determination module configured to determine a newly-appearing entity or an entity having a search frequency higher than a threshold as a predetermined entity; and a predetermined attribute determination module configured to determine a predetermined attribute based on the predetermined entity.
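One possible, non-limiting way to express this selection is sketched below. The set of already-known entities, the search counts, the threshold value, and the mapping from an entity type to its attributes are hypothetical; in particular, deriving attributes from an entity type is only one assumed way of determining the predetermined attribute based on the predetermined entity.

```python
# Hypothetical mapping from entity type to the attributes worth extracting.
ATTRIBUTES_BY_TYPE = {
    "person": ["date of birth", "place of birth"],
    "film": ["director", "release date"],
}


def select_predetermined_entities(search_counts, known_entities, frequency_threshold):
    """Pick entities that are newly appearing or searched more often than the threshold."""
    return sorted(
        name
        for name, count in search_counts.items()
        if name not in known_entities or count > frequency_threshold
    )


def predetermined_attributes(entity_type):
    """Look up the attributes to extract for an entity of the given type."""
    return ATTRIBUTES_BY_TYPE.get(entity_type, [])


# Illustrative inputs: entity_a is new, entity_b is frequently searched.
search_counts = {"entity_a": 12, "entity_b": 950, "entity_c": 3}
known_entities = {"entity_b", "entity_c"}
chosen = select_predetermined_entities(
    search_counts, known_entities, frequency_threshold=500
)
```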

In some embodiments, for a given candidate text of the plurality of candidate texts, target text determination module 320 includes: a processing module configured to process a given candidate text to determine semantics of the given candidate text; a similarity determination module configured to determine a similarity between semantics of a given candidate text and semantics of an entity attribute pair; and a target text selection module configured to select the given candidate text as one of the at least one target text in response to the similarity being above a threshold.
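The thresholded selection described above can be sketched as follows. A bag-of-words vector with cosine similarity is used here merely as a stand-in for the semantic representation, which the disclosure leaves to a semantic model; the function names and the threshold of 0.5 are assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy semantic representation: a lower-cased bag of words (an assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_target_texts(candidates, entity, attribute, threshold):
    """Keep candidate texts whose similarity to the entity-attribute pair exceeds the threshold."""
    pair_vector = embed(f"{entity} {attribute}")
    return [t for t in candidates if cosine(embed(t), pair_vector) > threshold]


candidate_texts = [
    "Yao Ming place of birth is Shanghai",
    "Weather report for tomorrow",
]
targets = select_target_texts(
    candidate_texts, "Yao Ming", "place of birth", threshold=0.5
)
```

A learned sentence encoder could replace `embed` without changing the selection logic, since only the similarity score is compared against the threshold.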

In some embodiments, the attribute value determination module 330 includes: an attribute value extraction module configured to extract a plurality of candidate attribute values from at least one target text based on a predetermined entity and a predetermined attribute using a plurality of different extraction models having different model structures; a confidence determination module configured to determine confidence of the plurality of candidate attribute values; and an attribute value selection module configured to select an attribute value from the plurality of candidate attribute values for which the confidence is above a threshold.

In some embodiments, the at least one target text comprises a plurality of target texts, and for a given candidate attribute value of the plurality of candidate attribute values, the confidence determination module comprises: a pair determination module configured to determine a plurality of pairs, each pair consisting of an extraction model and a target text from which that extraction model extracted the given candidate attribute value; a score obtaining module configured to obtain a plurality of confidence scores of the given candidate attribute value respectively associated with the plurality of pairs; and a summing module configured to sum the plurality of confidence scores to obtain a confidence for the given candidate attribute value.

FIG. 4 shows a schematic block diagram of a general technical framework 400 for extracting attribute values of entity attributes in accordance with an embodiment of the present disclosure. As shown in fig. 4, the general technical framework 400 may include an attribute value extraction tool 401 and an external tool 403. In some embodiments, the attribute value extraction tool 401 may utilize the external tool 403 to implement embodiments of the present disclosure, such as the method 200 described with respect to fig. 2. For example, upon receiving the predetermined entity-attribute pair 405 as input, the attribute value extraction tool 401 may perform targeted extraction, from a text library, of the attribute value 407 corresponding to the predetermined entity and the predetermined attribute.

The attribute value extraction tool 401 includes a text retrieval module 410, a text validity classification module 420, an attribute value extraction model 430, and a multi-source fusion module 440. The modules of the attribute value extraction tool 401 may utilize the retrieval interface 450 of the external tool 403, the library scanning tool 460, the dependency analysis and part-of-speech recognition module 470, the subgraph association module 480, and the deep learning framework 490 to implement the extraction of the attribute values 407, as described in detail below.

The main function of the text retrieval module 410 may include obtaining, based on the input predetermined entity-attribute pair 405, the corpus text from which attribute values are to be extracted, for example, through the retrieval interface 450 and the library scanning tool 460 (such as a seeksign library scanning tool). The text retrieval module 410 supports obtaining text information related to the predetermined entity-attribute pair from text retrieval models for multiple sources, and can readily be extended with additional models.

In addition, because different entities often share the same name, the text retrieval module 410 may combine two text acquisition manners, one at entity granularity and one at text granularity. Entity granularity refers to acquiring only the text information corresponding to the input entity, without considering other entities of the same name; text granularity refers to considering the text information corresponding to all entities of the same name at once. In some embodiments, the text retrieval models of the text retrieval module 410 may fall into four categories: encyclopedia text, entity pages, question-and-answer text libraries, and web page results obtained from search results, where the first two may be entity-granular and the latter two may be text-granular.
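The difference between the two granularities may be illustrated with the following sketch. The record layout, the source labels, the entity identifiers, and the sample texts are all hypothetical; the sketch assumes that entity-granular sources carry an entity identifier while text-granular sources are matched by name only.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the four source categories described above.
ENTITY_GRANULAR_SOURCES = {"encyclopedia", "entity_page"}
TEXT_GRANULAR_SOURCES = {"qa_library", "search_results"}


@dataclass(frozen=True)
class TextRecord:
    source: str
    text: str
    entity_id: Optional[str] = None  # set only for entity-granular sources


def collect_texts(records, entity_id, entity_name):
    """Combine entity-granular and text-granular acquisition for one input entity."""
    selected = []
    for record in records:
        if record.source in ENTITY_GRANULAR_SOURCES:
            # Entity granularity: keep only text bound to this exact entity.
            if record.entity_id == entity_id:
                selected.append(record.text)
        elif record.source in TEXT_GRANULAR_SOURCES:
            # Text granularity: keep text for any entity sharing the name.
            if entity_name in record.text:
                selected.append(record.text)
    return selected


records = [
    TextRecord("encyclopedia", "Zhang Wei (person A) profile text", entity_id="person_a"),
    TextRecord("encyclopedia", "Zhang Wei (person B) profile text", entity_id="person_b"),
    TextRecord("qa_library", "Where was Zhang Wei born?"),
    TextRecord("search_results", "Unrelated page about the weather"),
]
texts = collect_texts(records, entity_id="person_a", entity_name="Zhang Wei")
```

Text collected at text granularity may refer to a different entity of the same name, which is one reason a subsequent filtering stage is useful.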

The main functions of the text validity classification module 420 may include filtering and classifying all text obtained by the text retrieval module 410, so as to reduce the amount of text sent to subsequent modules and to retain only text that is related to the predetermined entity-attribute pair and from which an attribute value can be extracted, thereby improving the performance and robustness of the system. In some embodiments, the text validity classification module 420 may implement, for example, a semantics-independent preliminary filtering function, a semantic information acquisition function, a semantics-dependent classification function, and the like.

The semantics-independent preliminary filtering function may perform preliminary filtering based on features such as whether the text contains the name of the entity (including the name and aliases of the entity), whether the text contains the name of the attribute (including the name, aliases, and lead terms of the attribute), the length of the text, and the proportion of Chinese characters in the text. The semantic information acquisition function may, for example, obtain word segmentation and part-of-speech recognition results through a part-of-speech recognition tool, obtain dependency recognition results for sentences through a dependency analysis tool, and obtain entity recognition and upper-level concept recognition results through a subgraph association tool. The semantics-dependent classification function may, for example, invoke a semantics-dependent text validity classification model to compute semantic features, and invoke a classification algorithm to determine whether the text is semantically related to the predetermined entity and the predetermined attribute, thereby filtering out semantically irrelevant text.
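A minimal sketch of the semantics-independent preliminary filtering function is given below. The four features are those named above; the function names, the length bounds, the minimum Chinese character proportion, and the use of the CJK Unified Ideographs range U+4E00 to U+9FFF to count Chinese characters are assumptions for illustration.

```python
def chinese_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs range U+4E00..U+9FFF."""
    if not text:
        return 0.0
    return sum("\u4e00" <= ch <= "\u9fff" for ch in text) / len(text)


def passes_preliminary_filter(
    text,
    entity_terms,
    attribute_terms,
    min_len=5,
    max_len=200,
    min_chinese_ratio=0.5,
):
    """Cheap checks that need no semantic analysis; all thresholds are hypothetical."""
    return (
        min_len <= len(text) <= max_len
        and any(term in text for term in entity_terms)
        and any(term in text for term in attribute_terms)
        and chinese_ratio(text) >= min_chinese_ratio
    )


# Entity terms: name of the entity. Attribute terms: attribute name and a lead term.
entity_terms = ["姚明"]
attribute_terms = ["出生地", "出生于"]
kept = passes_preliminary_filter("姚明出生于上海市徐汇区", entity_terms, attribute_terms)
dropped = passes_preliminary_filter("姚明 was born in Shanghai", entity_terms, attribute_terms)
```

Since these checks are inexpensive, running them first reduces the number of texts that must go through the costlier semantic information acquisition and classification steps.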

The main function of the attribute value extraction model 430 may include, given a predetermined entity, a predetermined attribute, and a text from which an attribute value is to be extracted, extracting from the text the attribute value corresponding to the entity-attribute pair. The attribute value extraction model 430 supports the addition of a plurality of extraction models, each of which produces its own results, so that further models are easy to add.
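One way such extensibility could be arranged is sketched below using a registry of extractor functions. The registry, the decorator, the two regular-expression extractors, and their fixed confidence scores are hypothetical stand-ins; the disclosure contemplates extraction models having different model structures and does not limit them to rules of this kind.

```python
import re

# Registry of extraction models; adding a model only requires registering it.
EXTRACTORS = {}


def register(name):
    """Decorator that adds an extractor function to the registry."""
    def decorator(fn):
        EXTRACTORS[name] = fn
        return fn
    return decorator


@register("born_in_pattern")
def born_in_pattern(entity, attribute, text):
    """Toy rule for 'place of birth'; returns (value, score) candidates."""
    if attribute != "place of birth" or entity not in text:
        return []
    match = re.search(r"born in ([A-Z][a-z]+)", text)
    return [(match.group(1), 0.6)] if match else []


@register("is_from_pattern")
def is_from_pattern(entity, attribute, text):
    """Second toy rule with a different surface pattern."""
    if attribute != "place of birth" or entity not in text:
        return []
    match = re.search(r"is from ([A-Z][a-z]+)", text)
    return [(match.group(1), 0.5)] if match else []


def run_all(entity, attribute, text):
    """Run every registered model and tag each candidate with its model name."""
    results = []
    for name, extractor in EXTRACTORS.items():
        for value, score in extractor(entity, attribute, text):
            results.append((name, value, score))
    return results


results = run_all("Yao Ming", "place of birth", "Yao Ming was born in Shanghai")
```

Keeping the model name with each candidate preserves the (extraction model, target text) pairing needed when confidence scores are later combined.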

The input of the multi-source fusion module 440 may be (entity, attribute, text, attribute value) tuples, and the output may be the attribute value for each entity-attribute pair. Its main functions may include invoking a knowledge fusion model for each entity-attribute pair to perform multi-source fusion and preferred selection on the attribute values generated by the plurality of attribute value extraction models from the plurality of target texts, and finally outputting the attribute value 407. In the multi-source fusion module 440, the extraction results of any further extraction models added to the attribute value extraction model 430 can readily be included among the candidate attribute values participating in the preferred selection.
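The input and output shapes of the fusion step may be illustrated as follows. Counting the number of distinct supporting texts is used here only as a simple stand-in for the knowledge fusion model, whose internals are not specified; the function name, the tie-breaking rule, and the sample records are assumptions.

```python
from collections import defaultdict


def fuse(records):
    """Map (entity, attribute, text, value) tuples to one value per entity-attribute pair.

    The preferred value is the one supported by the most distinct texts.
    """
    support = defaultdict(lambda: defaultdict(set))
    for entity, attribute, text, value in records:
        support[(entity, attribute)][value].add(text)

    fused = {}
    for pair, by_value in support.items():
        # Iterating over sorted values makes ties resolve deterministically.
        fused[pair] = max(sorted(by_value), key=lambda v: len(by_value[v]))
    return fused


# Hypothetical extraction results gathered from three target texts.
records = [
    ("Yao Ming", "place of birth", "text_1", "Shanghai"),
    ("Yao Ming", "place of birth", "text_2", "Shanghai"),
    ("Yao Ming", "place of birth", "text_3", "China"),
]
fused = fuse(records)
```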

Fig. 5 schematically illustrates a block diagram of a device 500 that may be used to implement embodiments of the present disclosure. As shown in fig. 5, the device 500 includes a Central Processing Unit (CPU) 501 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The CPU 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The various procedures and processes described above, such as the method 200, may be performed by the CPU 501. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the CPU 501, one or more steps of the method 200 described above may be performed.

As used herein, the terms "comprises," "comprising," and the like are to be construed as open-ended inclusions, i.e., "including, but not limited to." The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions may also be included herein.

As used herein, the term "determining" encompasses a wide variety of actions. For example, "determining" can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Further, "determining" can include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Further, "determining" may include resolving, selecting, choosing, establishing, and the like.

It should be noted that the embodiments of the present disclosure can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, in programmable memory or on a data carrier such as an optical or electronic signal carrier.

Further, while the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution. It should also be noted that the features and functions of two or more apparatuses according to the present disclosure may be embodied in one apparatus. Conversely, the features and functions of one apparatus described above may be further divided so as to be embodied by a plurality of apparatuses.

While the present disclosure has been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method of extracting entity-related information, comprising:

obtaining a plurality of candidate texts associated with a predetermined entity and a predetermined attribute, the predetermined entity, the predetermined attribute and a corresponding attribute value to be determined constituting a group of a knowledge graph;

determining at least one target text from the plurality of candidate texts based on semantics of an entity attribute pair formed by the predetermined entity and the predetermined attribute;

determining the attribute value of the predetermined attribute of the predetermined entity based on the at least one target text; and

updating the group of the knowledge graph based on the attribute value.

2. The method of claim 1, wherein obtaining the plurality of candidate texts comprises:

determining an entity search word corresponding to the predetermined entity and an attribute search word corresponding to the predetermined attribute; and

retrieving the plurality of candidate texts from a text library using the entity search word and the attribute search word.

3. The method of claim 2, wherein the entity search word comprises at least one of a name and an alias of the predetermined entity, and the attribute search word comprises at least one of a name, an alias, and a lead term of the predetermined attribute, the lead term being used to introduce the predetermined attribute of the predetermined entity.

4. The method of claim 1, further comprising:

determining a newly appeared entity or an entity with a search frequency higher than a threshold value as the predetermined entity; and

determining the predetermined attribute based on the predetermined entity.

5. The method of claim 1, wherein determining the at least one target text comprises: for a given candidate text of the plurality of candidate texts,

processing the given candidate text to determine semantics of the given candidate text;

determining a similarity between semantics of the given candidate text and semantics of the entity-attribute pair; and

in response to the similarity being above a threshold, selecting the given candidate text as one of the at least one target text.

6. The method of claim 1, wherein determining the attribute value comprises:

extracting a plurality of candidate attribute values from the at least one target text based on the predetermined entity and the predetermined attribute using a plurality of different extraction models having different model structures;

determining confidence levels for the plurality of candidate attribute values; and

selecting, from the plurality of candidate attribute values, an attribute value having a confidence level above a threshold.

7. The method of claim 6, wherein the at least one target text comprises a plurality of target texts, and wherein determining confidence levels for the plurality of candidate attribute values comprises: for a given candidate attribute value of the plurality of candidate attribute values,

determining a plurality of pairs, each pair consisting of an extraction model and a target text from which the extraction model extracted the given candidate attribute value;

obtaining a plurality of confidence scores for the given candidate attribute value respectively associated with the plurality of pairs; and

adding the plurality of confidence scores to obtain a confidence for the given candidate attribute value.

8. An apparatus for extracting entity-related information, comprising:

a candidate text obtaining module configured to obtain a plurality of candidate texts associated with a predetermined entity and a predetermined attribute, the predetermined entity, the predetermined attribute and a corresponding attribute value to be determined constituting a group of a knowledge graph;

a target text determination module configured to determine at least one target text from the plurality of candidate texts based on semantics of an entity attribute pair formed by the predetermined entity and the predetermined attribute; and

an attribute value determination module configured to determine the attribute value of the predetermined attribute of the predetermined entity based on the at least one target text, the attribute value being used to update the group of the knowledge graph.

9. The apparatus of claim 8, wherein the candidate text obtaining module comprises:

a search term determination module configured to determine an entity search term corresponding to the predetermined entity and an attribute search term corresponding to the predetermined attribute; and

a retrieval module configured to retrieve the plurality of candidate texts from a text library using the entity search term and the attribute search term.

10. The apparatus of claim 9, wherein the entity search term comprises at least one of a name and an alias of the predetermined entity, and the attribute search term comprises at least one of a name, an alias, and a lead term of the predetermined attribute, the lead term being used to introduce the predetermined attribute of the predetermined entity.

11. The apparatus of claim 8, further comprising:

a predetermined entity determination module configured to determine a newly-appearing entity or an entity having a search frequency higher than a threshold as the predetermined entity; and

a predetermined attribute determination module configured to determine the predetermined attribute based on the predetermined entity.

12. The apparatus of claim 8, wherein for a given candidate text of the plurality of candidate texts, the target text determination module comprises:

a processing module configured to process the given candidate text to determine semantics of the given candidate text;

a similarity determination module configured to determine a similarity between semantics of the given candidate text and semantics of the entity-attribute pair; and

a target text selection module configured to select the given candidate text as one of the at least one target text in response to the similarity being above a threshold.

13. The apparatus of claim 8, wherein the attribute value determination module comprises:

an attribute value extraction module configured to extract a plurality of candidate attribute values from the at least one target text based on the predetermined entity and the predetermined attribute using a plurality of different extraction models having different model structures;

a confidence determination module configured to determine confidence of the plurality of candidate attribute values; and

an attribute value selection module configured to select an attribute value from the plurality of candidate attribute values having a confidence level above a threshold.

14. The apparatus of claim 13, wherein the at least one target text comprises a plurality of target texts, and wherein for a given candidate attribute value of the plurality of candidate attribute values, the confidence determination module comprises:

a pair determination module configured to determine a plurality of pairs, each pair consisting of an extraction model and a target text from which the extraction model extracted the given candidate attribute value;

a score obtaining module configured to obtain a plurality of confidence scores of the given candidate attribute value respectively associated with the plurality of pairs; and

a summing module configured to sum the plurality of confidence scores to obtain a confidence for the given candidate attribute value.

15. An electronic device, comprising:

one or more processors; and

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1-7.

16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.