CN110134965B - Method, apparatus, device and computer readable storage medium for information processing - Google Patents


Info

Publication number
CN110134965B
CN110134965B (application CN201910426142.7A; earlier publication CN110134965A)
Authority
CN
China
Prior art keywords
entity
representation
similarity
comparison
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910426142.7A
Other languages
Chinese (zh)
Other versions
CN110134965A (en)
Inventor
方舟
冯知凡
张扬
陆超
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910426142.7A priority Critical patent/CN110134965B/en
Publication of CN110134965A publication Critical patent/CN110134965A/en
Application granted granted Critical
Publication of CN110134965B publication Critical patent/CN110134965B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to example embodiments of the present disclosure, a method, apparatus, device, and computer-readable storage medium for information processing are provided. A method for information processing comprises: acquiring features of a first entity and features of a second entity; generating a first entity representation based on the features of the first entity; generating a second entity representation based on the features of the second entity; determining a feature similarity between the features of the first entity and the features of the second entity; and determining an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity. In this way, the scheme combines a similarity model over the descriptive information of the entities with the similarity of the features of the entities, achieving more accurate entity disambiguation.

Description

Method, apparatus, device and computer readable storage medium for information processing
Technical Field
Embodiments of the present disclosure relate generally to the field of information processing and, more particularly, relate to a method, apparatus, device, and computer-readable storage medium for information processing.
Background
With the rapid development of network technology, the amount of information keeps growing, and so does the need to obtain requested information accurately. However, users' search results are often inaccurate due to ambiguity in natural language. Existing disambiguation schemes fail to meet users' search needs, degrading the search experience.
Disclosure of Invention
According to an example embodiment of the present disclosure, a scheme for information processing is provided.
In a first aspect of the present disclosure, there is provided a method for information processing, including: acquiring characteristics of a first entity and characteristics of a second entity; generating a first entity representation based on the characteristics of the first entity; generating a second entity representation based on the characteristics of the second entity; determining feature similarities between features of the first entity and features of the second entity; and determining an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity.
In a second aspect of the present disclosure, there is provided an information processing apparatus including: the feature acquisition module is configured to acquire features of the first entity and features of the second entity; a first entity representation generation module configured to generate a first entity representation based on a characteristic of the first entity; a second entity representation generation module configured to generate a second entity representation based on characteristics of the second entity; a feature similarity determination module configured to determine feature similarities between features of the first entity and features of the second entity; and an entity similarity determination module configured to determine an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity.
In a third aspect of the present disclosure, an electronic device is provided that includes one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method according to the first aspect of the present disclosure.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a schematic diagram of one example environment in which embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flow chart for information processing according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a similarity model, according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic block diagram of an apparatus for information processing according to some embodiments of the present disclosure; and
fig. 5 illustrates a block diagram of a computing device capable of implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
In describing embodiments of the present disclosure, the term "comprising" and its like should be taken to be open-ended, i.e., including, but not limited to. The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
The term "entity" may refer to an article, segment, or other piece of content on a network that carries relatively little information, such as a semi-structured or unstructured article, segment, or other content.
Conventionally, there are two solutions for disambiguating entities. The first is based on supervised models and targets disambiguation of structured entities with rich information. Because it relies on rich structured entity information, this approach struggles to capture deep semantic information when processing semi-structured or unstructured entities with little entity information, and thus handles problems involving such entities poorly. Furthermore, it requires a large amount of annotation data and complex feature construction, resulting in high cost.
The second is a template and rule matching based approach. For example, it may disambiguate an entity simply by matching certain features of the entity (such as name or alias). Although this approach has high accuracy, it cannot be flexibly applied to various types of entities: if a new type of entity appears, a new template or rule set must be generated, resulting in high migration cost and poor scalability. For this purpose, a solution for entity disambiguation is proposed herein.
In general, according to embodiments of the present disclosure, an entity to be disambiguated (referred to herein as a "first entity") and an ambiguous entity that may be identical to the first entity (referred to herein as a "second entity") may be obtained. The first entity may be a semi-structured or unstructured entity that is less informative, such as an article on a network. Similarly, the second entity may also be a less informative semi-structured or unstructured entity, such as another article on a network.
A selection of the features of the first entity and the second entity to be compared may then be received, and the selected features compared using a rule method (e.g., exact comparison, edit distance comparison, time comparison, text similarity comparison, co-occurrence comparison, numerical comparison, or type comparison).
In addition, a similarity model approach may also be used to determine the probability that the first entity and the second entity are the same entity. In the similarity model method, a depth model such as a twin (siamese) neural network model may be employed. A conventional depth model uses only the textual descriptions, or description information, of the first entity and the second entity to determine the probability that they are the same entity. Unlike the conventional depth model, however, the present solution may determine this probability by combining, in addition to the description information of the two entities, the similarity between the features of the first entity and the features of the second entity and the probability that an ambiguous entity exists for the first entity.
In this way, the present solution may combine rule methods and similarity model methods to perform entity disambiguation to achieve more accurate entity disambiguation. Further, in the similarity model method, the scheme can combine the similarity model of the descriptive information of the entities and the similarity of the characteristics of the entities to more accurately determine the similarity between the entities.
Hereinafter, specific examples of the present scheme will be described in more detail with reference to figs. 1 to 5. FIG. 1 illustrates a schematic diagram of one example environment 100 in which embodiments of the present disclosure may be implemented. The environment 100 includes a computing device 110 and a storage device 120. The computing device 110 may include, but is not limited to, any computing-capable device such as a cloud computing device, mainframe computer, server, personal computer, desktop computer, laptop computer, tablet computer, or personal digital assistant. The storage device 120 may include any physical or virtual storage device with storage capability, such as a database, cloud storage, magnetic storage, optical storage, and the like.
The computing device 110 may obtain the first entity 130 from the storage device 120. As described above, the first entity 130 may be a semi-structured or unstructured entity that is less informative, such as an article on a network. The computing device 110 may pre-process the first entity 130 to represent the first entity 130 using several features (e.g., identification, type, descriptive information, key-value pairs, related entities, and multimedia content).
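The feature slots named above can be sketched as a simple container. The following is an illustrative Python sketch only; the field names are hypothetical and do not reflect the patent's actual data layout:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Hypothetical container for the features named above."""
    identification: str = ""                          # name or alias
    entity_type: str = ""                             # type or type-related tag
    description: str = ""                             # textual description
    key_values: dict = field(default_factory=dict)    # attribute key-value pairs
    related: set = field(default_factory=set)         # related entities
    multimedia: list = field(default_factory=list)    # images and links

def preprocess(raw: dict) -> "Entity":
    """Map a raw semi-structured record onto the feature slots,
    leaving a slot empty when the record lacks that feature."""
    return Entity(
        identification=raw.get("name", ""),
        entity_type=raw.get("type", ""),
        description=raw.get("description", ""),
        key_values=raw.get("attributes", {}),
        related=set(raw.get("related", [])),
        multimedia=raw.get("media", []),
    )
```

A record missing a feature (e.g., lacking a type) simply leaves that slot empty, matching the pre-processing behavior described for sparsely informative entities.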
In addition, the computing device 110 may also obtain a second entity 140 from the storage device 120 that may be the same as the first entity 130. Similarly, the second entity 140 may also be a semi-structured or unstructured entity that is less informative, such as another article on a network. Thus, in some embodiments, the computing device 110 may also pre-process the second entity 140 to represent the second entity 140 using several features.
The computing device 110 may then receive a selection of the features of the first entity 130 and the second entity 140 to be compared, and compare the selected features of the first entity 130 and the second entity 140 using a rule method (e.g., exact comparison, edit distance comparison, time comparison, text similarity comparison, co-occurrence comparison, numerical comparison, or type comparison).
In addition, the computing device 110 may also determine the probability that the first entity 130 and the second entity 140 are the same entity using a similarity model method. In the similarity model approach, a depth model, such as a twin (siamese) neural network model, may be employed to determine the similarity 150 between the first entity 130 and the second entity 140. The twin neural network model may include two mapping units having the same parameters (e.g., using a bidirectional long short-term memory (Bi-LSTM) model), a fully connected unit (e.g., a fully connected layer), and a classification unit (e.g., using a softmax model).
In particular, the computing device 110 may generate a vector representing the description information of the first entity 130 (referred to herein as the "first description information") and a vector representing the description information of the second entity 140 (referred to herein as the "second description information"), these vectors being referred to herein as the "first textual representation" and the "second textual representation," respectively. In some embodiments, the computing device 110 may apply the first description information and the second description information to a word vector unit to map them into the first textual representation and the second textual representation. The word vector unit may use, for example, a BERT model, but is not limited thereto; it may also use, for example, a Word2Vec model, an ELMo model, or the like.
The computing device 110 may apply the generated first and second text representations to two mapping units (referred to herein as "first mapping unit" and "second mapping unit," respectively) having the same parameters to generate entity representations (referred to herein as "first entity representation" and "second entity representation," respectively) for the first and second entities 130, 140, such as low-dimensional hidden layer vectors of the first and second descriptive information. The mapping unit may use, for example, a Bi-LSTM model, but is not limited thereto, and the mapping unit may also use, for example, a CNN model, an RNN model, or the like. The computing device 110 may then apply the first entity representation and the second entity representation to the fully connected unit.
Further, the computing device 110 may also generate similarities between features of the first entity 130 and features of the second entity 140 (referred to herein as "feature similarities"). In some embodiments, feature similarity may be characterized by: co-occurrence of characteristics of the entities, similarity of description information of the entities, type consistency of types of the entities, similarity of multimedia contents of the entities, and the like.
In addition, the computing device 110 may also obtain a predetermined probability that an ambiguous entity exists for the first entity 130. The computing device 110 may then likewise apply the feature similarity and this probability to the fully connected unit.
The fully connected representation of the first entity representation, the second entity representation, the feature similarity, and the probability that an ambiguous entity exists may then be applied to the classification unit to determine the similarity 150 between the first entity 130 and the second entity 140 (referred to herein as the "entity similarity"). The classification unit may use, for example, a softmax model, but is not limited thereto; it may also use, for example, a contrastive loss model, a cosine distance model, a Euclidean distance model, or the like.
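The flow just described (concatenating the two entity representations with the feature similarities and the ambiguity probability, passing the result through a fully connected layer, and classifying with softmax) can be sketched in plain Python as follows. The dimensions, weights, and function names are illustrative assumptions; a real implementation would use trained Bi-LSTM outputs as the entity representations and learned weights in the fully connected layer:

```python
import math
import random

random.seed(0)

def softmax(xs):
    # subtract the max before exponentiating, for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entity_similarity(first_repr, second_repr, feature_sims, ambiguity_prob, W, b):
    # concatenate both entity representations, the feature similarities,
    # and the prior probability that an ambiguous entity exists
    x = list(first_repr) + list(second_repr) + list(feature_sims) + [ambiguity_prob]
    # fully connected layer followed by softmax over [different, same]
    logits = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return softmax(logits)

# toy dimensions: 8-dimensional entity representations, 4 feature similarities
d, k = 8, 4
n_in = 2 * d + k + 1
W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(2)]
b = [0.0, 0.0]
first = [random.uniform(-1, 1) for _ in range(d)]
second = [random.uniform(-1, 1) for _ in range(d)]
p_different, p_same = entity_similarity(first, second, [0.5, 0.9, 1.0, 0.2], 0.3, W, b)
```

The softmax output gives complementary probabilities for "different entities" and "same entity"; the second value plays the role of the entity similarity 150.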
In this way, the present solution may combine rule methods and similarity model methods to achieve more accurate entity disambiguation. In a rule approach, the present approach may allow for receiving a selection of a feature to be compared of an entity such that a combination of the feature to be compared and the rule approach for comparison can be configured, thereby increasing the flexibility, adaptability and robustness of the present approach. Further, in the similarity model method, the similarity model of the description information of the entities and the similarity of the characteristics of the entities can be combined, so that the similarity between the entities can be determined more accurately.
Fig. 2 illustrates a flow chart of a method 200 for information processing according to some embodiments of the present disclosure. For example, the method 200 may be performed at the computing device 110 shown in fig. 1 or at another suitable device. Furthermore, the method 200 may include additional steps not shown and/or may omit steps shown; the scope of the present disclosure is not limited in this respect.
At 210, the computing device 110 obtains features of a first entity and features of a second entity. As described above, the first entity and the second entity may be semi-structured or unstructured entities that are less informative, such as an article on a network. In some embodiments, computing device 110 may extract features of the first entity from the first entity. Similarly, computing device 110 may extract features of the second entity from the second entity. For example, the computing device 110 may obtain the first entity and the second entity and pre-process the first entity and the second entity to represent the first entity and the second entity using several features. The feature may be, for example, an identification, a type, descriptive information, a key-value pair, a related entity, or multimedia content.
The identification may, for example, indicate a name or alias of the entity. The type may indicate a type of entity or a tag associated with the type. In some embodiments, if the acquired entity lacks a type, the computing device 110 will make a type prediction for the entity to determine the type of the entity.
The descriptive information may, for example, indicate a textual description for the entity. In some embodiments, if the retrieved entity lacks descriptive information, computing device 110 will generate the descriptive information using other characteristics of the entity. The key-value pair may be, for example, attribute information extracted from the description information, the attribute information being represented in the form of a key-value.
For example, a key-value pair extracted from the descriptive information "Qili Xiang is an album released by a certain musician around 2004" may be expressed as "(a certain musician, work, Qili Xiang)". In addition, predetermined categories of information in the description information, such as numbers, places, persons, and organizations, may be represented distinctly. For example, a key-value pair extracted from the descriptive information "a football player transferred to Juventus in 2018" may be expressed as "number: 2018, organization: Juventus".
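A minimal sketch of the differentiated representation of predetermined categories described above. Only four-digit numbers are recognized here, via a regex; the function name is hypothetical, and recognizing places, persons, and organizations would require a named-entity-recognition step beyond this sketch:

```python
import re

def differentiate_values(pairs):
    """Re-express extracted key-value pairs so that values in a
    predetermined category (here only four-digit numbers) are
    represented distinctly under a category key."""
    out = []
    for key, value in pairs:
        if re.fullmatch(r"\d{4}", value):
            out.append(("number", value))   # e.g. a year becomes ("number", "2018")
        else:
            out.append((key, value))        # other pairs pass through unchanged
    return out
```

A real implementation would add analogous branches (backed by NER) for places, persons, and organizations.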
The related-entity feature may, for example, indicate other entities related to the entity, such as entities the entity refers to and entities that refer to it. The multimedia content may, for example, comprise images of the entity and links referenced by it.
In some embodiments, the computing device 110 may perform entity disambiguation only on candidate entities associated with the first entity, but not on all entities, to reduce the computational complexity of entity disambiguation. For example, the computing device 110 may obtain an identification of the first entity based on the characteristics of the first entity. The computing device 110 may compare the identity of the first entity with the identity of the candidate entity. In the event that the identified similarity exceeds a predetermined threshold, the computing device 110 obtains features of the second entity for subsequent entity disambiguation.
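This candidate-gating step can be sketched as follows. The threshold value and function names are hypothetical, and the string-similarity measure here is Python's `difflib` ratio rather than whatever measure an actual implementation would use:

```python
from difflib import SequenceMatcher

def identification_similarity(first_id: str, candidate_id: str) -> float:
    """Similarity of two identifications (names/aliases), in [0, 1]."""
    return SequenceMatcher(None, first_id.lower(), candidate_id.lower()).ratio()

def select_candidates(first_id, candidate_ids, threshold=0.8):
    """Keep only candidates whose identification similarity to the first
    entity exceeds the threshold; only these proceed to full disambiguation,
    reducing computational complexity."""
    return [c for c in candidate_ids
            if identification_similarity(first_id, c) > threshold]
```

Gating on identification similarity means feature acquisition and the similarity model run only for plausible candidates, not for every entity in the store.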
The computing device 110 may then determine the entity similarity between the first entity and the second entity by a similarity model method. Hereinafter, the method 200 will be described in connection with the similarity model 300 shown in fig. 3. The similarity model 300 is used to determine the probability that the first entity and the second entity are the same entity. The similarity model may employ, for example, a twin neural network (siamese) model to determine similarity between the first entity and the second entity. The twin neural network model may include two mapping units 330 and 335 (e.g., bi-directional long-short term memory model (Bi-LSTM)), a fully connected unit 340 (e.g., fully connected layer), and a classification unit 350 (e.g., softmax model) having the same parameters.
At 220, the computing device 110 generates a first entity representation based on the characteristics of the first entity. In some embodiments, the computing device 110 may obtain first descriptive information for the first entity from the characteristics of the first entity. Computing device 110 may generate a first textual representation representing the first descriptive information based on the first descriptive information. The computing device 110 may then apply the first text representation to the first mapping unit 330 of the similarity model to map the first text representation to a first entity representation.
As an example of generating the first entity representation, the computing device 110 may apply the first description information to the first word vector unit 320 to map the first description information to the first textual representation. The first word vector unit 320 may use, for example, a BERT model, but is not limited thereto; it may also use, for example, a Word2Vec model, an ELMo model, or the like. The computing device 110 may then apply the generated first textual representation to the first mapping unit 330 to generate the first entity representation. The first mapping unit 330 may use, for example, a Bi-LSTM model, but is not limited thereto; it may also use, for example, a CNN model, an RNN model, or the like. The first entity representation may be, for example, a low-dimensional hidden-layer vector of the first description information.
At 230, the computing device 110 generates a second entity representation based on the characteristics of the second entity. In some embodiments, the computing device 110 may obtain second descriptive information for the second entity from the features of the second entity. Computing device 110 may generate a second textual representation representing the second descriptive information based on the second descriptive information. The computing device 110 may then apply the second textual representation to a second mapping unit 335 of the similarity model to map the second textual representation to a second entity representation. The first mapping unit 330 and the second mapping unit 335 have the same parameters.
Similar to generating the first entity representation, the computing device 110 may, when generating the second entity representation, apply the second description information to the second word vector unit 325 to map the second description information to the second textual representation. The computing device 110 may then apply the generated second textual representation to the second mapping unit 335 to generate the second entity representation. The second mapping unit 335 has the same parameters as the first mapping unit 330 and may use the same model, such as a Bi-LSTM model, a CNN model, an RNN model, or the like. The second entity representation may be, for example, a low-dimensional hidden-layer vector of the second description information.
At 240, the computing device 110 determines feature similarities between the features of the first entity and the features of the second entity. In some embodiments, the computing device 110 may apply the features of the first entity and the features of the second entity to the comparison unit 360 to determine the feature similarity. The feature similarity may be determined in a number of ways. In some embodiments, the computing device 110 may determine the co-occurrence of a feature of the first entity and a feature of the second entity, for example the co-occurrence of the related entities of the first entity with the related entities of the second entity. In other embodiments, the computing device 110 may determine a similarity of the description information of the first entity and the description information of the second entity, such as a Probabilistic Latent Semantic Analysis (PLSA) similarity of the description information, a similarity of substrings of the description information, and the like. In still other embodiments, the computing device 110 may determine a type consistency of the type of the first entity and the type of the second entity, e.g., whether the types are the same, whether one type is a parent or child of the other, whether the type predictions are consistent, etc. Additionally or alternatively, the computing device 110 may determine a similarity of the multimedia content of the first entity and the multimedia content of the second entity, e.g., a similarity of pictures of the entities, a similarity of links referenced by the entities, etc.
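Two of the feature-similarity signals above, co-occurrence of related entities and type consistency, can be sketched as follows. The Jaccard measure and the function names are illustrative assumptions, and the PLSA and multimedia similarities are omitted:

```python
def jaccard(a: set, b: set) -> float:
    """Co-occurrence of two related-entity sets as Jaccard overlap."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def type_consistency(t1: str, t2: str) -> float:
    """1.0 when the two entity types match exactly, else 0.0.
    A fuller version would also score parent/child type relations."""
    return 1.0 if t1 == t2 else 0.0

def feature_similarities(e1: dict, e2: dict) -> list:
    """One possible feature-similarity vector for two entities, each given
    as a dict with 'related' (set) and 'type' (str) slots."""
    return [jaccard(e1["related"], e2["related"]),
            type_consistency(e1["type"], e2["type"])]
```

The resulting vector is what would be concatenated into the fully connected unit alongside the two entity representations.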
At 250, the computing device 110 may determine an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity. In some embodiments, the computing device 110 may apply the first entity representation, the second entity representation, and the feature similarity to the fully-connected unit 340 of the similarity model to generate a fully-connected representation, e.g., a fully-connected vector. Further, in some embodiments, the computing device 110 may also obtain a probability of whether the first entity is an ambiguous entity. In this case, the computing device 110 may also apply the obtained probabilities and the first entity representation, the second entity representation, and the feature similarity to the fully connected unit 340 of the similarity model to generate a fully connected representation.
The computing device 110 may then apply the fully connected representation to the classification unit 350 to generate, as the entity similarity, the probability that the first entity and the second entity are the same entity. As described above, the classification unit 350 may be, for example, a softmax model, but is not limited thereto; it may also be, for example, a contrastive loss model, a cosine distance model, a Euclidean distance model, or the like. In the event that the entity similarity exceeds a predetermined threshold, the computing device 110 may determine that the first entity and the second entity are similar entities.
The similarity model method is described above. In some embodiments, the similarity model method may be performed in conjunction with a rule method. For example, in the rule method, the computing device 110 may receive a selection of the features of the first entity and the second entity to be compared. The computing device 110 may determine whether the first entity and the second entity are different entities by comparing the selected features of the first entity with the selected features of the second entity.
Such comparisons include exact comparison, edit distance comparison, time comparison, text similarity comparison, co-occurrence comparison, numerical comparison, and type comparison. The exact comparison may, for example, exactly match two strings to determine whether they are identical. The edit distance comparison may, for example, return the minimum edit (Levenshtein) distance between two strings, normalized to a continuous value between 0 and 1. The time comparison may, for example, use a customizable threshold: if the absolute value of the difference between the two values is less than the threshold, they are determined to be the same. The text similarity comparison may, for example, calculate the PLSA similarity of two values. The co-occurrence comparison may, for example, determine whether one string appears in another string. The numerical comparison may, for example, compare two floating-point numbers against a customizable threshold: if the absolute value of their difference is less than the threshold, they are determined to be the same. The type comparison may, for example, compare the similarity between the types of the two entities.
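A few of these rule comparisons can be sketched in Python as follows. The function names and the normalization of the edit distance to a 0-1 value are illustrative assumptions:

```python
def exact_comparison(a: str, b: str) -> bool:
    """Exactly match two strings."""
    return a == b

def edit_distance_comparison(a: str, b: str) -> float:
    """Levenshtein distance between two strings, normalized to [0, 1]
    (0 = identical, 1 = maximally different), via the standard
    dynamic-programming recurrence."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)  # substitution
                           ))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def numerical_comparison(x: float, y: float, threshold: float) -> bool:
    """Two numbers are the same if their difference is under the threshold."""
    return abs(x - y) < threshold

def cooccurrence_comparison(a: str, b: str) -> bool:
    """Whether one string appears inside the other."""
    return a in b or b in a
```

The time comparison would reuse the numerical form on timestamps; the text similarity and type comparisons would call out to the PLSA and type models described above.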
In some embodiments, where the first entity and the second entity are determined to be different entities using the rule method described above, the computing device 110 may perform a similarity model method to determine entity similarity. Alternatively, where the first entity and the second entity are the same entity, the computing device 110 may also perform a similarity model method to determine entity similarity. Further, although the similarity model method is described herein as being performed after the rule method, the order in which the similarity model method and the rule method are performed is not limited, e.g., the similarity model method may be performed before the rule method, or in parallel.
In this way, the present solution may combine the rule method and the similarity model method to perform more accurate entity disambiguation. Further, in the similarity model method, the solution can combine a deep model of the descriptive information of the entities with the similarity of the features of the entities to determine the similarity between the entities more accurately.
Fig. 4 shows a schematic block diagram of an apparatus 400 for information processing according to an embodiment of the disclosure. As shown in fig. 4, the apparatus 400 includes: a feature acquisition module 410 configured to acquire features of a first entity and features of a second entity; a first entity representation generation module 420 configured to generate a first entity representation based on the characteristics of the first entity; a second entity representation generation module 430 configured to generate a second entity representation based on the characteristics of the second entity; a feature similarity determination module 440 configured to determine feature similarities between features of the first entity and features of the second entity; and an entity similarity determination module 450 configured to determine an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity.
In some embodiments, the feature acquisition module 410 includes: a first entity feature acquisition module configured to extract features of the first entity from the first entity, the features including at least one of: identification, type, descriptive information, key value pairs, related entities, and multimedia content.
In some embodiments, the feature acquisition module 410 includes: the first entity mark acquisition module is configured to acquire the mark of a first entity; and a second entity characteristic acquisition module configured to acquire a characteristic of the second entity in response to an identity similarity between the identity of the first entity and the identity of the candidate entity exceeding a predetermined threshold.
In some embodiments, the first entity representation generation module 420 includes: a first descriptive information acquisition module configured to acquire first descriptive information for a first entity from a feature of the first entity; a first text representation generation module configured to generate a first text representation representing the first descriptive information based on the first descriptive information; and a first entity representation mapping module configured to apply the first text representation to a first mapping unit of the similarity model to map the first text representation to the first entity representation.
In some embodiments, the second entity representation generation module 430 includes: a second descriptive information acquisition module configured to acquire second descriptive information for a second entity from a feature of the second entity; a second text representation generation module configured to generate a second text representation representing the second descriptive information based on the second descriptive information; and a second entity representation mapping module configured to apply the second text representation to a second mapping unit of the similarity model to map the second text representation to the second entity representation, wherein the first mapping unit and the second mapping unit have the same parameters.
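The two mapping units sharing the same parameters can be sketched as a Siamese-style arrangement: both text representations pass through one function holding a single set of weights. The single-layer tanh mapping and the dimensions are assumptions for illustration, not the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # shared weights: text dim 8 -> entity dim 4
b = np.zeros(4)                  # shared bias

def mapping_unit(text_repr: np.ndarray) -> np.ndarray:
    """Map a text representation to an entity representation.

    Both the first and the second mapping unit call this same function,
    so the two units necessarily have the same parameters W and b."""
    return np.tanh(W @ text_repr + b)

first_text = rng.standard_normal(8)
second_text = rng.standard_normal(8)
first_entity_repr = mapping_unit(first_text)
second_entity_repr = mapping_unit(second_text)
```

Sharing parameters places both entity representations in the same embedding space, which is what makes them directly comparable downstream.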
In some embodiments, the entity similarity determination module 450 includes: a fully connected representation generation module configured to apply the first entity representation, the second entity representation, and the feature similarity to fully connected units of the similarity model to generate a fully connected representation; and a first entity similarity determination module configured to apply the fully connected representation to the classification unit to generate a probability that the first entity and the second entity are the same entity as the entity similarity.
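A hypothetical forward pass through the fully connected unit and the classification unit follows. The concatenation layout, the single ReLU layer, and the sigmoid output are illustrative assumptions; the patent only specifies that a fully connected representation is generated and that the classification unit emits a probability used as the entity similarity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entity_similarity(first_repr, second_repr, feature_sim,
                      W_fc, b_fc, w_cls, b_cls):
    """Return the probability that the two entities are the same entity."""
    # Fully connected unit: combine both entity representations and the
    # feature similarities into a single fully connected representation.
    x = np.concatenate([first_repr, second_repr, np.atleast_1d(feature_sim)])
    fc = np.maximum(0.0, W_fc @ x + b_fc)  # ReLU hidden layer
    # Classification unit: squash to a probability in (0, 1), which is
    # taken as the entity similarity.
    return float(sigmoid(w_cls @ fc + b_cls))
```

With all-zero parameters the output is exactly 0.5, the sigmoid's neutral point; trained parameters would push it toward 0 or 1.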
In some embodiments, the entity similarity determination module 450 includes: the probability acquisition module is configured to acquire the probability of whether the first entity has an ambiguous entity; and a second entity similarity determination module configured to determine entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, the feature similarity, and the probability
In some embodiments, the entity similarity determination module 450 includes: a selection receiving module configured to receive a selection of features of the first entity and the second entity to be compared; a comparison module configured to determine whether the first entity and the second entity are different entities by comparing the selected features of the first entity with the selected features of the second entity; and a third entity similarity determination module configured to determine the entity similarity in response to the first entity and the second entity being different entities.
Fig. 5 shows a schematic block diagram of an example device 500 that may be used to implement embodiments of the present disclosure. Device 500 may be used to implement computing device 110 of fig. 1. As shown, the device 500 includes a Central Processing Unit (CPU) 501 that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 501 performs the various methods and processes described above, such as process 200. For example, in some embodiments, the process 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by CPU 501, one or more steps of process 200 described above may be performed. Alternatively, in other embodiments, CPU 501 may be configured to perform process 200 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (14)

1. An information processing method, comprising:
acquiring characteristics of a first entity and characteristics of a second entity;
generating a first entity representation based on the characteristics of the first entity, wherein generating the first entity representation comprises: acquiring first descriptive information for the first entity from the characteristics of the first entity; generating a first text representation representing the first descriptive information based on the first descriptive information; and applying the first text representation to a first mapping unit of a similarity model to map the first text representation to the first entity representation;
generating a second entity representation based on the characteristics of the second entity, wherein generating the second entity representation comprises: acquiring second descriptive information for the second entity from the characteristics of the second entity; generating a second text representation representing the second descriptive information based on the second descriptive information; and applying the second text representation to a second mapping unit of the similarity model to map the second text representation to the second entity representation, wherein the first mapping unit and the second mapping unit have the same parameters;
determining feature similarities between features of the first entity and features of the second entity; and
determining an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity, wherein determining the entity similarity comprises: applying the first entity representation, the second entity representation, and the feature similarity to a fully connected unit of the similarity model to generate a fully connected representation; and applying the fully connected representation to a classification unit to generate a probability that the first entity and the second entity are the same entity as the entity similarity.
2. The method of claim 1, wherein obtaining the characteristics of the first entity comprises:
extracting features of the first entity from the first entity, the features including at least one of: identification, type, descriptive information, key value pairs, related entities, and multimedia content.
3. The method of claim 1, wherein obtaining the characteristics of the second entity comprises:
acquiring the identification of the first entity; and
acquiring the characteristics of the second entity in response to an identification similarity between the identification of the first entity and the identification of a candidate entity exceeding a predetermined threshold.
4. The method of claim 1, wherein determining the entity similarity comprises:
acquiring a probability of whether the first entity has an ambiguous entity; and
an entity similarity between the first entity and the second entity is determined based on the first entity representation, the second entity representation, the feature similarity, and the probability.
5. The method of claim 1, wherein determining the entity similarity comprises:
receiving a selection of characteristics of the first entity and the second entity to be compared;
determining whether the first entity and the second entity are different entities by comparing the selected characteristics of the first entity with the selected characteristics of the second entity; and
determining the entity similarity in response to the first entity and the second entity being different entities.
6. The method of claim 5, wherein the comparing comprises at least one of: exact comparison, edit distance comparison, time comparison, text similarity comparison, co-occurrence comparison, numerical comparison, and type comparison.
7. An information processing apparatus comprising:
the feature acquisition module is configured to acquire features of the first entity and features of the second entity;
a first entity representation generation module configured to generate a first entity representation based on characteristics of the first entity, wherein generating the first entity representation comprises: acquiring first descriptive information for the first entity from the characteristics of the first entity; generating a first text representation representing the first descriptive information based on the first descriptive information; and applying the first text representation to a first mapping unit of a similarity model to map the first text representation to the first entity representation;
a second entity representation generation module configured to generate a second entity representation based on characteristics of the second entity, wherein generating the second entity representation comprises: acquiring second descriptive information for the second entity from the characteristics of the second entity; generating a second text representation representing the second descriptive information based on the second descriptive information; and applying the second text representation to a second mapping unit of the similarity model to map the second text representation to the second entity representation, wherein the first mapping unit and the second mapping unit have the same parameters;
a feature similarity determination module configured to determine feature similarities between features of the first entity and features of the second entity; and
an entity similarity determination module configured to determine an entity similarity between the first entity and the second entity based on the first entity representation, the second entity representation, and the feature similarity, wherein determining the entity similarity comprises: applying the first entity representation, the second entity representation, and the feature similarity to a fully connected unit of the similarity model to generate a fully connected representation; and applying the fully connected representation to a classification unit to generate a probability that the first entity and the second entity are the same entity as the entity similarity.
8. The apparatus of claim 7, wherein obtaining characteristics of the first entity comprises:
extracting features of the first entity from the first entity, the features including at least one of: identification, type, descriptive information, key value pairs, related entities, and multimedia content.
9. The apparatus of claim 7, wherein obtaining characteristics of the second entity comprises:
acquiring the identification of the first entity; and
acquiring the characteristics of the second entity in response to an identification similarity between the identification of the first entity and the identification of a candidate entity exceeding a predetermined threshold.
10. The apparatus of claim 7, wherein determining the entity similarity comprises:
acquiring a probability of whether the first entity has an ambiguous entity; and
an entity similarity between the first entity and the second entity is determined based on the first entity representation, the second entity representation, the feature similarity, and the probability.
11. The apparatus of claim 7, wherein determining the entity similarity comprises:
receiving a selection of characteristics of the first entity and the second entity to be compared;
determining whether the first entity and the second entity are different entities by comparing the selected characteristics of the first entity with the selected characteristics of the second entity; and
determining the entity similarity in response to the first entity and the second entity being different entities.
12. The apparatus of claim 11, wherein the comparison comprises at least one of: exact comparison, edit distance comparison, time comparison, text similarity comparison, co-occurrence comparison, numerical comparison, and type comparison.
13. An electronic device, the device comprising:
one or more processors; and
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1-6.
14. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any of claims 1-6.
CN201910426142.7A 2019-05-21 2019-05-21 Method, apparatus, device and computer readable storage medium for information processing Active CN110134965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910426142.7A CN110134965B (en) 2019-05-21 2019-05-21 Method, apparatus, device and computer readable storage medium for information processing


Publications (2)

Publication Number Publication Date
CN110134965A CN110134965A (en) 2019-08-16
CN110134965B true CN110134965B (en) 2023-08-18

Family

ID=67572367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910426142.7A Active CN110134965B (en) 2019-05-21 2019-05-21 Method, apparatus, device and computer readable storage medium for information processing

Country Status (1)

Country Link
CN (1) CN110134965B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11556845B2 (en) * 2019-08-29 2023-01-17 International Business Machines Corporation System for identifying duplicate parties using entity resolution
US11544477B2 (en) 2019-08-29 2023-01-03 International Business Machines Corporation System for identifying duplicate parties using entity resolution
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN112163109A (en) * 2020-09-24 2021-01-01 中国科学院计算机网络信息中心 Entity disambiguation method and system based on picture
CN112257446A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and readable storage medium

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2664997A2 (en) * 2012-05-18 2013-11-20 Xerox Corporation System and method for resolving named entity coreference
CN105808689A (en) * 2016-03-03 2016-07-27 中国地质大学(武汉) Drainage system entity semantic similarity measurement method based on artificial neural network
US9710544B1 (en) * 2016-05-19 2017-07-18 Quid, Inc. Pivoting from a graph of semantic similarity of documents to a derivative graph of relationships between entities mentioned in the documents
CN107506486A (en) * 2017-09-21 2017-12-22 北京航空航天大学 A kind of relation extending method based on entity link
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN107943860A (en) * 2017-11-08 2018-04-20 北京奇艺世纪科技有限公司 The recognition methods and device that the training method of model, text are intended to
CN107992480A (en) * 2017-12-25 2018-05-04 东软集团股份有限公司 A kind of method, apparatus for realizing entity disambiguation and storage medium, program product
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN108288067A (en) * 2017-09-12 2018-07-17 腾讯科技(深圳)有限公司 Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN108376160A (en) * 2018-02-12 2018-08-07 北京大学 A kind of Chinese knowledge mapping construction method and system
CN108389614A (en) * 2018-03-02 2018-08-10 西安交通大学 The method for building medical image collection of illustrative plates based on image segmentation and convolutional neural networks
CN108399163A (en) * 2018-03-21 2018-08-14 北京理工大学 Bluebeard compound polymerize the text similarity measure with word combination semantic feature
AU2018201708A1 (en) * 2017-03-09 2018-09-27 Tata Consultancy Services Limited Method and system for mapping attributes of entities
CN108681537A (en) * 2018-05-08 2018-10-19 中国人民解放军国防科技大学 Chinese entity linking method based on neural network and word vector
JP6462970B1 (en) * 2018-05-21 2019-01-30 楽天株式会社 Classification device, classification method, generation method, classification program, and generation program
CN109299462A (en) * 2018-09-20 2019-02-01 武汉理工大学 Short text similarity calculating method based on multidimensional convolution feature
JPWO2018042665A1 (en) * 2016-09-05 2019-04-11 富士通株式会社 Information presentation method, apparatus, and program
CN109635297A (en) * 2018-12-11 2019-04-16 湖南星汉数智科技有限公司 A kind of entity disambiguation method, device, computer installation and computer storage medium
CN109710760A (en) * 2018-12-20 2019-05-03 泰康保险集团股份有限公司 Clustering method, device, medium and the electronic equipment of short text

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437868B2 (en) * 2016-03-04 2019-10-08 Microsoft Technology Licensing, Llc Providing images for search queries
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Text method of calibration and device based on artificial intelligence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vectorized representation method for heterogeneous networks based on neural networks; 吴卫祖; 刘利群; 谢冬青; Computer Science, Issue 05, pp. 272-275 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant