CN114860954A - Entity linking method, device, equipment and medium - Google Patents

Entity linking method, device, equipment and medium

Info

Publication number
CN114860954A
Authority
CN
China
Prior art keywords
entity
data
target
elements
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210470501.0A
Other languages
Chinese (zh)
Inventor
王圣
高雅
卫海天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202210470501.0A
Publication of CN114860954A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an entity linking method, device, equipment and medium. The method comprises the following steps: extracting element data from target data, wherein the target data comprises a field to be linked to corresponding entity data; matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database, organized in advance, that contains the entity elements of a plurality of entity data; and determining the target entity to which the target data is to be linked according to the matching result. Because the entity to be linked is determined by matching the elements of the target data against the elements of the entities in the preset element library, the method addresses the low efficiency of manual matching and the large amount of labeled data consumed by model matching.

Description

Entity linking method, device, equipment and medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a method, an apparatus, a device, and a medium for entity linking.
Background
As network data grows exponentially, the network has become one of the largest data repositories, and a large amount of its data is presented in natural language. Natural language itself, however, is highly ambiguous, especially for frequently occurring entities, each of which may correspond to multiple names. If the mentions in network data can be linked to entities in a knowledge base, the natural language on the network can be annotated, which makes the semantic information of network data much easier to understand. However, the entities corresponding to a mention may be varied, manually compiled correspondences are limited, and relying on model matching alone consumes a large amount of labeled training data, so good results are difficult to achieve.
No effective solution has yet been proposed for the problem that the entities corresponding to a mention may be varied, manually compiled correspondences are limited, and relying on model matching alone consumes a large amount of labeled training data, so that good results are difficult to achieve.
Disclosure of Invention
The application provides an entity linking method, device, equipment and medium, which solve, or at least partially solve, the technical problem that the entities corresponding to a mention may be varied, manually compiled correspondences are limited, and relying on model matching alone consumes a large amount of labeled training data, so that good results are difficult to achieve.
According to an aspect of an embodiment of the present application, there is provided an entity linking method, including: extracting element data from target data, wherein the target data comprises a field to be linked to corresponding entity data; matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of entity elements which are arranged in advance and comprise a plurality of entity data; and determining a target entity to be linked of the target data according to the matching result.
Optionally, matching the element data with the entity elements in the preset element library includes: matching a first main element in the element data with the second main element of each entity element, wherein the element data comprises the first main element and a first secondary element, and each entity element comprises a second main element and a second secondary element; when the second main elements of a plurality of entity elements completely match the first main element, determining the entities corresponding to those entity elements as first candidate entities, and matching the first secondary element with the second secondary elements of each first candidate entity; and when the number of first candidate entities whose count of second secondary elements matching the first secondary element is greater than or equal to a first threshold is greater than or equal to zero, determining those first candidate entities as second candidate entities.
Optionally, after the first main element in the element data is matched with the second main element of each entity element, the method further includes: when the second main element of only one entity element completely matches the first main element, determining the entity data corresponding to that entity element as a second candidate entity; and when no entity element has a second main element that completely matches the first main element, determining that no second candidate entity exists.
Optionally, determining the target entity to be linked to the target data according to the matching result includes: when the second candidate entity does not exist, determining that the target data does not have a target entity to be linked; when a second candidate entity exists and the number of the second candidate entities is one, determining the second candidate entity as a target entity to be linked with target data; and when second candidate entities exist and the number of the second candidate entities is more than or equal to two, determining the target entities to be linked by the target data by utilizing the similarity of the target data and each second candidate entity.
Optionally, determining the target entity to which the target data is to be linked by using the similarity between the target data and each second candidate entity includes: converting each second candidate entity into an entity vector by using a preset vector model, and converting target data into a target vector; and respectively determining the similarity between each entity vector and the target vector, and determining a second candidate entity corresponding to the entity vector with the similarity greater than or equal to a preset threshold as a target entity to be linked with the target data.
Optionally, extracting the element data from the target data includes: acquiring an element dictionary, and extracting initial data from the target data by using the element dictionary, wherein the initial data are characters in the target data, and the element dictionary comprises various types of standard element words and synonymous element words; and when the initial data is a synonymous element word, acquiring the standard element word corresponding to the initial data and determining that standard element word as the element data.
Optionally, the method further includes training in the following manner to obtain the preset vector model: acquiring training data and simplifying the training data to obtain training short sentences; inputting the training short sentences into an initial model for training with the TSDAE framework, and outputting a training result; and when the training result indicates that the matching accuracy of the initial model on the training short sentences reaches a second threshold, determining the initial model as the preset vector model.
According to another aspect of the embodiments of the present application, there is also provided an entity linking apparatus, including: the extraction module is used for extracting element data from target data, wherein the target data comprises fields to be linked to corresponding entity data; the matching module is used for matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of the entity elements which are arranged in advance and comprise a plurality of entity data; and the determining module is used for determining a target entity to be linked of the target data according to the matching result.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, and the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor implements the steps of any one of the above methods when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform any of the methods described above.
The technical scheme of the application can be applied to natural language processing based on deep learning technology.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
the application provides an entity linking method, which comprises the following steps: extracting element data from target data, wherein the target data comprises a field to be linked to corresponding entity data; matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of entity elements which are arranged in advance and comprise a plurality of entity data; and determining a target entity to be linked of the target data according to the matching result.
According to the method, the link entity corresponding to the target data is determined by the method of element matching between the target data and the entity of the preset element library, so that the problems that manual matching efficiency is low and a large amount of labeled data is consumed in model matching are solved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to illustrate the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly described below. Those skilled in the art can obviously obtain other drawings from these drawings without any creative effort.
Fig. 1 is a flowchart of an alternative entity linking method provided in an embodiment of the present application;
Fig. 2 is a block diagram of an alternative entity linking apparatus provided in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an alternative electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for convenience of description and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
As network data grows exponentially, the network has become one of the largest data repositories, and a large amount of its data is presented in natural language. Natural language itself, however, is highly ambiguous, especially for frequently occurring entities, each of which may correspond to multiple names. If the mentions in network data can be linked to entities in a knowledge base, the natural language on the network can be annotated, which makes the semantic information of network data much easier to understand. However, the entities corresponding to a mention may be varied, manually compiled correspondences are limited, and relying on model matching alone consumes a large amount of labeled training data, so good results are difficult to achieve.
In some existing schemes, entity linking directly converts the candidate information and the entity information into vectors and calculates the similarity between the vectors. Such a vector representation cannot accurately express the associated entity information, and mentions that are structurally similar to an entity but differ in meaning are often matched to it, so the accuracy is low. Moreover, using vectors directly depends heavily on the model that produces the vector representations and on its training data; good results usually require a large amount of labeled data, which means a large investment in manual annotation.
In other existing schemes, entities are linked by using the description information of the entities in a knowledge base, or information such as their attributes and relationships. These schemes first require a large amount of manual sorting and knowledge base construction to obtain the relevant entity information, and they cannot link entities for which such information has not been obtained or does not exist. For the entity linking task, the information in the knowledge base may also be largely redundant, or insufficient to distinguish different entities, so methods built on this information struggle to achieve good results.
In still other schemes, entity linking is performed by directly calculating the similarity between a mention and an entity. Such schemes only return the entity most related to the mention and cannot accurately indicate whether several entities correspond to it.
In order to solve the problems mentioned in the background, according to an aspect of an embodiment of the present application, there is provided an entity linking method, as shown in fig. 1, including:
Step 101, extracting element data from target data, wherein the target data comprises a field to be linked to corresponding entity data;
Step 103, matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of entity elements which are arranged in advance and comprise a plurality of entity data;
Step 105, determining a target entity to be linked of the target data according to the matching result.
Specifically, entity linking is the task of associating entity mentions that appear in natural language text with the corresponding knowledge graph entities, for example by linking them to entries in a standard database, a knowledge base, a gazetteer, or a Wikipedia page. For example, for the text "Jordan is a famous basketball player in the US NBA", the strings "Jordan", "US" and "NBA" should each be mapped to the corresponding entity in the preset element library.
The difficulty of entity linking is that many different expressions can refer to the same entity: its standard name, its aliases, its abbreviations and so on. Through manual sorting, some commonly used mentions can be associated with an entity, and a mention can then be mapped to the entity by querying a knowledge base. In network data, however, the mentions corresponding to each entity may be varied, and manually compiled correspondences are limited, so good results are difficult to achieve.
Optionally, the preset element library is a library of elements built in advance for a plurality of entities. It includes the main elements and secondary elements of each entity, together with the synonymous elements of each main element and/or each secondary element. Its data may come from different types of entity data collected in advance, or from target data that is added to the library after its identification has been completed.
In this method, the main elements and the secondary elements of the target data (that is, the mention) are matched against the entity elements of the preset element library to obtain the more closely related target entities for linking. This element-based matching improves accuracy and also solves the problem that one mention cannot be linked to multiple entities.
As an alternative embodiment, matching the element data with the entity elements in the preset element library includes: matching a first main element in the element data with the second main element of each entity element, wherein the element data comprises the first main element and a first secondary element, and each entity element comprises a second main element and a second secondary element; when the second main elements of a plurality of entity elements completely match the first main element, determining the entities corresponding to those entity elements as first candidate entities, and matching the first secondary element with the second secondary elements of each first candidate entity; and when the number of first candidate entities whose count of second secondary elements matching the first secondary element is greater than or equal to a first threshold is greater than or equal to zero, determining those first candidate entities as second candidate entities.
Alternatively, the matching method of the element words may be keyword matching, semantic matching, and the like, which is not limited in this application.
Specifically, the main elements and secondary elements in the preset element library are divided in advance. When the element vocabularies of the different types are constructed, element types such as "fruit type", "place name" and "trademark name" are first identified through part-of-speech recognition. These are then filtered against the standard vocabulary, and the element types that appear in the standard vocabulary and can distinguish different standard words are taken as main elements, which yields the element type table of main elements. During matching, this existing element type table is relied on to distinguish main elements from secondary elements.
Because the main elements represent the attributes, properties and word senses of a mention more accurately, the first main element of the target data is first matched with the second main elements of all entities in the preset element library. Each entity whose second main elements all completely match the first main element of the target data is then determined as a first candidate entity, that is, an entity that passes main-element matching and proceeds to the next round of matching.
Specifically, the first threshold may be set in advance according to the number of matched elements.
Optionally, the first secondary element is matched with the second secondary elements of each first candidate entity, the number of element words that the second secondary elements of each first candidate entity share with the first secondary element of the target data is counted, and the first candidate entities with the largest number of matches are determined as second candidate entities. For example, suppose there are 10 first candidate entities and the first secondary element of the target data is matched with the second secondary elements of each of them, with the following result: 4 first candidate entities share 5 element words with the target data, 3 share 4 element words, 2 share 2 element words, and 1 shares 1 element word. The first threshold is then 5, and the 4 first candidate entities with the largest number of matches (5 matching element words) are determined as second candidate entities.
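The two-stage matching described above can be sketched as follows. This is a minimal illustration and not the application's implementation: it assumes that elements are stored as Python sets of element words, that "completely matched" means set equality, and that the first threshold is the largest number of shared secondary element words, as in the worked example. The names `first_candidates`, `second_candidates` and the `library` layout are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical layout: entity name -> {"main": set of words, "secondary": set of words}
Library = Dict[str, Dict[str, Set[str]]]


def first_candidates(target_main: Set[str], library: Library) -> List[str]:
    """Entities whose main elements completely match the target's main elements.

    Assumption: "complete match" is modelled as set equality.
    """
    return [name for name, elems in library.items() if elems["main"] == target_main]


def second_candidates(target_secondary: Set[str], library: Library,
                      candidates: List[str]) -> List[str]:
    """Keep the first candidates that share the most secondary element words."""
    counts = {name: len(target_secondary & library[name]["secondary"])
              for name in candidates}
    if not counts:
        return []
    first_threshold = max(counts.values())  # assumption: threshold = largest count
    return [name for name in candidates if counts[name] >= first_threshold]


# Reproduce the worked example: 10 first candidates sharing 5,5,5,5,4,4,4,2,2,1 words.
pool = ["s1", "s2", "s3", "s4", "s5"]
shared_counts = [5, 5, 5, 5, 4, 4, 4, 2, 2, 1]
library: Library = {
    f"entity_{i}": {"main": {"peach"}, "secondary": set(pool[:n])}
    for i, n in enumerate(shared_counts)
}
library["entity_other"] = {"main": {"apple"}, "secondary": set(pool)}

first = first_candidates({"peach"}, library)
second = second_candidates(set(pool), library, first)
print(len(first), second)
```

With this data the ten "peach" entities become first candidates and the four entities sharing five secondary element words become second candidates. Keyword equality is used here only for brevity; the text allows semantic matching as well.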
As an alternative embodiment, after the first main element in the element data is matched with the second main element of each entity element, the method further includes: when the second main element of only one entity element completely matches the first main element, determining the entity data corresponding to that entity element as a second candidate entity; and when no entity element has a second main element that completely matches the first main element, determining that no second candidate entity exists.
Optionally, when no entity element has a second main element that completely matches the first main element, it may be determined that no second candidate entity exists, that is, that the target data has no target entity to be linked. If the target entity closest to the target data is nevertheless desired, the entity corresponding to the entity element with the largest number of matches with the first main element may be determined as the target entity.
Optionally, when the second main element of only one entity element completely matches the first main element, the entity data corresponding to that entity element may be directly determined as the second candidate entity, that is, as the target entity to be linked with the target data.
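The three possible outcomes of main-element matching, and the optional "closest entity" fallback, might be expressed as in the sketch below. It again assumes set-valued elements and set equality for a complete match; the status labels and the function names `match_main` and `closest_by_main` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Library = Dict[str, Dict[str, Set[str]]]


def match_main(target_main: Set[str], library: Library) -> Tuple[str, List[str]]:
    """Classify the result of main-element matching."""
    exact = [name for name, elems in library.items() if elems["main"] == target_main]
    if len(exact) == 1:
        return "second_candidate", exact   # a single complete match is linked directly
    if exact:
        return "first_candidates", exact   # several matches go on to secondary matching
    return "no_candidate", []              # nothing matches completely


def closest_by_main(target_main: Set[str], library: Library) -> Optional[str]:
    """Optional fallback: the entity sharing the most main element words, if any."""
    best_name, best_overlap = None, 0
    for name, elems in library.items():
        overlap = len(target_main & elems["main"])
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


library: Library = {
    "entity_a": {"main": {"peach", "pomegranate"}},
    "entity_b": {"main": {"peach"}},
}
status, names = match_main({"peach", "pomegranate"}, library)
fallback = closest_by_main({"peach", "apple"}, library)
print(status, names, fallback)
```

In the fallback, ties are resolved in favour of the entity encountered first; the text does not say how ties should be handled, so this is only one possible choice.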
As an alternative embodiment, determining the target entity to be linked with the target data according to the matching result includes: when the second candidate entity does not exist, determining that the target data does not have a target entity to be linked; when a second candidate entity exists and the number of the second candidate entities is one, determining the second candidate entity as a target entity to be linked with target data; and when second candidate entities exist and the number of the second candidate entities is more than or equal to two, determining the target entities to be linked by the target data by utilizing the similarity of the target data and each second candidate entity.
After the first main element and the first secondary element of the target data have been matched against the entity elements in the preset element library, all entities that finally satisfy the matching conditions, however many there are (provided the number is not zero), are determined as second candidate entities.
When the second candidate entity does not exist, determining that the target data does not have a target entity to be linked, and directly returning a result that the target entity does not exist to the user; and when the second candidate entity exists and the number of the second candidate entities is one, determining the second candidate entity as a target entity to be linked with the target data, and directly returning the target entity as a link result to the user.
When there are second candidate entities and the number of the second candidate entities is greater than or equal to two, we will continue to determine the target entity from the plurality of second candidate entities, as further explained below with respect to the method for determining the target entity.
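The decision on the number of second candidate entities can be summarised in a short sketch. The similarity step is passed in as a callable because it is described separately; `resolve_targets` and the stand-in `keep_first_two` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def resolve_targets(second_candidates: Sequence[str],
                    rank_by_similarity: Callable[[Sequence[str]], List[str]]) -> List[str]:
    """Choose the target entities from the second candidate entities."""
    if not second_candidates:
        return []                            # no target entity to link
    if len(second_candidates) == 1:
        return [second_candidates[0]]        # the only candidate is the target entity
    return rank_by_similarity(second_candidates)  # two or more: use vector similarity


def keep_first_two(candidates: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for the similarity-based selection."""
    return list(candidates)[:2]


print(resolve_targets([], keep_first_two))
print(resolve_targets(["a"], keep_first_two))
print(resolve_targets(["a", "b", "c"], keep_first_two))
```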
As an alternative embodiment, determining the target entity to which the target data is to be linked by using the similarity between the target data and each second candidate entity includes: converting each second candidate entity into an entity vector by using a preset vector model, and converting target data into a target vector; and respectively determining the similarity between each entity vector and the target vector, and determining a second candidate entity corresponding to the entity vector with the similarity greater than or equal to a preset threshold as a target entity to be linked with the target data.
Each second candidate entity is input into the preset vector model to obtain an entity vector. Similarly, the target data is input into the preset vector model to obtain a target vector, and the cosine similarity between the target vector and each entity vector is then calculated.
Specifically, the formula for calculating the cosine similarity is as follows:
$$P_{(1,i)} = \cos\theta = \frac{x_1 x_i + y_1 y_i}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_i^2 + y_i^2}}$$
wherein $P_{(1,i)}$ represents the cosine value of the target vector and the i-th entity vector, the target vector is $(x_1, y_1)$, and the i-th entity vector is $(x_i, y_i)$.
Cosine values lie in the range [-1, 1]. The closer the value is to 1, the closer the directions of the two vectors are (that is, the more similar they are); the closer it is to -1, the more opposite their directions are; a value close to 0 means that the two vectors are nearly orthogonal.
After the cosine values of the target vector and the entity vectors are determined, they are sorted in descending order, and the preset threshold is then determined according to the number of target entities to be obtained. For example, if 3 target entities are to be obtained, the cosine value ranked third is taken directly as the preset threshold, so that the 3 entity vectors whose similarity is greater than or equal to the preset threshold can be found and the corresponding 3 target entities determined.
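The ranking procedure above can be illustrated with a small sketch in plain Python. The vectors are made-up two-dimensional examples rather than outputs of the preset vector model, and `cosine` and `top_k_entities` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_entities(target_vec: Sequence[float],
                   entity_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Use the k-th largest cosine value as the preset threshold (k >= 1)."""
    scores = {name: cosine(target_vec, vec) for name, vec in entity_vecs.items()}
    ranked = sorted(scores.values(), reverse=True)
    if not ranked:
        return []
    threshold = ranked[min(k, len(ranked)) - 1]
    return [name for name, score in scores.items() if score >= threshold]


entity_vecs = {"a": (1.0, 0.0), "b": (1.0, 1.0), "c": (0.0, 1.0), "d": (-1.0, 0.0)}
selected = top_k_entities((1.0, 0.0), entity_vecs, k=2)
print(selected)
```

Because the threshold is a value rather than a rank, entities tied with the k-th cosine value are all kept, so the result can contain more than k entities.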
Alternatively, one target data may match a plurality of target entities having higher similarity.
According to the method, a plurality of second candidate entities are obtained through element matching, and then the similarity among the entities is determined through vector calculation, so that the problem that some mentions with similar structures but different meanings are matched with the entities is solved.
Optionally, the present application further provides an embodiment in which, when the element-based matching result indicates that no entity element matches the first main element or the first secondary element of the target data, that is, neither a first candidate entity nor a second candidate entity exists, the entities to be matched are converted directly into entity vectors by the preset vector model, the target data is then matched against each entity vector on the basis of the vectors, and the matching result is returned to the user.
According to the embodiment provided by the application, the recall rate is also ensured while the accuracy is improved based on a method combining element matching and vector calculation.
As an alternative embodiment, extracting the element data from the target data includes: acquiring an element dictionary, and extracting initial data from the target data by using the element dictionary, wherein the initial data are characters in the target data, and the element dictionary comprises various types of standard element words and synonymous element words; and when the initial data is a synonymous element word, acquiring the standard element word corresponding to the initial data and determining that standard element word as the element data.
The element dictionary and the preset element library are preset knowledge libraries comprising a plurality of entity elements, wherein the element dictionary further comprises synonymous element words of the element words.
Illustratively, when the target data is "intense peach and pomegranate flavor", the extracted initial data includes "intense", "peach and" pomegranate ", and the" intense "and" pomegranate "are found to be standard element words and the" peach "does not belong to the standard element word (which is a synonym) through matching with the element dictionary, then the standard element word corresponding to the" peach "is inquired in the element dictionary, and the inquired standard element word" peach "and the previous" intense "and" pomegranate "are determined together to be the element data of the target data.
According to the method and the device, the standard element words are set and the synonym element words are utilized to search the standard element words, so that the element words can be more unified, and interference factors of follow-up element matching are reduced.
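The dictionary lookup described above may be sketched as follows. The dictionary contents (including the synonym "honey peach"), the longest-match-first strategy and the function name `extract_elements` are assumptions made for this example; the application does not specify how the dictionary is stored or scanned.

```python
# Hypothetical element dictionary: standard words plus a synonym -> standard map.
STANDARD_WORDS = {"intense", "peach", "pomegranate"}
SYNONYMS = {"honey peach": "peach", "strong": "intense"}


def extract_elements(text):
    """Return the standard element words found in text, in order of appearance."""
    # Scan longer dictionary words first so "honey peach" wins over "peach".
    vocabulary = sorted(STANDARD_WORDS | set(SYNONYMS), key=len, reverse=True)
    remaining = text.lower()
    found = []
    for word in vocabulary:
        pos = remaining.find(word)
        if pos == -1:
            continue
        # Map a synonym to its standard word; standard words map to themselves.
        found.append((pos, SYNONYMS.get(word, word)))
        # Blank out the matched span (same length) so shorter words cannot re-match it.
        remaining = remaining[:pos] + " " * len(word) + remaining[pos + len(word):]
    elements = []
    for _, standard in sorted(found):
        if standard not in elements:
            elements.append(standard)
    return elements
```

Plain substring search is used here because the original text is Chinese, where words are not space-delimited; for space-delimited languages a tokenizer would likely be a better fit.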
As an optional embodiment, the method further includes training in the following manner to obtain the preset vector model: acquiring training data and simplifying the training data to obtain training short sentences; inputting the training short sentences into an initial model by using the TSDAE framework for training, and outputting a training result; and, when the training result indicates that the matching accuracy of the initial model on the training short sentences reaches a second threshold, determining the initial model as the preset vector model.
Optionally, simplifying the training data includes normalization and short-sentence segmentation. Normalization includes simplified/traditional Chinese conversion, full-width/half-width conversion, noise filtering, and the like; short-sentence segmentation splits each sentence into short sentences. In entity linking, entities are generally short, so processing the training data into short sentences can improve the entity linking effect.
The second threshold is a presettable value less than or equal to 100%.
Optionally, when the training result indicates that the matching accuracy of the initial model on the training short sentences does not reach the second threshold, the model parameters are adjusted and the model is trained again with the training data, until the matching accuracy of the initial model on the training short sentences reaches the second threshold.
Optionally, the initial model provided by the present application is the BERT-Base-Chinese pre-trained model.
The TSDAE (Transformer-based Sequential Denoising Auto-Encoder) framework trains sentence vectors in an unsupervised manner, using sentences as training data. During training, the TSDAE framework trains on the data in a self-encoding manner without manual labeling, which reduces manual workload.
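The normalization and short-sentence segmentation may be sketched with the Python standard library as follows. The choice of Unicode NFKC for full-width/half-width conversion, the definition of "noise" as control and format characters, and the delimiter set are assumptions of this example. Simplified/traditional conversion is omitted because it requires an external mapping table.

```python
import re
import unicodedata


def normalize(sentence):
    """Fold full-width forms to half-width, collapse whitespace, drop control characters."""
    # NFKC maps full-width letters, digits and punctuation to their ASCII forms.
    text = unicodedata.normalize("NFKC", sentence)
    # Collapse whitespace first so newlines and tabs become single spaces.
    text = re.sub(r"\s+", " ", text)
    # Assumed noise filter: remove remaining characters in Unicode category C*.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    return text.strip()


def split_short_sentences(sentence):
    """Split a normalized sentence into short sentences at common delimiters."""
    parts = re.split(r"[,;!?。、]", normalize(sentence))
    return [part.strip() for part in parts if part.strip()]
```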
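The train-until-threshold loop may be sketched generically as follows. The callables `train_step` and `evaluate`, and the `max_rounds` safety cap, are assumptions of this example; the application only states that training repeats until the second threshold is reached.

```python
def train_until_threshold(model, train_step, evaluate, second_threshold, max_rounds=100):
    """Repeat training rounds until evaluate(model) reaches second_threshold.

    train_step(model) returns the model after one round of parameter updates;
    evaluate(model) returns its matching accuracy in [0, 1].
    max_rounds is a cap added in this sketch to guarantee termination.
    Returns (trained model, accuracy, number of rounds used).
    """
    for round_no in range(1, max_rounds + 1):
        model = train_step(model)
        accuracy = evaluate(model)
        if accuracy >= second_threshold:
            return model, accuracy, round_no
    raise RuntimeError("second threshold not reached within max_rounds")
```

As a toy usage, a "model" represented by a single number that gains 0.25 accuracy per round reaches a threshold of 0.75 after three rounds.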
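The self-encoding setup can be illustrated by how denoising training pairs are built: each sentence is corrupted, and the model is trained to reconstruct the original from the corrupted input, so no labels are needed. The sketch below only constructs such pairs; it does not train a model. Token deletion as the noise type, the default ratio, character-level tokens and the function names are assumptions of this example.

```python
import random


def delete_noise(tokens, ratio, rng):
    """Randomly drop tokens with probability `ratio`, keeping at least one."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= ratio]
    if not kept:
        kept = [rng.choice(tokens)]
    return kept


def make_denoising_pairs(sentences, ratio=0.6, seed=0):
    """Build (noisy input, original sentence) pairs from unlabeled sentences."""
    rng = random.Random(seed)
    pairs = []
    for sentence in sentences:
        # Character-level tokens are an assumed stand-in for a real tokenizer.
        noisy = "".join(delete_noise(list(sentence), ratio, rng))
        pairs.append((noisy, sentence))
    return pairs
```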
The application provides an entity linking method, which comprises the following steps: extracting element data from target data, wherein the target data comprises a field to be linked to corresponding entity data; matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of entity elements which are arranged in advance and comprise a plurality of entity data; and determining a target entity to be linked of the target data according to the matching result.
In this method, the entity to be linked for the target data is determined by matching elements of the target data against entity elements in the preset element library, which addresses the low efficiency of manual matching and the large amount of labeled data required by model-based matching.
According to another aspect of the embodiments of the present application, there is also provided an entity linking apparatus, as shown in fig. 2, including:
an extracting module 202, configured to extract element data from target data, where the target data includes a field to be linked to corresponding entity data;
the matching module 204 is configured to match the element data with entity elements in a preset element library to obtain a matching result, where the preset element library is a database of entity elements that are sorted in advance and include a plurality of entity data;
and the determining module 206 is configured to determine a target entity to be linked with the target data according to the matching result.
It should be noted that the extracting module 202 in this embodiment may be configured to execute step 101 in this embodiment, the matching module 204 in this embodiment may be configured to execute step 103 in this embodiment, and the determining module 206 in this embodiment may be configured to execute step 105 in this embodiment.
Optionally, the matching module 204 is further configured to match the first main elements in the element data with the second main elements of the entity elements respectively; in the case that all second main elements of the plurality of entity elements are completely matched with the first main elements, determining entities corresponding to the plurality of entity elements as first candidate entities, and respectively matching the first secondary elements with second secondary elements of the plurality of first candidate entities, wherein the element data comprises the first main elements and the first secondary elements, and the entity elements comprise the second main elements and the second secondary elements; in the case where the number of first candidate entities for which the number of second minor elements matching the first minor elements is greater than or equal to the first threshold is greater than or equal to zero, the first candidate entities for which the number of second minor elements matching the first minor elements is greater than or equal to the first threshold are determined as the second candidate entities.
Optionally, the matching module 204 is further configured to: when the second main element of only one entity element completely matches the first main element, determine the entity data corresponding to that entity element as a second candidate entity; and when no entity element has a second main element that completely matches the first main element, determine that no second candidate entity exists.
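The main-element and secondary-element matching performed by the matching module may be sketched as follows. Treating "completely matched" as set equality, the dictionary layout of the element library and the function name `find_second_candidates` are assumptions of this example.

```python
def find_second_candidates(main, secondary, library, first_threshold):
    """Return the names of second candidate entities for the given element data.

    main, secondary: first main / first secondary elements of the target data.
    library: maps an entity name to {"main": [...], "secondary": [...]}.
    """
    main, secondary = set(main), set(secondary)
    # Entities whose main elements completely match those of the target data.
    first = [name for name, elems in library.items() if set(elems["main"]) == main]
    # Zero matches: no candidate. Exactly one match: it is the second candidate.
    if len(first) <= 1:
        return first
    # Several matches: keep those sharing enough secondary elements.
    return [
        name
        for name in first
        if len(set(library[name]["secondary"]) & secondary) >= first_threshold
    ]
```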
Optionally, the determining module 206 further includes:
the first determining submodule is used for determining that the target entity to be linked does not exist in the target data when the second candidate entity does not exist;
the second determining submodule is used for determining the second candidate entity as a target entity to be linked with the target data when the second candidate entity exists and the number of the second candidate entities is one;
and the third determining submodule is used for determining the target entities to be linked of the target data by utilizing the similarity of the target data and each second candidate entity when the second candidate entities exist and the number of the second candidate entities is more than or equal to two.
Optionally, the third determining sub-module is further configured to convert each second candidate entity into an entity vector by using a preset vector model, and convert the target data into a target vector; and respectively determining the similarity between each entity vector and the target vector, and determining a second candidate entity corresponding to the entity vector with the similarity greater than or equal to a preset threshold as a target entity to be linked with the target data.
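The three determining submodules amount to a branch on the number of second candidate entities, which may be sketched as follows. The `similarity` callable stands in for the vector-model comparison, and the word-overlap measure used in the toy example is purely an assumption for demonstration; it is not the preset vector model.

```python
def choose_targets(target, candidates, similarity, preset_threshold):
    """Pick the target entities to link, depending on how many candidates exist."""
    if not candidates:
        return []                      # no second candidate: nothing to link
    if len(candidates) == 1:
        return list(candidates)        # a single candidate is linked directly
    # Two or more candidates: keep those similar enough to the target data.
    return [c for c in candidates if similarity(target, c) >= preset_threshold]


def word_overlap(a, b):
    """Toy similarity (Jaccard overlap of words), used only for demonstration."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```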
Optionally, the extracting module 202 is further configured to obtain an element dictionary, and extract initial data from the target data by using the element dictionary, where the initial data are characters in the target data, and the element dictionary includes various types of standard element words and synonymous element words; and when the initial data belongs to the synonymous element words, acquire the standard element word corresponding to the initial data and determine the standard element word as the element data.
Optionally, the entity linking apparatus further includes a training module, configured to train in the following manner to obtain the preset vector model: acquiring training data and simplifying the training data to obtain training short sentences; inputting the training short sentences into an initial model by using the TSDAE framework for training, and outputting a training result; and, when the training result indicates that the matching accuracy of the initial model on the training short sentences reaches a second threshold, determining the initial model as the preset vector model.
It should be noted here that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments.
According to another aspect of the embodiments of the present application, as shown in fig. 3, the present application provides an electronic device, which includes a memory 31, a processor 32, a communication interface 33, and a communication bus 34, wherein a computer program operable on the processor 32 is stored in the memory 31, the memory 31 and the processor 32 communicate through the communication bus 34 and the communication interface 33, and the steps of the method are implemented when the processor 32 executes the computer program.
The memory and the processor in the electronic device communicate with the communication interface through the communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
According to another aspect of embodiments of the present application, there is provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the steps of any of the methods described above.
Optionally, in an embodiment of the present application, a computer readable medium is configured to store program code for the processor to perform the above method steps.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An entity linking method, comprising:
extracting element data from target data, wherein the target data comprises fields to be linked to corresponding entity data;
matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of entity elements which are arranged in advance and comprise a plurality of entity data;
and determining a target entity to be linked of the target data according to the matching result.
2. The method according to claim 1, wherein the matching the element data with the entity elements in the preset element library comprises:
matching the first main element in the element data with the second main element of each entity element respectively;
in the case that all the second main elements of the entity elements are completely matched with the first main element, determining the entity corresponding to the entity elements as a first candidate entity, and respectively matching a first secondary element with a second secondary element of the first candidate entity, wherein the element data comprises the first main elements and the first secondary element, and the entity elements comprise the second main elements and the second secondary element;
determining the first candidate entity having the number of the second secondary elements matched with the first secondary element larger than or equal to a first threshold as a second candidate entity, in case the number of the first candidate entity having the number of the second secondary elements matched with the first secondary element larger than or equal to the first threshold is larger than or equal to zero.
3. The method according to claim 2, wherein after the matching of the first main element in the element data with the second main element of each entity element, the method further comprises:
determining the entity data corresponding to the entity element as the second candidate entity when the second main element of only one entity element completely matches the first main element;
determining that the second candidate entity does not exist when no entity element has a second main element that completely matches the first main element.
4. The method of claim 3, wherein the determining the target entity to which the target data is to be linked according to the matching result comprises:
when the second candidate entity does not exist, determining that the target entity to be linked does not exist in the target data;
when the second candidate entity exists and the number of the second candidate entities is one, determining the second candidate entity as the target entity to be linked by the target data;
and when the second candidate entities exist and the number of the second candidate entities is more than or equal to two, determining the target entity to be linked by the target data by utilizing the similarity of the target data and each second candidate entity.
5. The method of claim 4, wherein the determining the target entity to which the target data is to be linked by using the similarity between the target data and each of the second candidate entities comprises:
converting each second candidate entity into an entity vector by using a preset vector model, and converting the target data into a target vector;
and respectively determining the similarity between each entity vector and the target vector, and determining the second candidate entity corresponding to the entity vector with the similarity greater than or equal to a preset threshold as the target entity to be linked with the target data.
6. The method of claim 1, wherein extracting element data from the target data comprises:
acquiring an element dictionary, and extracting initial data from the target data by using the element dictionary, wherein the initial data are characters in the target data, and the element dictionary comprises various types of standard element words and synonymous element words;
and under the condition that the initial data belongs to the standard element words, directly determining the initial data as the element data, and under the condition that the initial data belongs to the synonymous element words, acquiring the standard element words corresponding to the initial data, and determining the standard element words as the element data.
7. The method of claim 5, further comprising training in the following manner to obtain the preset vector model:
acquiring training data and simplifying the training data to obtain a training short sentence;
inputting the training short sentence into an initial model by using the TSDAE framework for training, and outputting a training result;
and under the condition that the training result indicates that the matching accuracy of the initial model to the training short sentence reaches a second threshold value, determining the initial model as the preset vector model.
8. An entity linking apparatus, comprising:
the extraction module is used for extracting element data from target data, wherein the target data comprises fields to be linked to corresponding entity data;
the matching module is used for matching the element data with entity elements in a preset element library to obtain a matching result, wherein the preset element library is a database of the entity elements which are arranged in advance and comprise a plurality of entity data;
and the determining module is used for determining the target entity to be linked of the target data according to the matching result.
9. An electronic device comprising a memory, a processor, a communication interface and a communication bus, wherein the memory stores a computer program operable on the processor, and the memory and the processor communicate via the communication bus and the communication interface, wherein the processor implements the steps of the method according to any of the claims 1 to 7 when executing the computer program.
10. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 7.
CN202210470501.0A 2022-04-28 2022-04-28 Entity linking method, device, equipment and medium Pending CN114860954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210470501.0A CN114860954A (en) 2022-04-28 2022-04-28 Entity linking method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470501.0A CN114860954A (en) 2022-04-28 2022-04-28 Entity linking method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114860954A true CN114860954A (en) 2022-08-05

Family

ID=82635206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470501.0A Pending CN114860954A (en) 2022-04-28 2022-04-28 Entity linking method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114860954A (en)

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN106649818B (en) Application search intention identification method and device, application search method and server
US20200081899A1 (en) Automated database schema matching
AU2020372605B2 (en) Mapping natural language utterances to operations over a knowledge graph
CN107430612A (en) Search document of the description to the solution of computational problem
CN106407113B (en) A kind of bug localization method based on the library Stack Overflow and commit
CN110222045A (en) A kind of data sheet acquisition methods, device and computer equipment, storage medium
CN109388743B (en) Language model determining method and device
US11017002B2 (en) Description matching for application program interface mashup generation
CN111611807A (en) Keyword extraction method and device based on neural network and electronic equipment
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
CN112270188A (en) Questioning type analysis path recommendation method, system and storage medium
CN107665221A (en) The sorting technique and device of keyword
CN111506595B (en) Data query method, system and related equipment
CN114860870A (en) Text error correction method and device
CN104572632A (en) Method for determining translation direction of word with proper noun translation
CN111460808B (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN113139383A (en) Document sorting method, system, electronic equipment and storage medium
CN114860954A (en) Entity linking method, device, equipment and medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN113128231A (en) Data quality inspection method and device, storage medium and electronic equipment
CN110175241B (en) Question and answer library construction method and device, electronic equipment and computer readable medium
CN110162614B (en) Question information extraction method and device, electronic equipment and storage medium
Hidayat et al. Comparison of the use of bigrams and stopword removal for classification using naive bayes (case study on sentiment analysis of by. u internet users)
CN112925961A (en) Intelligent question and answer method and device based on enterprise entity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination