CN113360665A - Method and system for associating knowledge base document and knowledge graph entity - Google Patents
Method and system for associating knowledge base document and knowledge graph entity
- Publication number
- CN113360665A CN202110601045.4A
- Authority
- CN
- China
- Prior art keywords
- entity
- candidate
- text
- list
- entities
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The invention relates to a method and a system for associating knowledge base documents with knowledge graph entities. The method comprises the following steps: performing entity recognition on a text to obtain an entity list; searching a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity; calculating, for each candidate entity, the similarity between first feature information of the text and second feature information of the candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity; and associating the entity with the candidate entity whose total similarity is the largest and exceeds a threshold. The method and the system can effectively improve the accuracy and recall of entity association.
Description
Technical Field
The invention relates to the field of knowledge graphs, in particular to a method and a system for associating knowledge base documents and knowledge graph entities.
Background
With the rise and rapid development of the internet, knowledge engineering, and artificial intelligence, text data has grown explosively, and people urgently need efficient, intelligent text analysis technology to understand the real meaning of the data and to help individuals and organizations quickly obtain useful information. Entity association is one such text analysis technology: it associates words or phrases that appear in text data, treated as entities, with the corresponding entity IDs in a knowledge graph library. Entity association therefore helps people understand the real meaning of text data and makes its semantic information much easier to grasp.
The main current approach to entity association calculates the similarity between the context semantic vector of an entity in the text and the attribute vectors of the candidate entities in the graph, ranks the similarity values, and associates the text entity with the knowledge base entity if the similarity exceeds a threshold; otherwise no association is made. One problem with this approach is that if the context describing an entity name in a knowledge base document has a low correlation with the entity attributes in the graph, but a high correlation with other information, such as the nodes reached through one-degree or two-degree relationships, the entity name cannot be associated with the entity ID in the graph, which lowers both the accuracy and the recall of entity association.
For example, consider the following text:
A few days ago, the famous singer Han Hong appeared together with Yao Ming and Zhang Ziyi at the launch event for the Tibet public welfare campaign that Han Hong initiated. It is reported that at the beginning of next month, Han Hong will join as many as a hundred caring supporters and medical experts to form a charity convoy of rescue volunteers for a 20-day public welfare trip.
Entity recognition on this text identifies three person names: Han Hong, Yao Ming, and Zhang Ziyi. These three names are the entity names to be linked. The context surrounding Zhang Ziyi is entirely about public welfare, whereas the attribute description of the entity Zhang Ziyi stored in the knowledge graph is entirely about film and television. When the similarity between the semantic vector and the entity attributes is calculated, the score is very low and Zhang Ziyi cannot be linked. However, Zhang Ziyi has a one-degree relationship node, "charity ambassador", so the Zhang Ziyi in the article can be linked to the Zhang Ziyi in the knowledge base once that node is included in the calculation.
Disclosure of Invention
To address the above technical problems, the invention provides a method and a system for associating knowledge base documents with knowledge graph entities, which can improve the accuracy and recall of entity association.
The technical solution to the above technical problems is as follows:
In a first aspect, the present invention provides a method for associating knowledge base documents with knowledge graph entities, comprising:
performing entity recognition on the text to obtain an entity list;
searching a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
calculating, for each candidate entity, the similarity between the first feature information of the text and the second feature information of the candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and associating the entity with the candidate entity whose total similarity is the largest and exceeds the threshold.
The beneficial effects of the invention are as follows:
Because the similarity calculation makes full use of the feature information of the text and the feature information of both the candidate entities found for an entity in the text and their associated nodes, the accuracy and recall of entity association are effectively improved.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the positions of each entity in the entity list are queried in the knowledge base document to obtain a position list corresponding to the entity.
Further, the formatting of the entity is emphasized at the positions of the knowledge base document given in the position list.
In a second aspect, the present invention further provides a system for associating knowledge base documents with knowledge graph entities, comprising:
an entity recognition module, configured to perform entity recognition on the text to obtain an entity list;
a candidate entity search module, configured to search a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
a similarity calculation module, configured to calculate, for each candidate entity, the similarity between the first feature information of the text and the second feature information of the candidate entity and of at least one associated node of the candidate entity, and to weight the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and an entity association module, configured to associate the entity with the candidate entity whose total similarity is the largest and exceeds the threshold.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the system also includes:
a position query module, configured to query the positions of each entity in the entity list in the knowledge base document to obtain a position list corresponding to the entity.
Further, the system also includes:
a format processing module, configured to emphasize the formatting of the entity at the positions of the knowledge base document given in the position list.
In a third aspect, the present invention also provides an electronic device, including a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the processor executes the machine-readable instructions to perform the steps of the method.
In a fourth aspect, the present invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program performs the steps of the method.
Drawings
FIG. 1 is a flow chart of a method for associating knowledge-base documents with knowledge-graph entities according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for associating knowledge base documents with knowledge-graph entities according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
FIG. 1 is a flowchart of a method for associating knowledge base documents with knowledge graph entities according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
S1, performing entity recognition on the text to obtain an entity list;
Specifically, the text is a passage from a knowledge base document. A CRF (conditional random field) entity recognition model is applied to the knowledge base document to recognize entities such as person names and object names, yielding the entity list of the text.
S2, searching a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
As known to those skilled in the art, a knowledge graph is composed of entities (nodes) and entity relationships (edges). Entities carry descriptive information such as names and attributes. Entity relationships also have names and attributes, and they are directed.
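The candidate lookup in step S2 can be sketched as follows. This is a minimal illustration that assumes a small in-memory graph; the class and field names (`Node`, `KnowledgeGraph`, `find_candidates`) and the sample nodes are hypothetical and are not taken from the patent, which does not specify a storage engine.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    doc_id: str                                   # entity ID in the graph
    name: str
    attributes: dict = field(default_factory=dict)


class KnowledgeGraph:
    """Toy graph: nodes with names and attributes, plus directed, named edges."""

    def __init__(self):
        self.nodes = {}                       # doc_id -> Node
        self.by_name = defaultdict(list)      # name -> [doc_id, ...]
        self.edges = defaultdict(list)        # source doc_id -> [(relation, target doc_id)]

    def add_node(self, node):
        self.nodes[node.doc_id] = node
        self.by_name[node.name].append(node.doc_id)

    def add_edge(self, source, relation, target):
        # Edges are directed: source --relation--> target.
        self.edges[source].append((relation, target))

    def find_candidates(self, entity_name):
        """Return every node whose name equals the recognized entity name."""
        return [self.nodes[doc_id] for doc_id in self.by_name.get(entity_name, [])]


# Hypothetical sample data: two nodes share a name, so both become candidates.
kg = KnowledgeGraph()
kg.add_node(Node("e1", "Zhang Ziyi", {"field": "film and television"}))
kg.add_node(Node("e2", "Zhang Ziyi", {"field": "placeholder namesake"}))
kg.add_node(Node("e3", "charity ambassador"))
kg.add_edge("e1", "holds_title", "e3")

candidates = kg.find_candidates("Zhang Ziyi")
```

Exact name matching is the simplest possible lookup; a real graph library might also match aliases, which this sketch does not attempt.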
S3, calculating, for each candidate entity, the similarity between the first feature information of the text and the second feature information of the candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
Specifically, the first feature information may be the sum of the word vectors of the feature words of the text in which the entity is located, as described below:
The text is segmented into words and the term frequency of each word is calculated (the number of occurrences of the word divided by the total number of words in the document). The words are ranked from high to low by term frequency, and the top n words are taken as the feature words of the text. The word vectors of the n feature words are then summed:
$$\mathrm{textVec} = \sum_{i=1}^{n} V_i$$
where $V_i$ denotes the word vector of the i-th feature word and textVec denotes the summary vector of the text to be processed, that is, the first feature information. The word vectors can be obtained with FastText (a fast text classification algorithm) pre-trained on Chinese encyclopedia data; the word vectors have 300 dimensions, here and below.
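The construction of textVec described above can be sketched in a few lines. The example assumes the text has already been segmented into tokens and uses made-up 3-dimensional vectors in place of the 300-dimensional FastText vectors; the function names and the alphabetical tie-breaking rule are illustrative assumptions.

```python
from collections import Counter


def top_n_feature_words(tokens, n):
    """Rank words by term frequency (count / total words) and keep the top n."""
    total = len(tokens)
    tf = {word: count / total for word, count in Counter(tokens).items()}
    # Sort by descending frequency; break ties alphabetically so the result is stable.
    return sorted(tf, key=lambda word: (-tf[word], word))[:n]


def sum_word_vectors(words, word_vectors, dim):
    """Add the vectors of the given words component by component."""
    text_vec = [0.0] * dim
    for word in words:
        # Words without a vector contribute nothing (an assumption of this sketch).
        for i, value in enumerate(word_vectors.get(word, [0.0] * dim)):
            text_vec[i] += value
    return text_vec


# Toy tokens and toy vectors, purely for illustration.
tokens = ["charity", "tibet", "charity", "singer", "tibet", "charity"]
toy_vectors = {
    "charity": [1.0, 0.0, 2.0],
    "tibet": [0.0, 1.0, 1.0],
    "singer": [3.0, 0.0, 0.0],
}

feature_words = top_n_feature_words(tokens, n=2)
text_vec = sum_word_vectors(feature_words, toy_vectors, dim=3)
```

With n = 2 the two most frequent words are kept and the least frequent one is dropped before the vectors are summed.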
The associated nodes are nodes in the knowledge graph that are related to the candidate entity, such as one-degree relationship nodes and two-degree relationship nodes. The second feature information may be the sum of the word vectors of node names and attributes. The similarities between the first feature information of the text and the second feature information of the candidate entity node, of its one-degree relationship nodes, and of its two-degree relationship nodes are calculated separately and then combined in a weighted sum to give the total similarity of the candidate entity. The steps are as follows:
1) Sentence-level similarity. The knowledge base document is split into sentences at full stops. The sentence containing the entity is segmented into words, and the word vectors of those words are obtained and summed to form senVec. The word vectors of the candidate entity node's name and attributes are obtained and summed to form attrVec. The cosine similarity of senVec and attrVec is then calculated:
$$\mathrm{senScore} = \cos(\mathrm{senVec}, \mathrm{attrVec}) = \frac{\mathrm{senVec} \cdot \mathrm{attrVec}}{\|\mathrm{senVec}\| \, \|\mathrm{attrVec}\|}$$
where $\|x\|$ denotes the norm of the vector x. This gives the score senScore.
2) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its one-degree relationship nodes are obtained and summed to form firstRelVec. The similarity between firstRelVec and the text summary vector textVec is calculated to obtain the score firstRelScore.
3) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its two-degree relationship nodes are obtained and summed to form secondRelVec. The similarity between secondRelVec and the text summary vector textVec is calculated to obtain the score secondRelScore.
4) For each entity, the scores of each candidate node found are multiplied by their respective weights, which are configurable, and then summed.
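The three scores in steps 1) to 3) can be sketched as follows. The vectors are made-up 3-dimensional toy values chosen so that the candidate's own attributes do not match the sentence while its one-degree neighbourhood matches the text summary; the helper names `cosine` and `add_vectors` are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def add_vectors(*vectors):
    """Component-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


# Toy inputs (hypothetical values, not real word vectors).
sen_vec = [1.0, 0.0, 1.0]             # sum of word vectors of the sentence
text_vec = [1.0, 1.0, 0.0]            # text summary vector
attr_vec = [0.0, 1.0, 0.0]            # candidate name + attributes
first_degree_vec = [1.0, 0.0, 0.0]    # names + attributes of one-degree nodes
second_degree_vec = [0.0, 0.0, 1.0]   # names + attributes of two-degree nodes

first_rel_vec = add_vectors(attr_vec, first_degree_vec)
second_rel_vec = add_vectors(attr_vec, second_degree_vec)

sen_score = cosine(sen_vec, attr_vec)
first_rel_score = cosine(first_rel_vec, text_vec)
second_rel_score = cosine(second_rel_vec, text_vec)
```

Returning 0.0 for a zero-length vector is a design choice of this sketch to avoid division by zero; the patent text does not say how that case is handled.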
S4, associating the entity with the candidate entity whose total similarity is the largest and exceeds the threshold.
Specifically, if more than one candidate entity is found in the knowledge graph library in step S2, feature matching and semantic calculation are performed according to step S3 and the maximum total similarity is determined, so as to find the best-matching candidate entity. It is then judged whether the maximum similarity reaches the association threshold. If it does, the association is made and the entity ID of the candidate entity, namely doc_id, is returned; if it does not, no association is made.
If only one matching entity is found, its total similarity is calculated directly through step S3 and compared with the association threshold. If the threshold is reached, the association is made and doc_id is returned; otherwise no association is made.
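The selection logic of step S4 can be sketched as a single function. The name `associate`, the dictionary input, and the strict "greater than" comparison are assumptions of this sketch; the text speaks both of "exceeding" and of "reaching" the threshold, so the boundary case would need to be settled in a real implementation.

```python
def associate(total_scores, threshold):
    """Pick the candidate with the largest total similarity.

    total_scores maps each candidate's doc_id to its total similarity.
    Returns the winning doc_id, or None when nothing exceeds the threshold.
    """
    if not total_scores:
        return None
    best_id = max(total_scores, key=total_scores.get)
    # Assumption: the score must be strictly greater than the threshold.
    return best_id if total_scores[best_id] > threshold else None


# Several candidates: the best one is kept only if it clears the threshold.
linked = associate({"e1": 0.82, "e2": 0.40}, threshold=0.6)
# A single candidate goes through the same check.
not_linked = associate({"e1": 0.50}, threshold=0.6)
```

The same function covers both the multi-candidate case and the single-candidate case, since taking the maximum of one score is a no-op.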
The method for associating knowledge base documents with knowledge graph entities provided by this embodiment can extract effective features and make full use of the relevance between, on the text side, the entity, the sentence containing it, and the text summary, and, on the graph side, the entity, its attributes, its one-degree relationships and related entities, and its two-degree relationships and related entities, thereby effectively improving the accuracy and recall of entity association.
Existing entity association methods have another problem: after an entity in a document has been associated with the knowledge graph, the position of the associated entity in the document still has to be looked up and cannot be obtained directly, which is especially inconvenient when the document has many pages. To address this issue, optionally, in this embodiment, the method further includes:
S5, querying the positions of each entity in the entity list in the knowledge base document to obtain a position list corresponding to the entity.
Specifically, the position of an entity in the knowledge base document may be the page number on which it appears. In this embodiment, the Elasticsearch engine may be used to query the page numbers of the entity in the knowledge base document, giving a list of all page numbers on which the entity appears. Thus, when the doc_id of the entity is returned, the page number list corresponding to that doc_id can be returned as well.
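The shape of the result in step S5 can be sketched as follows. Note that this example deliberately does not call Elasticsearch: it swaps in a plain in-memory scan over a list of page texts as a stand-in for the query, and the function names and the result dictionary layout are hypothetical.

```python
def page_list(pages, entity_name):
    """Return the 1-based numbers of all pages whose text mentions the entity."""
    return [number for number, content in enumerate(pages, start=1)
            if entity_name in content]


def association_result(doc_id, pages, entity_name):
    """Bundle the returned entity ID with the pages on which the entity appears."""
    return {"doc_id": doc_id, "pages": page_list(pages, entity_name)}


# Toy three-page document.
pages = [
    "Han Hong announced the campaign.",
    "The schedule covers 20 days.",
    "Han Hong thanked the volunteers.",
]
result = association_result("e1", pages, "Han Hong")
```

In a deployment along the lines of the embodiment, the scan would be replaced by a search query against an index that stores one record per page.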
To make it easier for the user to view the association information of an entity quickly and in real time, optionally, in this embodiment, the method further includes:
S6, emphasizing the formatting of the entity at the positions of the knowledge base document given in the position list.
Specifically, according to the page list corresponding to doc_id, the entity can be emphasized in the content of those document pages, for example by bolding or highlighting it, so that the document entity corresponding to the entity link can be found quickly and conveniently.
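One way to picture the emphasis in step S6 is to wrap each mention of the entity in markup. The sketch below assumes the page content is plain text and that bold is expressed with HTML-style `<b>` tags; both the tag choice and the function name `emphasize` are assumptions, since the patent does not name a document format.

```python
import re


def emphasize(page_text, entity_name, open_tag="<b>", close_tag="</b>"):
    """Wrap every literal occurrence of the entity name in emphasis tags."""
    pattern = re.escape(entity_name)  # treat the name as literal text, not a regex
    return re.sub(pattern,
                  lambda match: f"{open_tag}{match.group(0)}{close_tag}",
                  page_text)


page = "Han Hong met Yao Ming. Yao Ming joined the convoy."
marked = emphasize(page, "Yao Ming")
```

Only the pages listed for the entity would need to be processed, which keeps the work proportional to the number of mentions rather than to the size of the document.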
The principle of the invention is illustrated below by processing the following text:
"A few days ago, the famous singer Han Hong appeared together with Yao Ming and Zhang Ziyi at the launch event for the Tibet public welfare campaign that Han Hong initiated. It is reported that at the beginning of next month, Han Hong will join as many as a hundred caring supporters and medical experts to form a charity convoy of rescue volunteers for a 20-day public welfare trip."
The text is first split at the full stop "." into two sentences:
Sentence 1: "A few days ago, the famous singer Han Hong appeared together with Yao Ming and Zhang Ziyi at the launch event for the Tibet public welfare campaign that Han Hong initiated."
Sentence 2: "It is reported that at the beginning of next month, Han Hong will join as many as a hundred caring supporters and medical experts to form a charity convoy of rescue volunteers for a 20-day public welfare trip."
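The sentence split shown above can be sketched with a regular expression. The example assumes that sentences end with either the Chinese full stop "。" or the ASCII period "."; it makes no attempt to handle abbreviations or decimal numbers, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split text after each full stop, keeping the full stop with its sentence."""
    pieces = re.findall(r"[^。.]+[。.]?", text)
    return [piece.strip() for piece in pieces if piece.strip()]


def sentences_containing(text, entity_name):
    """Keep only the sentences in which the entity name occurs."""
    return [s for s in split_sentences(text) if entity_name in s]


text = ("Han Hong appeared with Yao Ming and Zhang Ziyi. "
        "Next month Han Hong will begin a 20-day trip.")
sentences = split_sentences(text)
zhang_sentences = sentences_containing(text, "Zhang Ziyi")
```

Selecting only the sentences that contain an entity is what limits senVec to the entity's immediate context rather than the whole document.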
For each sentence that contains an entity, the sentence is segmented into words, the word vectors of all the words are obtained through FastText, and the word vectors are summed to form the sentence vector senVec. The word vectors of the node names and attributes of the candidate entities for Han Hong, Yao Ming, and Zhang Ziyi are obtained and summed to form attrVec, and the vector similarity of senVec and attrVec is then calculated:
$$\mathrm{senScore} = \cos(\mathrm{senVec}, \mathrm{attrVec}) = \frac{\mathrm{senVec} \cdot \mathrm{attrVec}}{\|\mathrm{senVec}\| \, \|\mathrm{attrVec}\|}$$
where $\|x\|$ denotes the norm of the vector x. This gives the score senScore.
Next, the word vectors of the node names and attributes of the candidate entities for Han Hong, Yao Ming, and Zhang Ziyi and of the names and attributes of their one-degree relationship nodes are obtained and summed to form firstRelVec. Its similarity with the text summary vector textVec is calculated with the same formula as above to obtain the score firstRelScore.
Likewise, the word vectors of the node names and attributes of the candidate entities for Han Hong, Yao Ming, and Zhang Ziyi and of the names and attributes of their two-degree relationship nodes are obtained and summed to form secondRelVec. Its similarity with the text summary vector textVec is calculated with the same formula as above to obtain the score secondRelScore.
The calculated scores are given different, configurable weights. For example, if the one-degree relationship score is considered more important, the weight of firstRelScore is set higher, say 0.7, with senScore weighted 0.2 and secondRelScore weighted 0.1. Each score is multiplied by its weight and the products are summed to give sum. The sum for each entity is compared with the set threshold; if it is greater than the threshold, the entity is associated, and the page list of the associated entity is obtained from the pages returned for each entity by Elasticsearch.
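The weighting in this example can be worked through numerically. The weights 0.7, 0.2, and 0.1 come from the text; the three score values and the threshold of 0.6 are hypothetical numbers chosen only to show the arithmetic.

```python
# Weights taken from the example in the text.
WEIGHTS = {"firstRelScore": 0.7, "senScore": 0.2, "secondRelScore": 0.1}


def total_similarity(scores, weights=WEIGHTS):
    """Multiply each score by its weight and add the products."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores for one candidate: weak attribute match,
# strong match through the one-degree relationship nodes.
scores = {"senScore": 0.10, "firstRelScore": 0.90, "secondRelScore": 0.50}

total = total_similarity(scores)  # 0.7*0.90 + 0.2*0.10 + 0.1*0.50, about 0.70
threshold = 0.6                   # hypothetical association threshold
is_associated = total > threshold
```

With these numbers the sentence-level score alone (0.10) would fall far below the threshold, while the weighted total clears it, which mirrors the Zhang Ziyi case described in the background.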
Fig. 2 is a block diagram of a system for associating knowledge base documents and knowledge graph entities according to an embodiment of the present invention, where functional principles of various modules in the system have been described in the foregoing method embodiment, and are not described in detail below.
As shown in FIG. 2, the system includes:
an entity recognition module, configured to perform entity recognition on the text to obtain an entity list;
a candidate entity search module, configured to search a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
a similarity calculation module, configured to calculate, for each candidate entity, the similarity between the first feature information of the text and the second feature information of the candidate entity and of at least one associated node of the candidate entity, and to weight the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and an entity association module, configured to associate the entity with the candidate entity whose total similarity is the largest and exceeds the threshold.
Optionally, in this embodiment, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Optionally, in this embodiment, the system further includes:
a position query module, configured to query the positions of each entity in the entity list in the knowledge base document to obtain a position list corresponding to the entity.
Optionally, in this embodiment, the system further includes:
a format processing module, configured to emphasize the formatting of the entity at the positions of the knowledge base document given in the position list.
FIG. 3 is a schematic diagram illustrating a computing device according to an exemplary embodiment of the present invention.
Referring to fig. 3, computing device 300 includes memory 310 and processor 320.
The Processor 320 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 310 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 320 or other modules of the computer. The persistent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Further, the memory 310 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, memory 310 may include a removable storage device that is readable and/or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 310 has stored thereon executable code that, when processed by the processor 320, may cause the processor 320 to perform some or all of the methods described above.
The aspects of the invention have been described in detail hereinabove with reference to the drawings. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required by the invention. In addition, it can be understood that the steps in the method according to the embodiment of the present invention may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device according to the embodiment of the present invention may be combined, divided, and deleted according to actual needs.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out some or all of the steps of the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the invention.
Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology found in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A method for associating knowledge base documents with knowledge-graph entities, comprising:
performing entity identification on a text to obtain an entity list;
searching a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
calculating, for each candidate entity, the similarity between first feature information of the text and second feature information of the candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities according to their corresponding weights to obtain a total similarity for each candidate entity; and
associating the entity with the candidate entity that has the maximum total similarity, provided that the maximum total similarity exceeds a threshold.
2. The method according to claim 1, wherein the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
3. The method of claim 1 or 2, further comprising:
querying the position of each entity in the entity list within the document of the knowledge base to obtain a position list corresponding to the entity.
4. The method of claim 3, further comprising:
emphasizing the format of the entity in the document of the knowledge base at the positions in the position list.
5. A system for associating knowledge base documents with knowledge-graph entities, comprising:
an entity identification module, configured to perform entity identification on a text to obtain an entity list;
a candidate entity searching module, configured to search a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
a similarity calculation module, configured to calculate, for each candidate entity, the similarity between first feature information of the text and second feature information of the candidate entity and of at least one associated node of the candidate entity, and to weight the calculated similarities according to their corresponding weights to obtain a total similarity for each candidate entity; and
an entity association module, configured to associate the entity with the candidate entity that has the maximum total similarity, provided that the maximum total similarity exceeds a threshold.
6. The system according to claim 5, wherein the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
7. The system of claim 5 or 6, further comprising:
a position query module, configured to query the position of each entity in the entity list within the document of the knowledge base to obtain a position list corresponding to the entity.
8. The system of claim 7, further comprising:
a format processing module, configured to emphasize the format of the entity in the document of the knowledge base at the positions in the position list.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method of any of claims 1 to 4.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the method according to any one of claims 1 to 4.
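By way of illustration only, the method of claims 1 to 4 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not name a similarity measure, so cosine similarity is assumed here; the toy word vectors, the node weights, the candidate identifiers, the `<b>` emphasis markup, and all function names are hypothetical.

```python
import math
import re

# Toy 2-D word vectors (hypothetical); a real system would load trained embeddings.
WORD_VECTORS = {
    "apple": [0.5, 0.5], "phone": [1.0, 0.0], "chip": [0.9, 0.1],
    "company": [1.0, 0.1], "fruit": [0.0, 1.0], "juice": [0.1, 0.9],
}

def sum_vectors(words, dim=2):
    """Feature information as the sum of word vectors (claim 2); unknown words are skipped."""
    total = [0.0] * dim
    for word in words:
        vec = WORD_VECTORS.get(word)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

def cosine(u, v):
    """Cosine similarity (an assumed measure); returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def total_similarity(text_vec, nodes):
    """Weighted sum of the similarities between the text and each (words, weight) node."""
    return sum(weight * cosine(text_vec, sum_vectors(words)) for words, weight in nodes)

def link_entity(text_words, candidates, threshold):
    """Return the id of the best candidate if its total similarity exceeds threshold, else None."""
    text_vec = sum_vectors(text_words)
    best_id, best_score = None, float("-inf")
    for cand_id, nodes in candidates.items():
        score = total_similarity(text_vec, nodes)
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id if best_score > threshold else None

def find_positions(document, entity):
    """Start offsets of every occurrence of entity in document (claim 3)."""
    return [m.start() for m in re.finditer(re.escape(entity), document)]

def emphasize(document, entity, positions, open_tag="<b>", close_tag="</b>"):
    """Wrap entity at each listed position in emphasis markup (claim 4)."""
    for pos in sorted(positions, reverse=True):  # right to left keeps earlier offsets valid
        end = pos + len(entity)
        document = document[:pos] + open_tag + document[pos:end] + close_tag + document[end:]
    return document

# Each candidate lists its own node first, then associated nodes, each with a weight.
CANDIDATES = {
    "apple_company": [(["apple", "company"], 0.6), (["phone"], 0.4)],
    "apple_fruit": [(["apple", "fruit"], 0.6), (["juice"], 0.4)],
}
DOCUMENT = "apple released a phone; the apple chip is fast"
```

With these toy values, `link_entity(["apple", "phone", "chip"], CANDIDATES, 0.8)` selects `"apple_company"`, while a threshold that no candidate exceeds yields `None`, so the entity is left unassociated. `find_positions(DOCUMENT, "apple")` supplies the position list that `emphasize` then uses to mark each occurrence.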
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110601045.4A CN113360665A (en) | 2021-05-31 | 2021-05-31 | Method and system for associating knowledge base document and knowledge graph entity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113360665A (en) | 2021-09-07 |
Family
ID=77530391
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110601045.4A Pending CN113360665A (en) | 2021-05-31 | 2021-05-31 | Method and system for associating knowledge base document and knowledge graph entity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113360665A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114417845A (en) * | 2022-03-30 | 2022-04-29 | 支付宝(杭州)信息技术有限公司 | Identical entity identification method and system based on knowledge graph |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160283593A1 (en) * | 2015-03-23 | 2016-09-29 | Microsoft Technology Licensing, Llc | Salient terms and entities for caption generation and presentation |
CN110188168A (en) * | 2019-05-24 | 2019-08-30 | 北京邮电大学 | Semantic relation recognition methods and device |
CN111159423A (en) * | 2019-12-27 | 2020-05-15 | 北京明略软件系统有限公司 | Entity association method, device and computer readable storage medium |
CN112585596A (en) * | 2018-06-25 | 2021-03-30 | 易享信息技术有限公司 | System and method for investigating relationships between entities |
CN112633000A (en) * | 2020-12-25 | 2021-04-09 | 北京明略软件系统有限公司 | Method and device for associating entities in text, electronic equipment and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106599278B (en) | Application search intention identification method and device | |
WO2018049960A1 (en) | Method and apparatus for matching resource for text information | |
CN110321537B (en) | Method and device for generating file | |
US20220277038A1 (en) | Image search based on combined local and global information | |
CN112364624B (en) | Keyword extraction method based on deep learning language model fusion semantic features | |
WO2021146388A1 (en) | Systems and methods for providing answers to a query | |
CN110019669B (en) | Text retrieval method and device | |
CN111428506B (en) | Entity classification method, entity classification device and electronic equipment | |
US11227183B1 (en) | Section segmentation based information retrieval with entity expansion | |
CN110728135B (en) | Text theme indexing method and device, electronic equipment and computer storage medium | |
CN113032584A (en) | Entity association method, entity association device, electronic equipment and storage medium | |
Renjit et al. | CUSAT NLP@ AILA-FIRE2019: Similarity in Legal Texts using Document Level Embeddings. | |
Blanco et al. | Overview of NTCIR-13 Actionable Knowledge Graph (AKG) Task. | |
JP6340351B2 (en) | Information search device, dictionary creation device, method, and program | |
CN113360665A (en) | Method and system for associating knowledge base document and knowledge graph entity | |
US9087293B2 (en) | Categorizing concept types of a conceptual graph | |
CN114238744A (en) | Data processing method, device and equipment | |
CN112818206A (en) | Data classification method, device, terminal and storage medium | |
Jamil et al. | A subject identification method based on term frequency technique | |
US7849037B2 (en) | Method for using the fundamental homotopy group in assessing the similarity of sets of data | |
KR102028155B1 (en) | Document scoring method and document searching system | |
JP2003263441A (en) | Keyword determination database preparing method, keyword determining method, device, program and recording medium | |
CN112417154B (en) | Method and device for determining similarity of documents | |
CN113139383A (en) | Document sorting method, system, electronic equipment and storage medium | |
CN113515940B (en) | Method and equipment for text search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||