CN112084777B - Entity linking method - Google Patents

Publication number
CN112084777B
Authority
CN
China
Prior art keywords
entity
text
candidate
score
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010915722.5A
Other languages
Chinese (zh)
Other versions
CN112084777A (en)
Inventor
辛宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Fusion Media Technology Development Beijing Co ltd, Xinhua Zhiyun Technology Co ltd
Priority to CN202010915722.5A
Publication of CN112084777A
Application granted
Publication of CN112084777B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an entity linking method comprising the following steps: segmenting an input text to obtain a plurality of segmentation phrases, detecting the part of speech of each position word of the segmentation phrases, and obtaining a cut text from the segmentation phrases whose position words pass the detection; acquiring the entity references in the cut text and the candidate entities corresponding to the entity references; calculating an entity relevance score from the first appearance position and the number of occurrences of the entity reference in the input text and the text length of the input text; for the entity references that pass detection, calculating the entity information coincidence degree between a first text in the input text and a second text corresponding to the candidate entity, and obtaining a candidate entity richness score; calculating the entity score of each candidate entity; storing the candidate entities whose entity scores meet a preset condition into a to-be-disambiguated entity list; and linking the entity references to the filtered candidate entities. The invention has the following beneficial effects: the running time of the entity disambiguation algorithm is shortened, and the overall efficiency of the entity linking flow is improved.

Description

Entity linking method
Technical Field
The invention relates to the field of databases, in particular to an entity linking method.
Background
As the volume of text information grows, extracting and using the entities contained in text to obtain its subject matter is becoming increasingly important. Entity linking is a core technology for associating the entities extracted from text with an entity library.
Entity linking typically comprises two core stages. The first is an entity recognition stage, which recognizes the entity references present in the input text and obtains all related candidate entities in the entity library. The second is an entity disambiguation stage, which performs similarity matching between the text associated with each entity reference in the input text and the text associated with each corresponding candidate entity in the entity library, so that a preferred candidate entity is found among the many homonymous candidate entities in the entity library and the entity reference is linked to it.
Current entity linking approaches have several drawbacks in use:
first, because the two core stages of entity linking are usually implemented with deep learning algorithms, they are time-consuming and difficult to use in high-concurrency, real-time scenarios;
second, in order to be usable in complex real-time scenarios, entity linking can usually process only short text; when facing long text, the deep learning algorithm in a core stage is replaced by simple statistics or a simple similarity measure in order to improve concurrency, which greatly degrades the linking quality;
third, in a typical entity disambiguation stage, the tens or even hundreds of candidate entities that each entity reference points to must be disambiguated, but some entity linking approaches do not filter the candidate entities, so a large number of low-quality, poorly related candidate entities slow down the overall processing flow.
Disclosure of Invention
In view of the above problems in the prior art, an entity linking method is now provided.
The specific technical scheme is as follows:
an entity linking method, comprising the steps of:
dividing an input text to obtain a plurality of divided phrases, detecting the part of speech of each position word of each divided phrase, and splicing the divided phrases where the detected position words are located to obtain a cut text;
acquiring at least one candidate entity corresponding to the entity reference in the cut text;
according to the first appearance position and appearance times of the entity names in the input text, calculating to obtain entity relevance scores of the entity names by combining the text lengths of the input text, and then detecting the entity names corresponding to the entity relevance scores according to the entity relevance scores;
acquiring a first text of the detected entity index in the input text, and calculating entity information coincidence degree between the first text and a second text corresponding to the candidate entity;
acquiring a candidate entity richness score of the candidate entity by combining entity information coincidence degree;
calculating according to the entity relevance score and the candidate entity richness score to obtain the entity score of the candidate entity;
storing candidate entities corresponding to entity scores meeting a preset condition into a to-be-disambiguated entity list;
filtering the candidate entities in the to-be-disambiguated entity list such that the entity references are linked to the filtered candidate entities.
Preferably, in the entity linking method, the entity relevance score of the entity reference is calculated by the following formula:
C3 = (L2 - I)/L2 + log(c)/log(k);
wherein C3 represents the entity relevance score;
L2 represents the text length of the input text;
I represents the first appearance position of the entity reference in the input text;
c represents the number of occurrences of the entity reference in the input text;
k represents the adjustment coefficient.
Preferably, the method for entity linking includes segmenting an input text to obtain a plurality of segmented phrases, detecting part of speech of each position word of each segmented phrase, and splicing segmented phrases where each detected position word is located to obtain a cut text, specifically including the following steps:
step S11, dividing the input text according to punctuation marks to obtain a plurality of segmentation phrases;
step S12, sequentially storing all the segmentation phrases of the input text into a segmentation phrase list;
step S13, traversing the segmentation phrase list and acquiring the current segmentation phrases;
step S14, word segmentation is carried out on the current segmentation short sentence so as to obtain a plurality of words at all positions corresponding to the current segmentation short sentence;
step S15, performing part-of-speech detection on each position word, and splicing the current segmentation phrases where the position words passing through the part-of-speech detection are located to obtain a spliced text;
step S16, setting the next segmentation phrase as the current segmentation phrase, returning to step S14 until all segmentation phrases in the segmentation phrase list are traversed, so as to set the spliced text as the cut text.
Preferably, the entity linking method further includes, after step S15:
judging whether the length of the spliced text exceeds a preset text length threshold value or not;
if yes, setting the spliced text as a cut text, setting the next segmentation phrase as the current segmentation phrase, and returning to steps S14 to S16 to obtain the next cut text.
Preferably, the entity linking method, wherein, acquiring at least one candidate entity corresponding to the entity reference in the cut text, comprises the following steps:
and inputting the cut text into an entity recognition algorithm model to obtain candidate entities corresponding to the entity names in the cut text.
Preferably, the entity linking method detects entity names corresponding to the entity relevance scores according to the entity relevance scores, and specifically includes:
judging whether the entity correlation score exceeds a preset entity correlation score threshold;
if yes, judging that the entity name corresponding to the entity relevance score passes detection;
if not, judging that the entity name corresponding to the entity relevance score fails to pass the detection.
Preferably, in the entity linking method, acquiring the first text of the detected entity reference in the input text and calculating the entity information coincidence degree between the first text and the second text corresponding to the candidate entity specifically includes the following steps:
acquiring the upper text and the lower text of the entity reference in the input text, performing word segmentation on the upper text and the lower text to correspondingly obtain an upper text word set and a lower text word set, and setting the upper text word set and the lower text word set as the first text;
acquiring description information of candidate entities, segmenting the description information to obtain a description word set, and setting the description word set as a second text;
and acquiring the entity information coincidence degree between the first text and the second text.
Preferably, the entity linking method, wherein, the candidate entity richness score of the candidate entity is obtained by combining entity information coincidence degree, specifically comprises the following steps:
calculating the length of the description information of the candidate entity;
calculating the editing times of the description information;
calculating the reference times of the description information;
obtaining a candidate entity richness score of the candidate entity according to the following formula:
C2 = log(L1 + (u/2) + (s/2));
wherein C2 represents the candidate entity richness score;
L1 represents the length of the description information of the candidate entity;
u represents the number of times the description information has been edited;
s represents the number of times the description information has been referenced.
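The richness formula can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the natural logarithm is assumed because the text does not state a base, and the entity information coincidence degree is left out because the formula as printed does not include it.

```python
import math


def candidate_richness_score(description_length, edit_count, reference_count):
    """C2 = log(L1 + u/2 + s/2), using the natural logarithm (an assumption)."""
    total = description_length + edit_count / 2 + reference_count / 2
    if total <= 0:
        # log is undefined here; the text does not say how to treat this case.
        raise ValueError("L1 + u/2 + s/2 must be positive")
    return math.log(total)


# A candidate with a 90-character description, 10 edits and 10 references.
score = candidate_richness_score(90, 10, 10)  # log(100)
```

Because of the logarithm, a much longer or more heavily edited description raises the score only slowly.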
Preferably, the entity linking method, wherein the sum of the entity relevance score and the candidate entity richness score is used as the entity score of the candidate entity.
Preferably, in the entity linking method, storing the candidate entities corresponding to entity scores meeting the preset condition into the to-be-disambiguated entity list specifically includes the following steps:
sequentially storing the entity score of each candidate entity into an entity score storage list according to a preset order;
traversing the entity score of each candidate entity in the entity score storage list, and storing the candidate entities whose entity scores are greater than a preset entity score threshold into the to-be-disambiguated entity list.
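The scoring and first filtering step can be sketched in Python as follows. This is a minimal sketch: the function name, the dictionary input and the descending sort are assumptions (the text only says "a preset order"), and the entity score is taken as the sum of the relevance score and the richness score, as in the preferred embodiment.

```python
def shortlist_candidates(relevance_score, richness_by_candidate, score_threshold):
    """Score each candidate as relevance + richness and keep those above the threshold.

    richness_by_candidate maps a candidate identifier to its richness score.
    Returns (candidate, entity_score) pairs, highest score first.
    """
    scored = [
        (candidate, relevance_score + richness)
        for candidate, richness in richness_by_candidate.items()
    ]
    # Assumed ordering: descending by entity score.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(candidate, score) for candidate, score in scored if score > score_threshold]


disambiguation_list = shortlist_candidates(1.5, {"A": 4.0, "B": 1.0, "C": 3.0}, 4.0)
# Only A (5.5) and C (4.5) go on to the second filtering step.
```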
The technical scheme has the following advantages or beneficial effects:
firstly, entity linking of an input text of any length is realized by cutting the input text;
secondly, splicing the segmentation phrases of the detected words at all positions to obtain a cut text, so that the segmentation phrases in the cut text basically contain entity names, and the cut text possibly containing the entities is detected in advance, so that the text length of the input text can be greatly shortened, and the entity recognition efficiency is improved;
thirdly, calculating according to the entity relevance score and the candidate entity richness score to obtain an entity score of the candidate entity, and linking the entity index to the candidate entity with the entity score meeting the preset condition; the candidate entities are thereby filtered such that the entity designations are linked to the preferred candidate entity of the corresponding plurality of candidate entities.
Fourthly, the relevance between an entity reference and the input text is measured by calculating the entity relevance score, and each entity reference is detected according to its entity relevance score; entity references that pass the detection are retained and those that fail are discarded, so that only entity references related to the subject of the input text are kept, which shortens the running time of the entity disambiguation algorithm and further improves the overall efficiency of the entity linking flow.
Fifthly, the plurality of candidate entities corresponding to each entity reference are filtered a first time to obtain the candidate entities stored in the to-be-disambiguated entity list, and the candidate entities stored in that list are then filtered a second time, so that the entity reference is linked to the filtered candidate entities.
Drawings
Embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The drawings, however, are for illustration and description only and are not intended as a definition of the limits of the invention.
FIG. 1 is a flowchart of step S1 of an embodiment of an entity linking method of the present invention;
FIG. 2 is a flowchart of steps S2-S7 of an embodiment of the entity linking method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
The invention comprises an entity linking method, which comprises the following steps:
step S1, dividing an input text to obtain a plurality of divided phrases, detecting the part of speech of each position word of each divided phrase, and splicing the divided phrases where each detected position word is located to obtain a cut text;
step S2, acquiring at least one candidate entity corresponding to the entity reference in the cut text;
step S3, calculating to obtain the entity relevance score of the entity index according to the first appearance position and appearance times of the entity index in the input text and combining the text length of the input text, and then detecting the entity index corresponding to the entity relevance score according to the entity relevance score;
step S4, acquiring a first text of the detected entity reference in the input text, and calculating the entity information coincidence degree between the first text and a second text corresponding to the candidate entity;
step S5, obtaining the candidate entity richness score of the candidate entity by combining the entity information coincidence degree;
step S6, calculating to obtain the entity score of the candidate entity according to the entity relevance score and the candidate entity richness score;
step S7, storing candidate entities corresponding to entity scores meeting a preset condition into a to-be-disambiguated entity list;
step S8, filtering the candidate entities in the to-be-disambiguated entity list, so that the entity references are linked to the filtered candidate entities.
In the above embodiment, the order of the steps S1 to S8 is not set, i.e., the steps S1 to S8 may be performed out of order.
In the above embodiment, the length of the input text is not limited. In step S1, the input text is segmented to obtain a plurality of segmentation phrases, the part of speech of each position word of each segmentation phrase is detected, and the segmentation phrases containing position words that pass the detection are spliced to obtain a cut text; by cutting the input text in this way, entity linking can be performed on an input text of any length;
since most sentences in the input text do not contain entities, it is unnecessary to use the invalid sentences for entity recognition, the input text is segmented, and segmented short sentences where the detected words at all positions are located are spliced to obtain the cut text, so that the segmented short sentences in the cut text basically contain entity names, the cut text which possibly contains the entities is detected in advance, the text length of the input text can be greatly shortened, and the efficiency of entity recognition is improved.
Calculating according to the entity relevance score and the candidate entity richness score to obtain an entity score of the candidate entity, and linking the entity index to the candidate entity with the entity score meeting the preset condition; the candidate entities are thereby filtered such that the entity designations are linked to the preferred candidate entity of the corresponding plurality of candidate entities.
The relevance between an entity reference and the input text is measured by calculating the entity relevance score, and each entity reference is detected according to its entity relevance score; entity references that pass the detection are retained and those that fail are discarded, so that only entity references related to the subject of the input text are kept, which shortens the running time of the entity disambiguation algorithm and further improves the overall efficiency of the entity linking flow.
In the above embodiment, the plurality of candidate entities corresponding to each entity reference are filtered a first time through steps S5-S7 to obtain the candidate entities stored in the to-be-disambiguated entity list, and the candidate entities stored in that list are then filtered a second time through step S8, so that the entity reference is linked to the filtered candidate entities; the filtered candidate entities are the preferred candidate entities, that is, the entity reference may be linked to at least one filtered candidate entity.
Further, in the above embodiment, the step S1 specifically includes the steps of:
step S11, dividing the input text according to punctuation marks to obtain a plurality of segmentation phrases;
step S12, sequentially storing all the segmentation phrases of the input text into a segmentation phrase list;
step S13, traversing the segmentation phrase list and acquiring the current segmentation phrases;
step S14, word segmentation is carried out on the current segmentation short sentence so as to obtain a plurality of words at all positions corresponding to the current segmentation short sentence;
step S15, performing part-of-speech detection on each position word, and splicing the current segmentation phrases where the position words passing through the part-of-speech detection are located to obtain a spliced text;
step S16, setting the next segmentation phrase as the current segmentation phrase, returning to step S14 until all segmentation phrases in the segmentation phrase list are traversed, so as to set the spliced text as the cut text.
In the above embodiment, firstly, the input text is segmented according to punctuation marks to obtain a plurality of segmentation phrases corresponding to the input text, so that the text length is reduced, and the entity recognition efficiency is improved;
and then, word segmentation is carried out on each segmentation short sentence stored in the segmentation short sentence list to obtain a plurality of words at each position, word part detection is carried out on the words at each position obtained after the secondary segmentation is carried out, and segmentation short sentences where the words at each position pass through the word part detection are positioned are spliced to obtain a cut text, so that entity recognition on invalid sentences is avoided, the segmentation short sentences where the words at each position possibly contain the entity are detected in advance through the word part detection, the text length is greatly shortened, and the entity recognition efficiency is improved.
As a preferred embodiment, the punctuation marks may include "+|! ,. ?" and the like; however, the punctuation marks are selected flexibly according to the business scenario, and thus include, but are not limited to, "+|! ,. ?".
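Splitting on a configurable set of punctuation marks can be sketched in Python as follows. This is only a sketch: the function name and the default delimiter set are assumptions chosen for illustration, since the text leaves the exact set to the application.

```python
import re

# Assumed delimiter set (ASCII and full-width marks); adjust per application.
DELIMITERS = "!,.?;!,。?;"


def split_into_phrases(text, delimiters=DELIMITERS):
    """Split text at any delimiter character and drop empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


phrases = split_into_phrases("Xinhua opened an office in Hangzhou. It rained, heavily! Why?")
# phrases is a list of four segmentation phrases with the punctuation removed.
```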
In the above embodiment, the step S15 specifically includes the steps of:
step S151, detecting the part of speech of each position word using any word segmenter;
step S152, when a noun part of speech is present in the part-of-speech detection result of a position word, marking that position word as having passed the part-of-speech detection;
step S153, splicing the segmentation phrases containing the position words that passed the part-of-speech detection to obtain a cut text.
In step S153, split phrases where the words detected by the part of speech are located may be spliced by periods; for example, the split phrases where the words at the positions detected by the part of speech include a first split phrase and a second split phrase, and a period is provided between the first split phrase and the second split phrase.
It should be noted that the segmentation phrases detected by the part-of-speech may be spliced in other manners, as long as two adjacent segmentation phrases can be segmented.
In the above embodiment, the noun part of speech includes the basic noun part of speech, the person part of speech, the place part of speech, and the organization part of speech.
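The noun check can be sketched in Python as follows. The tag names are an assumption: they follow a common ICTCLAS-style convention (n, nr, ns, nt), and a real tagger may use different labels. The input is assumed to be (word, tag) pairs already produced by some word segmenter.

```python
# Assumed tag names: n = basic noun, nr = person, ns = place, nt = organization.
NOUN_TAGS = frozenset({"n", "nr", "ns", "nt"})


def phrase_passes_detection(tagged_words, noun_tags=NOUN_TAGS):
    """Return True if at least one (word, tag) pair carries a noun tag."""
    return any(tag in noun_tags for _word, tag in tagged_words)


kept = phrase_passes_detection([("Beijing", "ns"), ("is", "v"), ("large", "a")])
dropped = phrase_passes_detection([("run", "v"), ("quickly", "d")])
```

A phrase is kept as soon as one of its words is a noun of any of the four kinds, which matches the worked example later in the description.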
In the above embodiment, the length of the cut text is smaller than the length of the input text.
Further, as a preferred embodiment, step S15 further includes:
judging whether the length of the spliced text exceeds a preset text length threshold value or not;
if yes, setting the spliced text as a cut text, setting the next segmentation phrase as the current segmentation phrase, and returning to steps S14 to S16 to obtain the next cut text.
In the above embodiment, a text length threshold is preset, and a spliced text whose length exceeds the threshold is set as a cut text; this is the first cut text. The segmentation phrases in the segmentation phrase list that have not yet been traversed then continue to be traversed to obtain the second cut text. This prevents an overly long cut text from reducing efficiency, and thus improves the efficiency of entity recognition.
For example, first, an input text is divided according to punctuation marks to obtain 100 segmentation phrases;
subsequently, the 100 divided phrases are sequentially stored into a divided phrase list;
then, traversing the segmentation phrase list is started, and the 1 st segmentation phrase in the segmentation phrase list is set as the current segmentation phrase;
then, word segmentation is carried out on the current segmentation short sentence so as to obtain three words at each position corresponding to the current segmentation short sentence;
then, part-of-speech detection is carried out on each of the three position words; if at least one position word passes the part-of-speech detection, the current segmentation phrase containing it is spliced into the spliced text; if none of the three position words passes the detection, the current segmentation phrase is not spliced; a spliced text is thereby obtained;
and then, setting the next segmentation phrase as the current segmentation phrase, namely setting the 2 nd segmentation phrase as the current segmentation phrase, continuing to execute the steps until all segmentation phrases in the segmentation phrase list are traversed or the length of the spliced text reaches a preset text length threshold, and setting the spliced text as the cut text.
As a preferred embodiment, as shown in FIG. 1:
step one, dividing the input text according to punctuation marks to obtain a plurality of segmentation phrases, and storing all the segmentation phrases of the input text into a segmentation phrase list;
step two, traversing each segmentation phrase in the segmentation phrase list;
step three, judging whether all the segmentation phrases in the segmentation phrase list have been traversed;
if yes, setting the spliced text as the cut text, and exiting;
if not, acquiring the current segmentation phrase, and performing word segmentation on it to obtain the position words corresponding to the current segmentation phrase;
step four, performing part-of-speech detection on each position word, and judging whether any position word passes the part-of-speech detection;
if yes, splicing the current segmentation phrase into the spliced text;
if not, discarding the current segmentation phrase, and returning to step three;
step five, judging whether the length of the spliced text exceeds the preset text length threshold;
if yes, setting the spliced text as a cut text, setting the next segmentation phrase as the current segmentation phrase, and returning to step three to obtain the next cut text;
if not, setting the next segmentation phrase as the current segmentation phrase and returning to step three, until all segmentation phrases in the segmentation phrase list have been traversed, whereupon the spliced text is set as the cut text.
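The loop described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `build_cut_texts` and `toy_tagger` are hypothetical names, the tagger is a stand-in for a real word segmenter with part-of-speech tagging, the noun tag names are assumed, and phrases are joined with a period as one of the splicing options mentioned in the text.

```python
def build_cut_texts(phrases, tag_phrase, max_length, noun_tags=("n", "nr", "ns", "nt")):
    """Splice noun-bearing phrases into cut texts, starting a new one past max_length."""
    cut_texts = []
    kept = []  # phrases in the spliced text currently being built
    for phrase in phrases:
        tagged = tag_phrase(phrase)  # hypothetical tagger: phrase -> [(word, tag), ...]
        if not any(tag in noun_tags for _word, tag in tagged):
            continue  # no noun found: discard this phrase
        kept.append(phrase)
        spliced = ". ".join(kept)
        if len(spliced) > max_length:
            cut_texts.append(spliced)  # threshold exceeded: emit and start the next cut text
            kept = []
    if kept:
        cut_texts.append(". ".join(kept))  # remaining spliced text after the traversal
    return cut_texts


def toy_tagger(phrase):
    """Stand-in tagger for illustration only: tags two known names, everything else 'x'."""
    nouns = {"Beijing": "ns", "Xinhua": "nt"}
    return [(word, nouns.get(word, "x")) for word in phrase.split()]


phrases = ["Xinhua opened", "then it rained", "in Beijing", "Xinhua again"]
cut_texts = build_cut_texts(phrases, toy_tagger, max_length=20)
```

With the toy inputs, the phrase without a noun is dropped, the first two kept phrases exceed the 20-character threshold together and form the first cut text, and the last phrase forms the second.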
Further, in the above embodiment, step S2 includes the steps of:
and inputting the cut text into an entity recognition algorithm model to obtain candidate entities corresponding to the entity names in the cut text.
Each entity reference may correspond to a plurality of candidate entities; the plurality of candidate entities of each entity reference are filtered subsequently, so that each entity reference is linked to a preferred candidate entity.
In the above embodiment, all entity references of the cut text are sequentially stored in an entity list, and all candidate entities corresponding to each entity reference are sequentially stored in a candidate entity list; that is, the cut text corresponds to one entity list, and each entity reference in the entity list corresponds to one candidate entity list.
Further, in the above embodiment, the entity relevance score of the entity reference is calculated by the following formula:
C3 = (L2 - I)/L2 + log(c)/log(k); (1)
wherein, in formula (1), C3 represents the entity relevance score;
L2 represents the text length of the input text;
I represents the first appearance position of the entity reference in the input text;
c represents the number of occurrences of the entity reference in the input text;
k represents the adjustment coefficient.
In the above embodiment, in step S3, detecting the entity designation corresponding to the entity relevance score according to the entity relevance score specifically includes:
judging whether the entity correlation score exceeds a preset entity correlation score threshold;
if yes, judging that the entity name corresponding to the entity relevance score passes detection;
if not, judging that the entity name corresponding to the entity relevance score fails to pass the detection.
In the above embodiment, the entity names corresponding to the entity relevance scores lower than the preset entity relevance score threshold are discarded, so that the entity names which do not need to be identified are prevented from being identified, and the efficiency of entity identification is improved.
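The relevance score and the threshold check can be sketched in Python as follows. The names are hypothetical, positions and lengths are assumed to be counted in characters, and k > 1 is an assumption made here so that log(k) is positive; the text only calls k an adjustment coefficient.

```python
import math


def entity_relevance_score(text_length, first_position, count, k):
    """C3 = (L2 - I) / L2 + log(c) / log(k)."""
    if text_length <= 0 or count < 1 or k <= 1:
        raise ValueError("expects text_length > 0, count >= 1 and k > 1 (assumed)")
    position_term = (text_length - first_position) / text_length
    frequency_term = math.log(count) / math.log(k)
    return position_term + frequency_term


def passes_detection(score, threshold):
    """An entity reference passes when its score exceeds the preset threshold."""
    return score > threshold


# First seen at character 100 of a 1000-character text, 10 occurrences, k = 10.
score = entity_relevance_score(1000, 100, 10, 10)  # 0.9 + 1.0
```

The first term rewards references that appear early in the text and the second rewards references that appear often, so a reference seen once near the end scores close to zero.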
Further, in the above embodiment, the step S4 specifically includes the steps of:
step S41, acquiring an upper text and a lower text of the entity designation in the input text, performing word segmentation on the upper text and the lower text to obtain an upper text word set and a lower text word set respectively, and setting the upper text word set and the lower text word set as the first text;
step S42, acquiring description information of the candidate entity, performing word segmentation on the description information to obtain a description word set, and setting the description word set as the second text;
step S43, acquiring the entity information coincidence degree between the first text and the second text.
In the above embodiment, step S41 and step S42 may be executed in either order;
the words in the upper text word set and the lower text word set can be stored as lists, i.e. the upper text word set and the lower text word set can be an upper text word list and a lower text word list; similarly, the words in the description word set can be stored as a list, i.e. the description word set can be a description word list.
The entity information coincidence degree between the upper and lower text word sets of the entity designation and the description word set of each corresponding candidate entity can be obtained by calculating the size of the intersection between the upper and lower text word sets and the description word set of that candidate entity.
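The intersection-size calculation described above can be sketched as follows. The function name is hypothetical, and the word lists in the example are assumed to be the output of a word segmentation step that is not shown here.

```python
def entity_information_coincidence(upper_words, lower_words, description_words):
    """Size of the intersection between the first text and the second text.

    The first text is the union of the upper and lower text word sets;
    the second text is the description word set of one candidate entity.
    """
    first_text = set(upper_words) | set(lower_words)
    second_text = set(description_words)
    return len(first_text & second_text)


# Example: "company" and "phone" occur on both sides.
overlap = entity_information_coincidence(
    ["the", "company", "released"],
    ["a", "new", "phone"],
    ["technology", "company", "phone", "maker"],
)
```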
As a preferred embodiment, the upper and lower text of the entity designation in the input text may be the portion of the input text delimited by the periods "." immediately before and after the position of the entity designation. For example, if the input text is "XXX. XXXXXXXXXX, …… XXXXX [entity designation] XXX …… XXX. XXX ……", the upper text runs from the entity designation back to the first preceding ".", i.e. "XXXXXXXXXX, …… XXXXX"; the lower text runs from the entity designation forward to the first following ".", i.e. "XXX …… XXX"; and the upper and lower text is the combination of the upper text and the lower text.
As another preferred embodiment, the upper and lower text may be the text of a preset length before and after the position of the entity designation in the input text; for example, the preset length may be 100 characters.
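Both ways of obtaining the upper and lower text can be sketched in Python. The function names and the use of character offsets `start` and `end` for the entity designation are assumptions. The default delimiter is "."; for Chinese input text the full stop "。" would be passed instead.

```python
def sentence_context(text, start, end, delimiter="."):
    """Upper and lower text bounded by the nearest delimiter on each side.

    text[start:end] is the entity designation.
    """
    left = text.rfind(delimiter, 0, start)   # -1 if no delimiter precedes it
    right = text.find(delimiter, end)        # -1 if no delimiter follows it
    upper_begin = left + len(delimiter) if left != -1 else 0
    upper = text[upper_begin:start]
    lower = text[end:right] if right != -1 else text[end:]
    return upper, lower


def window_context(text, start, end, length=100):
    """Upper and lower text of a preset length around the designation."""
    upper = text[max(0, start - length):start]
    lower = text[end:end + length]
    return upper, lower
```

The first function follows the sentence-delimited variant and the second follows the fixed-length variant with the 100-character example value as its default.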
Further, in the above embodiment, the step S5 specifically includes the steps of:
step S51, calculating the length of the description information of the candidate entity;
calculating the editing times of the description information;
calculating the reference times of the description information;
step S52, obtaining the candidate entity richness score of the candidate entity according to the following formula:
C2 = log(L1 + (u/2) + (s/2));  (2)
wherein, in the above formula (2), C2 represents the candidate entity richness score;
L1 represents the length of the description information of the candidate entity;
u represents the number of edits of the description information;
s represents the number of references of the description information.
In the above embodiment, the description information of the candidate entity may be looked up in a knowledge base, which includes, but is not limited to, a specific knowledge base corresponding to the scenario in which the method is applied, for example, internet semantic knowledge bases such as Wikipedia, DBpedia and Baidu Baike (Baidu Encyclopedia);
the number of edits of the description information is the number of times the description information of the candidate entity has been edited in the knowledge base; for example, if the description information of a candidate entity has been edited u times in Baidu Baike, the number of edits of the description information is u.
The number of references of the description information is calculated similarly; for example, if the Baidu Baike entry corresponding to the description information of a candidate entity has been referenced s times, the number of references of the description information is s.
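A minimal Python sketch of formula (2), for illustration only. The function and argument names are assumptions, and the natural logarithm is assumed because the text does not state the logarithm base.

```python
import math


def candidate_richness_score(description_length, edit_count, reference_count):
    """C2 = log(L1 + u/2 + s/2), using the natural logarithm (an assumption)."""
    total = description_length + edit_count / 2 + reference_count / 2
    if total <= 0:
        raise ValueError("L1 + u/2 + s/2 must be positive")
    return math.log(total)
```

For example, a description of length 80 that has been edited 20 times and referenced 20 times gives log(80 + 10 + 10) = log(100).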
Further, in the above embodiment, the sum of the entity relevance score and the candidate entity richness score is taken as the entity score of the candidate entity.
Further, in the above embodiment, the step S7 specifically includes the steps of:
step S71, sequentially storing the entity score of each candidate entity into an entity score storage list according to a preset arrangement order;
step S72, traversing each candidate entity in the entity score storage list, and storing the candidate entities whose entity scores are greater than a preset entity score threshold into a to-be-disambiguated entity list.
In the above embodiment, the preset arrangement order may be ascending order of entity score or descending order of entity score.
In the above embodiment, it may be determined whether the number of candidate entities stored in the entity score storage list exceeds the threshold of the to-be-disambiguated entity list;
if yes, step S8 is directly executed.
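Steps S71 and S72, together with the summation of the two scores, can be sketched as follows. The function name, the dictionary layout and the example values are assumptions; the sketch sorts in descending order by default, which is one of the two arrangement orders the embodiment allows.

```python
def select_for_disambiguation(richness_scores, relevance_score,
                              score_threshold, descending=True):
    """Return candidate names whose entity score exceeds the threshold.

    richness_scores maps each candidate entity to its richness score C2;
    relevance_score is the score C3 of the entity designation.
    """
    # Entity score = entity relevance score + candidate entity richness score.
    score_list = [(name, relevance_score + c2) for name, c2 in richness_scores.items()]
    # Step S71: store the scores in the preset arrangement order.
    score_list.sort(key=lambda item: item[1], reverse=descending)
    # Step S72: keep candidates whose entity score is greater than the threshold.
    return [name for name, score in score_list if score > score_threshold]


selected = select_for_disambiguation({"a": 2.0, "b": 4.0, "c": 0.5},
                                     relevance_score=1.0, score_threshold=2.0)
```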
As a preferred embodiment, as shown in figure 2:
first step, traversing each segmentation phrase in the segmentation phrase list;
second step, judging whether all entity designations in the entity list have been traversed;
if yes, exiting execution;
if not, acquiring the current entity designation;
third step, calculating the entity relevance score of the entity designation according to the first appearance position and the number of occurrences of the entity designation in the input text, in combination with the text length of the input text;
fourth step, detecting the entity designation corresponding to the entity relevance score according to the entity relevance score, and judging whether the entity designation passes the detection;
if yes, acquiring the first text, in the input text, of the entity designation that passes the detection;
if not, discarding the entity designation that fails the detection, and then returning to the second step;
fifth step, traversing each candidate entity in the corresponding candidate entity list;
sixth step, judging whether all candidate entities in the candidate entity list corresponding to the current entity designation have been traversed;
if yes, executing the eleventh step;
if not, executing the seventh step;
seventh step, acquiring the current candidate entity and acquiring the second text corresponding to the current candidate entity;
eighth step, calculating the entity information coincidence degree between the first text and the second text; and
acquiring the candidate entity richness score of the current candidate entity in combination with the entity information coincidence degree;
ninth step, calculating the entity score of the current candidate entity according to the entity relevance score and the candidate entity richness score;
tenth step, recording the entity score of the current candidate entity into the entity score storage list, and returning to the sixth step;
eleventh step, traversing the entity score corresponding to each candidate entity in the entity score storage list;
twelfth step, judging whether the entity scores of all candidate entities in the entity score storage list corresponding to the current entity designation have been traversed;
if yes, returning to the first step;
if not, executing the thirteenth step;
thirteenth step, acquiring the current candidate entity, and judging whether the entity score of the current candidate entity is greater than the preset entity score threshold;
if not, discarding the candidate entity corresponding to the entity score, setting the next candidate entity as the current candidate entity, and returning to the twelfth step;
if yes, storing the candidate entity into the to-be-disambiguated entity list;
fourteenth step, judging whether the number of candidate entities in the to-be-disambiguated entity list exceeds the threshold of the to-be-disambiguated entity list;
if yes, executing step S8;
if not, setting the next candidate entity as the current candidate entity, and returning to the twelfth step.
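The loop structure of the flow above can be condensed into the following Python sketch. It is an illustration under several assumptions and is not the embodiment itself: all names, default thresholds and sample data are hypothetical; whitespace splitting stands in for word segmentation; the context is a fixed-length window around the first occurrence; candidates are kept when their entity score is greater than the threshold, as in step S72; and the coincidence degree is only recorded next to the score, because formula (2) does not state how it enters the richness score.

```python
import math


def link_candidates(input_text, entity_list, descriptions, stats, k=10.0,
                    relevance_threshold=0.5, score_threshold=3.0, window=100):
    """entity_list: {designation: [candidate ids]};
    descriptions: {candidate id: description information};
    stats: {candidate id: (number of edits, number of references)}."""
    to_disambiguate = {}
    for designation, candidate_ids in entity_list.items():
        first = input_text.find(designation)
        if first == -1:
            continue  # the designation does not occur in the input text
        count = input_text.count(designation)
        # Formula (1): entity relevance score.
        relevance = ((len(input_text) - first) / len(input_text)
                     + math.log(count) / math.log(k))
        if relevance <= relevance_threshold:
            continue  # the designation fails the detection and is discarded
        end = first + len(designation)
        context = input_text[max(0, first - window):first] + " " + input_text[end:end + window]
        first_text = set(context.split())
        kept = []
        for cid in candidate_ids:
            description = descriptions[cid]
            overlap = len(first_text & set(description.split()))
            edits, refs = stats[cid]
            total = len(description) + edits / 2 + refs / 2
            if total <= 0:
                continue  # formula (2) is undefined for this candidate
            score = relevance + math.log(total)  # entity score = C3 + C2
            if score > score_threshold:
                kept.append((cid, score, overlap))
        kept.sort(key=lambda item: item[1], reverse=True)
        to_disambiguate[designation] = kept
    return to_disambiguate


result = link_candidates(
    "Acme released a new phone. Acme is a company.",
    {"Acme": ["acme_inc", "acme_band"], "Zed": ["zed"]},
    {"acme_inc": "Acme is a phone company founded long ago",
     "acme_band": "band", "zed": "unused"},
    {"acme_inc": (10, 10), "acme_band": (0, 0), "zed": (0, 0)},
)
```

In this example only the candidate "acme_inc" of the designation "Acme" reaches the to-be-disambiguated list; "acme_band" falls below the entity score threshold and "Zed" does not occur in the input text.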
The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the embodiments and scope of the present invention, and it should be appreciated by those skilled in the art that equivalent substitutions and obvious variations may be made using the description and illustrations of the present invention, and are intended to be included in the scope of the present invention.

Claims (7)

1. An entity linking method, comprising the steps of:
segmenting an input text to obtain a plurality of segmentation phrases, performing part-of-speech detection on the word at each position of each segmentation phrase, and splicing the segmentation phrases whose words have undergone part-of-speech detection to obtain a cut text;
step S11, segmenting the input text according to punctuation marks to obtain a plurality of segmentation phrases;
step S12, sequentially storing all the segmentation phrases of the input text into a segmentation phrase list;
step S13, traversing the segmentation phrase list and acquiring the current segmentation phrase;
step S14, performing word segmentation on the current segmentation phrase to obtain the word at each position of the current segmentation phrase;
step S15, performing part-of-speech detection on the word at each position, splicing the current segmentation phrase whose words have undergone part-of-speech detection to obtain a spliced text, judging whether the length of the spliced text exceeds a preset text length threshold, and if so, setting the spliced text as the cut text;
step S16, setting the next segmentation phrase as the current segmentation phrase and returning to step S14 until all segmentation phrases in the segmentation phrase list have been traversed, and then setting the spliced text as the cut text;
acquiring at least one candidate entity corresponding to each entity designation in the cut text;
calculating an entity relevance score of the entity designation according to the first appearance position and the number of occurrences of the entity designation in the input text, in combination with the text length of the input text, and detecting the entity designation corresponding to the entity relevance score according to the entity relevance score;
acquiring a first text, in the input text, of the entity designation that passes the detection, and calculating the entity information coincidence degree between the first text and a second text corresponding to the candidate entity;
acquiring a candidate entity richness score of the candidate entity in combination with the entity information coincidence degree;
calculating the entity score of the candidate entity according to the entity relevance score and the candidate entity richness score;
storing the candidate entities whose entity scores satisfy a preset condition into a to-be-disambiguated entity list;
filtering the candidate entities in the to-be-disambiguated entity list, so that the entity designation is linked to the candidate entity obtained by the filtering.
2. The entity linking method of claim 1, wherein the entity relevance score of the entity designation is calculated by the following formula:
C3 = (L2 - I)/L2 + log(c)/log(k);
wherein C3 represents the entity relevance score;
L2 represents the text length of the input text;
I represents the first appearance position of the entity designation in the input text;
c represents the number of occurrences of the entity designation in the input text;
k represents the adjustment coefficient.
3. The entity linking method of claim 1, wherein detecting the entity designation corresponding to the entity relevance score according to the entity relevance score specifically includes:
judging whether the entity relevance score exceeds a preset entity relevance score threshold;
if yes, judging that the entity designation corresponding to the entity relevance score passes the detection;
if not, judging that the entity designation corresponding to the entity relevance score fails the detection.
4. The entity linking method according to claim 1, wherein acquiring the first text, in the input text, of the entity designation that passes the detection, and calculating the entity information coincidence degree between the first text and the second text corresponding to the candidate entity, specifically includes the steps of:
acquiring an upper text and a lower text of the entity designation in the input text, performing word segmentation on the upper text and the lower text to obtain an upper text word set and a lower text word set respectively, and setting the upper text word set and the lower text word set as the first text;
acquiring description information of the candidate entity, performing word segmentation on the description information to obtain a description word set, and setting the description word set as the second text;
and acquiring the entity information coincidence degree between the first text and the second text.
5. The entity linking method according to claim 4, wherein acquiring the candidate entity richness score of the candidate entity in combination with the entity information coincidence degree specifically comprises the steps of:
calculating the length of the description information of the candidate entity;
calculating the editing times of the description information;
calculating the reference times of the description information;
obtaining the candidate entity richness score of the candidate entity according to the following formula:
C2 = log(L1 + (u/2) + (s/2));
wherein C2 represents the candidate entity richness score;
L1 represents the length of the description information of the candidate entity;
u represents the number of edits of the description information;
s represents the number of references of the description information.
6. The entity linking method of claim 1 wherein a sum of the entity relevance score and the candidate entity richness score is taken as an entity score of the candidate entity.
7. The entity linking method according to claim 1, wherein storing the candidate entities whose entity scores satisfy the preset condition into the to-be-disambiguated entity list specifically includes the following steps:
sequentially storing the entity score of each candidate entity into an entity score storage list according to a preset arrangement order;
traversing the entity score corresponding to each candidate entity in the entity score storage list, and storing the candidate entities whose entity scores are greater than a preset entity score threshold into the to-be-disambiguated entity list.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010915722.5A CN112084777B (en) 2020-09-03 2020-09-03 Entity linking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010915722.5A CN112084777B (en) 2020-09-03 2020-09-03 Entity linking method

Publications (2)

Publication Number Publication Date
CN112084777A CN112084777A (en) 2020-12-15
CN112084777B true CN112084777B (en) 2023-09-01

Family

ID=73732782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010915722.5A Active CN112084777B (en) 2020-09-03 2020-09-03 Entity linking method

Country Status (1)

Country Link
CN (1) CN112084777B (en)

Also Published As

Publication number Publication date
CN112084777A (en) 2020-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221221

Address after: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant after: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

Applicant after: Xinhua fusion media technology development (Beijing) Co.,Ltd.

Address before: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant before: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

GR01 Patent grant