CN112001178A - Long-tail entity identification and disambiguation method - Google Patents

Long-tail entity identification and disambiguation method

Info

Publication number
CN112001178A
Authority
CN
China
Prior art keywords
entity
long
candidate
entities
tail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010875000.1A
Other languages
Chinese (zh)
Inventor
程良伦
张鸿彬
王德培
张伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202010875000.1A priority Critical patent/CN112001178A/en
Publication of CN112001178A publication Critical patent/CN112001178A/en
Pending legal-status Critical Current

Classifications

    • G06F40/00 Handling natural language data
    • G06F40/194 Calculation of difference between files
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/295 Named entity recognition
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a long-tail entity identification and disambiguation method, wherein the disambiguation method comprises a process of replacing identified long-tail entities with candidate entities. The invention can disambiguate long-tail entities accurately and efficiently, significantly improves the understanding of different mentions in a text, and enables better tracking and acquisition of information.

Description

Long-tail entity identification and disambiguation method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to entity disambiguation methods.
Background
In natural language, meaning at the word, sentence, and discourse levels varies with the semantics of the context. Disambiguation, the process of determining the semantics of an object from its context, is one of the core problems in natural language understanding.
Long-tail entities are entities with relatively few mentions in a large text corpus. They are typically characterized by having no entry, or only a limited profile, in conventional knowledge bases, or by having only scarce resources outside the knowledge base.
In the prior art there are few means for identifying and disambiguating long-tail entities. For example, when identifying long-tail entities in a specific field, such as scientific publications, by a semi-supervised method, one must find a corpus of the specific field and set related seeds, and then continuously improve the quality of the corpus and the seeds through an expansion and filtering mechanism in order to identify the long-tail entities of that field.
Disclosure of Invention
The invention aims to provide an identification and disambiguation method capable of disambiguating long-tail entities accurately and efficiently, which remarkably improves the understanding of different mentions in a text and enables better tracking and acquisition of information.
The invention firstly provides the following technical scheme:
a method of identifying long-tail entities, comprising: performing named entity recognition on a text, and screening out long-tail entities from the recognized entities through an entity linking tool.
In some embodiments, the screening comprises: if an identified entity does not appear in the entity linking tool, no summary description of it can be retrieved in the knowledge base, and its frequency of occurrence in the text does not exceed a frequency threshold, then the entity is a long-tail entity.
The invention further provides a disambiguation method for long-tail entities, comprising: screening out long-tail entities by the above identification method, and replacing the screened long-tail entities with candidate entities.
The candidate entity refers to an entity existing outside the text.
In some embodiments, the replacing comprises:
obtaining a candidate entity set consisting of candidate entities;
obtaining a prior probability of the candidate entity given an internal entity;
obtaining a similarity between the contexts of the candidate entity and the internal entity;
obtaining consistency between the candidate entity and an entity in the text;
based on the prior probability, the context similarity and the context consistency, obtaining scores of the candidate entities through machine learning, and replacing the internal entity with the highest-scoring candidate entity obtained through a learning-to-rank algorithm;
wherein the internal entity is the entity designation within the long-tail entity to which the candidate entity corresponds;
the context similarity comprises a weighted vector cosine similarity between the candidate entity and a context entity;
the context consistency includes a mean of vector cosine similarities of the candidate entity to all entities within the text.
The context may be set to a range of 100 words, namely the 50 words before and the 50 words after the internal entity.
Preferably, the vectors are obtained by a word vector model such as a skip-gram model, the corpus of the model comprising all entities obtained by named entity recognition, other non-entity words, and the internal entities.
In some embodiments, the weighted vector cosine similarity is obtained by multiplying a candidate entity vector by a context feature vector, the context feature vector being obtained by multiplying the context entities by their weight matrix, in which the weight values corresponding to words labeled by dependency syntactic analysis as compound words and nouns are greater than the weight values of other words.
In some embodiments, the weighted vector cosine similarity is obtained as a local score from a local model that includes a self-attention mechanism.
In some embodiments, the obtaining of the internal entity comprises:
performing part-of-speech and relation analysis on the long-tail entity through dependency syntax analysis, and labeling;
if a compound word (compound) exists in the label, then:
cutting the compound word entity;
cutting the parts except the compound word entity in sequence;
removing the portions labeled as case-marker words ("case");
if no compound word exists in the labels, then:
segmenting the long-tail entity according to the words labeled as case markers;
cutting the segmented parts in sequence;
removing the portions labeled as case-marker words;
and forming an internal entity set from the cut entities after the case-marker words are removed.
In some embodiments, the entity tailoring comprises:
segmenting the part to be cut according to a certain number n of words;
retrieving the segmented part in a knowledge base;
if a relevant summary is retrieved, the segmented part is a cutting segment;
if no related summary is retrieved, reducing the number n of words, segmenting again, and performing retrieval and judgment on the re-segmented parts until all cutting segments are obtained;
and cutting the part to be cut according to the cutting section sequence.
Preferably, the entity clipping is realized by an n-gram method.
In some embodiments, the candidate entity is a similar entity of the internal entity within a knowledge base, wherein the similarity includes abbreviation matching and/or string similarity.
Preferably, the string similarity is cosine similarity of string bag-of-words vectors.
Preferably, the string bag-of-word vector may be obtained by training a bag-of-word model (BOW model).
In some embodiments, the prior probability is obtained by:
P(e|m) = |A_{e,m}| / |A_{*,m}|
where m denotes an entity designation, e denotes the candidate entity for which the prior probability is currently computed, and P(e|m) denotes the probability that candidate entity e occurs given that the entity designation m occurs.
|A_{*,m}| denotes the number of anchors (hyperlinks) in the knowledge base, or in a dump of the knowledge base, whose surface form is the same as the entity designation m; in a knowledge base such as Wikipedia, entities that appear in the same page, directly or through a hyperlink, are co-occurring entities, the surface is the text form under which they appear, and anchors are the hyperlinks that link to the same page.
|A_{e,m}| denotes the number of anchors with the same surface form in which candidate entity e co-occurs with the entity designation.
In some embodiments, the disambiguation method further comprises restoring the case-marker words to their positions after all internal entities have been replaced.
The invention has the following beneficial effects: the disambiguation method can process long-tail entities of relatively great length, identify them accurately, and disambiguate them accurately.
The method identifies long-tail entities in an industrial process text through an entity linking system and a Stanford tool, and uses dependency syntactic analysis to assist in cutting the long-tail entities. For each cut long-tail entity, a machine learning method combined with self-attention-embedded context features yields the final score of every candidate entity of a cut entity designation; the highest-scoring candidate entity is linked to the corresponding cut entity and replaces it, and the linking proceeds in turn until all internal entities of the long-tail entity are linked.
For long-tail entities, dependency syntactic analysis and the n-gram method assist the cutting, so that the internal entities can be disambiguated and replaced one by one, which in turn improves the disambiguation accuracy of the remaining internal entities.
Through a self-attention mechanism, the invention can make full use of the information of the related entities within the long-tail entity.
Drawings
Fig. 1 is a flow chart illustrating a method for identifying and disambiguating according to an embodiment of the present invention.
Fig. 2 is a process diagram of the self-attention mechanism incorporating the embedded feature according to the embodiment of the present invention.
Fig. 3 is a diagram illustrating a result of dependency parsing according to embodiment 1 of the present invention.
FIG. 4 is a schematic diagram of the structure and processing procedure of the Deep-ed model in embodiment 1 of the present invention.
Detailed Description
The invention is described in detail below with reference to the attached drawings, but it should be understood that the drawings are only for illustrative purposes and do not limit the scope of the invention. All reasonable variations and combinations that fall within the spirit of the invention are intended to be within the scope of the invention.
An industrial process text containing long-tail entities is processed through the flow shown in the attached FIG. 1, which comprises long-tail entity identification and long-tail entity disambiguation, specifically as follows:
S1 Long-tail entity identification
More specifically, it may include:
S10 performs entity recognition on the industrial process text through a natural language processing tool, such as the StanfordCoreNLP tool, to obtain all entities, including the long-tail entities.
S11 links all entities identified in the text through an entity linking tool, such as the Tagme tool, and screens out long-tail entities according to the linking results.
The screening process comprises the following steps:
if an entity identified in S10 does not appear in the links of S11, no relevant summary description of it can be retrieved in common knowledge bases such as Wikipedia, and its frequency of occurrence in the text does not exceed a frequency threshold, such as 10 times, then the entity is a long-tail entity.
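The screening rule can be summarized in a short sketch (a minimal illustration, not the implementation of the invention; the inputs linked, kb_summaries and mention_counts are hypothetical stand-ins for the Tagme link results, the Wikipedia summary lookup, and the in-text frequency counts):

    def screen_long_tail(recognized, linked, kb_summaries, mention_counts,
                         freq_threshold=10):
        # recognized: entities from named entity recognition (S10)
        # linked: entities the entity linking tool (e.g. Tagme) could link (S11)
        # kb_summaries: entities with a summary page in the knowledge base
        # mention_counts: occurrences of each entity string in the text
        long_tail = []
        for e in recognized:
            if (e not in linked                        # no link produced
                    and e not in kb_summaries          # no Wikipedia summary
                    and mention_counts.get(e, 0) <= freq_threshold):
                long_tail.append(e)
        return long_tail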
S2 Long-tailed entity disambiguation
More specifically, it may include:
S20 preprocesses the recognized long-tail entities to obtain a clipped entity designation set.
Preprocessing includes dependency parsing and part-of-speech analysis, such as:
S201 performs dependency syntactic analysis on the identified long-tail entity through a dependency parsing tool, such as StanfordCoreNLP, and labels the parts of speech within the long-tail entity.
After dependency parsing, the long-tail entity is processed according to whether it contains a word labeled "compound", as follows:
S202A: when, after dependency parsing, a word labeled "compound" exists in the long-tail entity, the entity labeled "compound", that is, the compound word, is clipped. When the compound word is preceded by a word labeled "amod", that is, an adjectival modifier, the compound-word entity comprises the compound word together with the preceding adjective.
The cutting can be concretely realized by:
The compound-word entity is segmented by the n-gram method, and each segment is searched in a knowledge base. If the search returns relevant information about a segment, that segment is a cutting segment; if not, the segment is further divided to reduce the number of words per segment, and the knowledge base is searched again to confirm whether each part is a cutting segment. After all cutting segments are confirmed, the compound-word entity is cut in cutting-segment order to obtain the cut entities.
In the first segmentation, n is set to the number of words in the longest compound-word entity, and n is then reduced by 1 at each subsequent segmentation.
The knowledge base used may be, for example, Wikipedia or the like.
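The cutting loop can be sketched as follows, assuming a hypothetical kb_lookup(phrase) that returns True when the knowledge base yields a relevant summary for the phrase; greedily keeping the longest span for which a summary exists approximates the shrink-n procedure described above (the sketch starts from the whole remaining span rather than from the longest compound-word length, a simplification):

    def ngram_cut(words, kb_lookup):
        # Split `words` into cutting segments: try the longest n-gram first,
        # keep it once the knowledge base returns a summary, then continue
        # on the rest; single words are accepted as a fallback.
        segments = []
        start = 0
        while start < len(words):
            for n in range(len(words) - start, 0, -1):  # shrink n until a hit
                phrase = " ".join(words[start:start + n])
                if n == 1 or kb_lookup(phrase):
                    segments.append(phrase)
                    start += n
                    break
        return segments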
S203A cuts the remaining part outside the compound-word entity by the same n-gram process to obtain cut entities.
S204A removes the words labeled "case" by dependency parsing, i.e., the case-marker words, from the long-tail entity to reduce noise.
The entity designation set is then formed from the clipped entities without the "case" words, namely the cut entities obtained from the compound-word entity and those obtained from the remaining part.
S202B: when, after dependency parsing, the long-tail entity contains no word labeled "compound", the long-tail entity is segmented at the words labeled "case"; the segments are retrieved and cut through the same n-gram process to obtain cut entities, and the words labeled "case" are then removed to reduce noise, yielding the entity designation set.
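As an illustration of the labeling used in S201-S202B, the following sketch collects the "compound", "amod", and "case" words of a mention with the stanza library (a Python counterpart to StanfordCoreNLP; using stanza instead of the Java toolkit is an assumption of this sketch):

    import stanza

    # stanza.download("en")  # one-time model download
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

    def analyze_mention(mention):
        # Return the compound words, preceding adjectives (amod) and
        # case-marker words of a long-tail entity mention.
        words = nlp(mention).sentences[0].words
        compounds = [w.text for w in words if w.deprel == "compound"]
        amods = [w.text for w in words if w.deprel == "amod"]
        cases = [w.text for w in words if w.deprel == "case"]
        return compounds, amods, cases

    # For "CAD of Management Software for Mould" this is expected to tag
    # "Management" as compound and "of"/"for" as case, as in Example 1 below.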
S21 obtains a candidate entity set from the clipped entity designations.
It may further comprise:
For each of the clipped entity designations, other entities with similarity are obtained from a knowledge base such as Wikipedia, and these entities form the candidate entity set of that entity designation.
The Wikipedia sources comprise the redirection page of an entity designation, its disambiguation page, and hyperlink names, searched through URLs and the page search box; other entities that share a common link with the entity designation; and the dump files that Wikipedia provides from time to time, which likewise contain redirection pages, disambiguation pages, and anchors, i.e., hyperlinks.
Similarity includes similarity of abbreviations and similarity of character strings.
Abbreviation similarity is an exact match against the strings of a public abbreviation list. String similarity is the cosine similarity of string bag-of-words vectors; two strings are considered similar when the cosine similarity of their bag-of-words vectors is greater than 50%. The string bag-of-words vectors may be obtained by training a bag-of-words (BOW) model.
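The candidate test can be sketched as follows; the abbreviations mapping is a hypothetical stand-in for the public abbreviation list, while the bag-of-words cosine with the 50% threshold follows the text:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    def bow_cosine(a, b):
        # Cosine similarity of the bag-of-words vectors of two strings.
        vec = CountVectorizer().fit([a, b])
        va, vb = vec.transform([a, b]).toarray().astype(float)
        denom = np.linalg.norm(va) * np.linalg.norm(vb)
        return float(va @ vb) / denom if denom else 0.0

    def is_candidate(mention, kb_title, abbreviations):
        if abbreviations.get(mention) == kb_title:  # exact abbreviation match
            return True
        return bow_cosine(mention, kb_title) > 0.5  # string similarity > 50%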
S22 obtains a prior probability of a candidate entity corresponding to the entity designation.
Wherein the prior probability is obtained by:
P(e|m) = |A_{e,m}| / |A_{*,m}|
where m denotes an entity designation, e denotes the candidate entity for which the prior probability is currently computed, and P(e|m) denotes the probability that candidate entity e occurs given that the entity designation m occurs.
|A_{*,m}| denotes the number of anchors (hyperlinks) in the knowledge base, or in a dump of the knowledge base, whose surface form is the same as the entity designation m; in a knowledge base such as Wikipedia, entities that appear in the same page, directly or through a hyperlink, are co-occurring entities, the surface is the text form under which they appear, and anchors are the hyperlinks that link to the same page.
|A_{e,m}| denotes the number of anchors with the same surface form in which candidate entity e co-occurs with the entity designation.
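The prior can be computed from precollected anchor statistics; the sketch below assumes a hypothetical anchor_counts dictionary, built from a Wikipedia dump, that maps each surface form m to its per-entity anchor counts:

    def prior_probability(anchor_counts, mention, entity):
        # P(e|m) = |A_{e,m}| / |A_{*,m}|
        per_entity = anchor_counts.get(mention, {})
        total = sum(per_entity.values())          # |A_{*,m}|
        if total == 0:
            return 0.0
        return per_entity.get(entity, 0) / total  # |A_{e,m}| / total

    # With anchor_counts["CAD"]["Computer-aided design"] = 468 and a total of
    # 513 anchors for "CAD", the prior is 468/513 = 0.912, as in Example 1.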
S23, training the word vector model through all the entities and other words obtained through S10 and the clipped entity names obtained through S20 to obtain a common vector space.
Wherein the word vector model may be selected as the extended skip-gram model.
The common vector space may be arranged as 300 dimensions.
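As a sketch of this step, gensim's skip-gram Word2Vec (sg=1) can stand in for the extended skip-gram model named here; joining multi-word designations into single tokens (e.g. "Management_Software") is an assumption of this illustration:

    from gensim.models import Word2Vec

    # Toy corpus: tokenized sentences in which recognized entities and clipped
    # entity designations have been joined into single tokens.
    sentences = [
        ["the", "CAD", "of", "Management_Software", "for", "Mould"],
        ["the", "PLC", "for", "Elevator", "is", "installed"],
    ]
    model = Word2Vec(sentences=sentences, vector_size=300, sg=1,  # skip-gram
                     window=5, min_count=1, workers=1)
    vec = model.wv["Management_Software"]  # 300-dimensional joint-space vector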
S24 obtains the similarity of the contexts of the candidate entities and their corresponding entity designations and the consistency between the candidate entities and the text entities based on the obtained candidate entity set and the common vector space.
The context of an entity designation can be set to a range of 100 words, for example the 50 words before and the 50 words after the entity designation.
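A one-function sketch of this window, assuming tokens is the tokenized text and mention_index the position of the entity designation:

    def context_window(tokens, mention_index, width=50):
        # The 50 tokens before and the 50 tokens after the designation.
        left = tokens[max(0, mention_index - width):mention_index]
        right = tokens[mention_index + 1:mention_index + 1 + width]
        return left + right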
Specifically, the similarity and consistency are obtained by establishing local and global models containing a self-attention mechanism.
The process of adding the self-attention mechanism can be shown in fig. 2.
More specifically, the method may further include:
S241 performs local modeling to obtain a local score for the contextual similarity between the candidate entity and the entity designation.
A specific local model may use the local part of the Deep-ed model from the paper Deep Joint Entity Disambiguation with Local Neural Attention (Octavian-Eugen Ganea, Thomas Hofmann, EMNLP 2017), or a similar model. The model generates a context feature vector for the entity through attention and then multiplies it by the vector of the candidate entity to obtain the local score.
More specifically, the invention can add a self-attention mechanism through the soft attention layer of the model, assigning greater vector weight to words that stand in a compound-word entity relation and whose part of speech is labeled as noun by dependency syntactic analysis. For example, the 300 x 1 vectors converted from the context words are arranged in sequence to obtain a 300 x N word vector matrix, where N is the total number of context words, and this word vector matrix is multiplied by an N x 300 weight matrix with different weight assignments to obtain a weighted context feature vector.
The weight matrix may be assigned as follows: all values of the rows corresponding to words labeled as compound-word entities with part of speech noun are 2 (i.e., a 300-dimensional row whose values are all 2), and all values of the other rows are 1.
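Under a simplified reading of this weighting (elementwise weights of 2 for compound-word nouns and 1 for other words, aggregated into a single context feature vector), the local score can be sketched as follows; the attention layers of the Deep-ed model itself are omitted here:

    import numpy as np

    def local_score(cand_vec, context_vecs, is_compound_noun):
        # context_vecs: 300 x N matrix of context word vectors;
        # is_compound_noun: N booleans marking compound-word nouns.
        w = np.where(np.asarray(is_compound_noun), 2.0, 1.0)  # N weights
        ctx = (context_vecs * w).sum(axis=1)  # weighted 300-d context feature
        denom = np.linalg.norm(cand_vec) * np.linalg.norm(ctx)
        return float(cand_vec @ ctx) / denom if denom else 0.0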
S242 obtains a global score for evaluating the consistency of the candidate entity with the entities in the text.
Consistency means that all entities in the same text relate to one topic, that is, topic consistency exists; for example, all entities in a certain text describe the topic of a steam turbine in an industrial process.
The global score may be set as the mean of the cosine similarities between the vector of the candidate entity and the vectors of all entities in the text:
global score = (1/n) Σ_{i=1..n} X_i
where X_i denotes the vector cosine similarity between the candidate entity and the i-th entity in the text, and n denotes the number of all entities in the text.
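A direct sketch of this mean, with entity_vecs the vectors of the n entities in the text:

    import numpy as np

    def global_score(cand_vec, entity_vecs):
        # Mean cosine similarity between the candidate entity vector and
        # the vectors of all entities in the text.
        sims = []
        for v in entity_vecs:
            denom = np.linalg.norm(cand_vec) * np.linalg.norm(v)
            sims.append(float(cand_vec @ v) / denom if denom else 0.0)
        return sum(sims) / len(sims) if sims else 0.0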
S25 Replacement disambiguation
The prior probability of the candidate entity obtained in S22, the local score obtained in S241, and the global score obtained in S242 are concatenated and input as features into a feedforward neural network with a hidden size of 100. The final score is output through two fully-connected layers, and the candidate entity with the highest score is the closest replacement entity for the entity designation.
The feedforward neural network is trained by supervised learning, for example using the multi-class hinge loss (MultiMarginLoss) as the loss function and repeatedly updating the model parameters through back propagation to reduce the loss until training is complete.
The trained model generates the final scores of the candidate entities; ranking the scores of the different candidate entities yields the highest-scoring candidate entity, which is linked to the corresponding entity designation and replaces it, thereby realizing entity disambiguation.
S26 Completing the long-tail entity disambiguation
Disambiguation of one entity designation is completed through S25; the other entity designations cut from the long-tail entity are then disambiguated according to the process of S24-S25. After all entity designations have been disambiguated in sequence, the words labeled "case" that were removed in S204A (or S202B) are restored to their positions, completing the long-tail entity disambiguation.
Example 1
Long-tailed entity disambiguation of a text passage in an industrial process text, comprising:
and (3) long-tail entity identification:
First, entity recognition is performed on the text through the StanfordCoreNLP tool, and all entities in the text are recognized.
All entities are then linked through Tagme. Entities that do not appear in the links, for which no relevant summary description can be retrieved in Wikipedia, and whose frequency of occurrence in the text does not exceed 10 are judged to be long-tail entities.
The following long-tail entities can be obtained through the above process:
(1) "CAD of Management Software for Mould".
(2) "Programmable Logic Controller for Elevator".
(3) "PLC for Elevator".
(4) "Subsidiary Company of General Electric".
Dependency parsing and part-of-speech analysis are performed on these long-tail entities. Taking the first long-tail entity as an example, the dependency parse is shown in FIG. 3, from which it can be seen that:
in this entity, the words Management Software are labeled compound, and of and for are labeled case; since there is no other adjective before the compound word, the compound-word entity is Management Software, with word count n = 2.
Management Software is first searched through Wikipedia, and the result shows that this compound-word entity has a summary description page, so Management Software is a cutting segment. According to this cutting segment, the long-tail entity is first divided into "CAD of", "Management Software", and "for Mould".
Similarly, the long-tail entity "Programmable Logic Controller for Elevator" is subjected to dependency parsing and part-of-speech tagging: the words tagged compound are Logic Controller, the word tagged case is for, and since the compound word Logic Controller is preceded by the adjective Programmable tagged amod, the complete compound-word entity is Programmable Logic Controller, with word count n = 3. Programmable Logic Controller is searched through Wikipedia, the result shows that its summary description page exists, and the long-tail entity is first divided into "Programmable Logic Controller" and "for Elevator".
Similarly, the long-tail entity "PLC for Elevator" is subjected to dependency parsing and part-of-speech analysis; no word labeled compound is found, so it is segmented at the word labeled case and first divided into "PLC", "for", and "Elevator".
Similarly, dependency parsing and part-of-speech tagging are performed on the long-tail entity "Subsidiary Company of General Electric": the words tagged compound are General Electric, with and of are tagged case, and the compound-word entity has word count n = 2. General Electric is searched through Wikipedia and its related summary description appears, so the long-tail entity is first divided at the compound-word entity "General Electric".
The remaining portions are then similarly trimmed as follows:
The remaining parts of long-tail entity (1) comprise "CAD of" and "for Mould". "CAD of" and "for Mould" are searched through Wikipedia and no summary description exists, so the number n of search words is reduced, and "CAD", "for", and "Mould" are searched in sequence; the results display their related summary descriptions. The long-tail entity is therefore finally cut into "CAD", "of", "Management Software", "for", and "Mould".
Similarly, "for Elevator" in long-tail entity (2) cannot be found through a Wikipedia search, so the number n of search words is reduced to 1, that is, "for" and "Elevator" are searched in sequence, and the results display their related summary descriptions. The long-tail entity is therefore finally cut into "Programmable Logic Controller", "for", and "Elevator".
Similarly, "PLC" and "Elevator" in long-tail entity (3) are searched through Wikipedia, and the results display their related summary descriptions. The long-tail entity is therefore finally cut into "PLC", "for", and "Elevator".
Similarly, the remaining parts of long-tail entity (4) are searched through Wikipedia and the related summary descriptions are present. The long-tail entity is therefore finally cut into "Subsidiary", "with", "Company", "of", and "General Electric".
Finally, the words labeled case in each long-tail entity are discarded; for example, "of" and "for" are discarded in "CAD of Management Software for Mould". Combined with the cutting of the long-tail entity, the entity designations of this long-tail entity are finally obtained as "CAD", "Management Software", and "Mould", three in total.
Similarly, "for" is discarded in "Programmable Logic Controller for Elevator", and the entity designations of this long-tail entity are finally obtained as "Programmable Logic Controller" and "Elevator", two in total.
Similarly, "for" is discarded in "PLC for Elevator", and the entity designations of this long-tail entity are finally obtained as "PLC" and "Elevator", two in total.
Similarly, "with" and "of" are discarded in "Subsidiary Company of General Electric", and the entity designations of this long-tail entity are finally obtained as "Subsidiary", "Company", and "General Electric", three in total.
Part-of-speech analysis, for example of "CAD of Management Software for Mould", shows that CAD, Management, Software, and Mould are nouns, which prepares for the local modeling below.
Similarly, "Subsidiary", "Company", "General", and "Electric" in "Subsidiary Company of General Electric" are all nouns, ready for the later local modeling.
Each obtained entity designation generates a candidate entity list by searching related Wikipedia information and matching abbreviations and string similarity. For example, for the entity designation "CAD" of long-tail entity (1), the following 3 candidate entities can be obtained: "Computer-aided design", "currency code of the Canadian dollar", and "an enzyme-encoding gene".
|A_{*,m}| and |A_{e,m}| are obtained through the data preprocessing code of the Deep-ed model, and the prior probability of a candidate entity given the entity designation is calculated as follows:
P(e|m) = |A_{e,m}| / |A_{*,m}|.
Taking the entity designation "CAD" of long-tail entity (1) as an example: in Wikipedia, the number of anchors with the same surface in which the candidate entity "Computer-aided design" co-occurs with the designation "CAD", i.e., its |A_{e,m}|, is 468, and the total number of anchors in Wikipedia with the same surface as the designation "CAD", i.e., its |A_{*,m}|, is 513, so the prior probability of the candidate entity "Computer-aided design" given the designation "CAD" is 468/513 = 0.912.
Similarly, the prior probability of the candidate entity "currency code of the Canadian dollar" given the designation "CAD" is 0.893, and the prior probability of the candidate entity "an enzyme-encoding gene" given the designation "CAD" is 0.914.
All words, all entities, and all clipped entity designations of the long-tail entities in the industrial process text are trained together through the extended skip-gram model, jointly mapping words and entities into the same continuous vector space; the vector space dimension of the model is set to 300.
The context similarity between a candidate entity and an entity designation is obtained through local modeling, yielding the local score of the candidate entity for the entity designation. The local model is the Deep-ed model shown in FIG. 4, with input dimension 300. In the context of the entity, since "Management" and "Software" form a compound-word entity and both parts of speech are labeled noun, every dimension of the 300-dimensional input weight corresponding to these two words has value 2, and the other weights are 1.
The cosine similarities between the candidate entity and the other entities in the text passage are obtained through the global calculation, yielding the consistency score of the candidate entity with the other entities.
The obtained local score, global score, and prior probability are concatenated along the last dimension (concatenate) and fed into a feedforward neural network with input size 3, hidden-layer dimension 100, and output dimension 1. Concatenation is realized with the cat function of the deep learning framework torch, and the neural network is trained by stochastic gradient descent under supervised learning.
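A sketch of this scoring network in torch follows; the feature values and the gold index are hypothetical, the layer sizes and the MultiMarginLoss follow the text, and the ReLU between the two fully-connected layers is an assumption of this sketch:

    import torch
    import torch.nn as nn

    class ScoreNet(nn.Module):
        # Input 3 (prior, local score, global score), hidden 100, output 1.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(3, 100), nn.ReLU(),
                                     nn.Linear(100, 1))

        def forward(self, feats):               # feats: (num_candidates, 3)
            return self.net(feats).squeeze(-1)  # one score per candidate

    model = ScoreNet()
    loss_fn = nn.MultiMarginLoss()              # multi-class hinge loss
    optim = torch.optim.SGD(model.parameters(), lr=0.01)

    # Hypothetical features of 3 candidates of one mention, each row being
    # the concatenated (prior, local, global) triple, and the gold index.
    feats = torch.tensor([[0.912, 0.7, 0.6],
                          [0.893, 0.2, 0.3],
                          [0.914, 0.1, 0.2]])
    gold = torch.tensor([0])

    scores = model(feats)
    loss = loss_fn(scores.unsqueeze(0), gold)   # hinge loss over candidates
    loss.backward()
    optim.step()
    optim.zero_grad()

    best = int(torch.argmax(scores))  # the highest-scoring candidate
                                      # replaces the entity designation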
Each entity designation and its candidate entities are input into the trained model for screening.
For example, in the long-tail entity "CAD of Management Software for Mould", the highest-scoring candidate entity for the designation "CAD" is "Computer-aided design", which replaces the cut entity designation "CAD".
Similarly, the highest-scoring candidate entity for the designation "Management Software" in the long-tail entity "CAD of Management Software for Mould" is "Project Management Software", which replaces the cut entity designation "Management Software".
After the replacement of all entity designations is completed, the parts marked as "case" in the dependency syntactic and part-of-speech analysis are restored to the replaced entities, yielding the following text:
the long-tail entity "CAD of Management Software for Mould" is finally disambiguated as "Computer-aided design of Project Management Software for Molding".
It can be seen that in the final text obtained through the above process, each entity has a clear and accurate definition, and the disambiguation effect is good.
The above examples are merely preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples. All technical schemes belonging to the idea of the invention belong to the protection scope of the invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention, and such modifications and embellishments should also be considered as within the scope of the invention.

Claims (10)

1. A method for identifying long-tail entities, characterized by comprising: performing named entity recognition on a text, and screening out long-tail entities from the recognized entities through an entity linking tool.
2. The identification method according to claim 1, characterized in that the screening comprises: if an identified entity does not appear in the entity linking tool, no summary description of it can be retrieved in the knowledge base, and its frequency of occurrence in the text does not exceed a frequency threshold, then the entity is a long-tail entity.
3. A disambiguation method for long-tail entities, characterized by comprising: screening out long-tail entities by the identification method according to claim 1 or 2, and replacing the screened long-tail entities with candidate entities.
4. The disambiguation method according to claim 3, characterized in that the replacing comprises:
obtaining a candidate entity set consisting of candidate entities;
obtaining a prior probability of the candidate entity given an internal entity;
obtaining a similarity between the contexts of the candidate entity and the internal entity;
obtaining consistency between the candidate entity and an entity in the text;
based on the prior probability, the similarity between the contexts, and the consistency, obtaining the scores of the candidate entities through machine learning, and replacing the internal entity with the highest-scoring candidate entity;
wherein the internal entity is the entity designation within the long-tail entity to which the candidate entity corresponds;
the context similarity comprises a weighted vector cosine similarity between the candidate entity and a context entity;
the context consistency includes a mean of vector cosine similarities of the candidate entity to all entities within the text.
5. The disambiguation method according to claim 4, characterized in that the weighted vector cosine similarity is obtained by multiplying a candidate entity vector by a context feature vector, the context feature vector being obtained by multiplying the context entities by their weight matrix, in which the weight values corresponding to words labeled by dependency syntactic analysis as compound words and nouns are greater than the weight values of other words.
6. The disambiguation method according to claim 4, characterized in that the obtaining of the internal entity comprises:
performing part-of-speech and relation analysis on the long-tail entity through dependency syntax analysis, and labeling;
if a compound word exists in the labels, then:
cutting the compound-word entity;
cutting the parts other than the compound-word entity in sequence;
removing the portions labeled as case-marker words;
if no compound word exists in the labels, then:
segmenting the long-tail entity according to the words labeled as case markers;
cutting the segmented parts in sequence;
removing the portions labeled as case-marker words;
and forming an internal entity set from the cut entities after the case-marker words are removed.
7. The disambiguation method according to claim 6, characterized in that the cutting comprises:
segmenting the part to be cut according to a certain number n of words;
retrieving the segmented part in a knowledge base;
if a relevant summary is retrieved, the segmented part is a cutting segment;
if no related summary is retrieved, reducing the number n of words, segmenting again, and performing retrieval and judgment on the re-segmented parts until all cutting segments are obtained;
and cutting the part to be cut according to the cutting section sequence.
8. The disambiguation method according to claim 4, characterized in that the candidate entity is a similar entity of the internal entity in a knowledge base, wherein the similarity comprises abbreviation matching and/or string similarity; preferably, the string similarity is the cosine similarity of string bag-of-words vectors.
9. The disambiguation method according to claim 4, characterized in that the prior probability is:
P(e|m) = |A_{e,m}| / |A_{*,m}|,
where m denotes an entity designation, e denotes a candidate entity, |A_{*,m}| denotes the number of anchors in the knowledge base or its dump with the same surface as the entity designation m, and |A_{e,m}| denotes the number of anchors with the same surface in which candidate entity e co-occurs with the entity designation.
10. The disambiguation method according to any one of claims 6-9, characterized in that it further comprises restoring the case-marker words to their positions after all internal entities have been replaced.
CN202010875000.1A 2020-08-27 2020-08-27 Long-tail entity identification and disambiguation method Pending CN112001178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010875000.1A CN112001178A (en) 2020-08-27 2020-08-27 Long-tail entity identification and disambiguation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010875000.1A CN112001178A (en) 2020-08-27 2020-08-27 Long-tail entity identification and disambiguation method

Publications (1)

Publication Number Publication Date
CN112001178A true CN112001178A (en) 2020-11-27

Family

ID=73470415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010875000.1A Pending CN112001178A (en) 2020-08-27 2020-08-27 Long-tail entity identification and disambiguation method

Country Status (1)

Country Link
CN (1) CN112001178A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090276437A1 (en) * 2008-04-30 2009-11-05 Microsoft Corporation Suggesting long-tail tags
US20150356127A1 (en) * 2011-02-03 2015-12-10 Linguastat, Inc. Autonomous real time publishing
CN105224648A (en) * 2015-09-29 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of entity link method and system
WO2019174422A1 (en) * 2018-03-16 2019-09-19 北京国双科技有限公司 Method for analyzing entity association relationship, and related apparatus
CN109446300A (en) * 2018-09-06 2019-03-08 厦门快商通信息技术有限公司 A kind of corpus preprocess method, the pre- mask method of corpus and electronic equipment
CN109858018A (en) * 2018-12-25 2019-06-07 中国科学院信息工程研究所 A kind of entity recognition method and system towards threat information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SEPIDEH MESBAH ET AL.: "TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications", ISWC 2018: The Semantic Web, 18 September 2018 (2018-09-18), pages 127-143, XP047487927, DOI: 10.1007/978-3-030-00671-6_8 *
张晓娟; 彭琳; 李倩: "Review of research on query recommendation" (查询推荐研究综述), 情报学报 (Journal of the China Society for Scientific and Technical Information), no. 04, 24 April 2019 (2019-04-24), pages 102-116 *
李禹恒; 宋俊; 黄宇; 付琨; 吴一戎; 陈昊: "Hierarchical entity linking method based on microblog text" (基于微博文本的层次化实体链接方法), 吉林大学学报(工学版) (Journal of Jilin University, Engineering and Technology Edition), no. 03, 15 May 2016 (2016-05-15), pages 225-231 *
邓博研; 程良伦: "Chinese named entity recognition method based on ALBERT" (基于ALBERT的中文命名实体识别方法), 计算机科学与应用 (Computer Science and Application), vol. 10, no. 5, 12 May 2020 (2020-05-12), pages 883-892 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112464669B (en) * 2020-12-07 2024-02-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device, and storage medium
CN112989804A (en) * 2021-04-14 2021-06-18 广东工业大学 Entity disambiguation method based on stacked multi-head feature extractor
CN112989804B (en) * 2021-04-14 2023-03-10 广东工业大学 Entity disambiguation method based on stacked multi-head feature extractor

Similar Documents

Publication Publication Date Title
CN107704892B (en) A kind of commodity code classification method and system based on Bayesian model
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
Yu et al. Resume information extraction with cascaded hybrid model
CN108446271B (en) Text emotion analysis method of convolutional neural network based on Chinese character component characteristics
CN107315738B (en) A kind of innovation degree appraisal procedure of text information
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN108509423A (en) A kind of acceptance of the bid webpage name entity abstracting method based on second order HMM
Metwally et al. A multi-layered approach for Arabic text diacritization
CN101901213A (en) Instance-based dynamic generalization coreference resolution method
CN112001178A (en) Long-tail entity identification and disambiguation method
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN115017903A (en) Method and system for extracting key phrases by combining document hierarchical structure with global local information
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN106021413B (en) Auto-expanding type feature selection approach and system based on topic model
Ezhilarasi et al. Depicting a Neural Model for Lemmatization and POS Tagging of words from Palaeographic stone inscriptions
CN114048314A (en) Natural language steganalysis method
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN110705306A (en) Evaluation method for consistency of written and written texts
CN113095087B (en) Chinese word sense disambiguation method based on graph convolution neural network
CN115481636A (en) Technical efficacy matrix construction method for technical literature
Chowdhury et al. Detection of compatibility, proximity and expectancy of Bengali sentences using long short term memory
CN110472243B (en) Chinese spelling checking method
JP3889010B2 (en) Phrase classification system, phrase classification method, and phrase classification program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination