CN105912625A - Linked data oriented entity classification method and system - Google Patents

Linked data oriented entity classification method and system

Info

Publication number
CN105912625A
Authority
CN
China
Prior art keywords
entity
classification
class
physical page
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610213411.8A
Other languages
Chinese (zh)
Other versions
CN105912625B (en
Inventor
葛涛
穗志方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201610213411.8A
Publication of CN105912625A
Application granted
Publication of CN105912625B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a linked-data-oriented entity classification method and system that address the problem of classifying entities in linked data. The method comprises preprocessing, statistical classification, and post-processing. Preprocessing: the text description in an entity page is segmented into words, and the infobox attribute names together with the resulting words form the features of the entity page. Statistical classification: statistical classification models are trained at several segmentation granularities and used to classify the entity page, yielding a preliminary prediction of the entity class. Post-processing: the statistical classification result is corrected through model fusion, linguistic knowledge, link information, and class-associated attribute information applied to the fused entity class. The method and system are easy to implement and debug, efficient, and accurate; they can be used for knowledge management of linked data and achieve high-precision entity classification.

Description

An entity classification method and system for linked data
Technical field
The invention belongs to the field of information processing and relates to the classification and retrieval of linked data, and in particular to a method and system for classifying the entity pages of linked data with high precision.
Background
We are now in the era of big data, and how to make the fullest use of data to help computers process information has become one of the most popular research topics in information processing. In recent years, with the arrival of the Web 2.0 era, linked data (such as the Semantic Web and knowledge graphs) has attracted wide attention because of its powerful ability to describe relations. Linked data refers to data organized in the manner of Baidu Baike or Wikipedia: each page corresponds to one entity, and entities link to one another, hence the name linked data. As data volumes keep growing, managing linked data manually is no longer practical, and efficient methods and systems for knowledge management of linked data are urgently needed.
Entity classification is an important technical problem in the knowledge management of linked data. Classifying the entities in linked data makes it possible to organize the large number of entity pages effectively, which improves the user's search and reading experience.
At present, the common approach to entity classification is to classify the description text of the entity. In many cases, however, this simple approach cannot determine the class of an entity accurately. Its main shortcomings are:
(1) For a human, judging the class of an entity from its text description is easy, but for current feature-based statistical classifiers it is unrealistic to expect the entity class to be determined from the text description with high accuracy. For example, the sentences "X is an animation adapted from a famous game" and "A is a game based on a famous animation" are very similar at the lexical level, yet the former describes an animation entity and the latter a game entity, so the entity types they describe are entirely different. A statistical classifier based purely on text features therefore lacks the precision needed to determine entity classes accurately.
(2) Many entity pages do not have enough text description. In that case, classifying the entity from the text description alone inevitably leads to classification errors, and the entity class cannot be obtained from the text.
Summary of the invention
To overcome the above shortcomings of the prior art, the invention provides an entity classification method and system for linked data that achieve high-precision entity classification through a statistical classification process followed by a post-processing process. The statistical classification process classifies entities by modeling their text. The post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result; it includes model fusion, linguistic knowledge, link information, and the use of class-associated attribute information to correct the fused entity class.
An entity page in linked data generally contains a text description and an infobox. The invention segments the text description and extracts the infobox attribute names together with the resulting words as the feature representation of the entity page. Maximum entropy models trained at several segmentation granularities then classify the entity page, giving a preliminary prediction of the entity class. The predicted class is then post-processed to check whether the classification is reliable. Post-processing specifically includes: fusing the results of the classifiers trained on features from different segmentation granularities; correcting obvious prediction errors using the class-associated attribute information in the class attribute database; and deeply understanding the first sentence of the text description, analyzing its structure with methods such as syntactic parsing to obtain entity class information and correct the earlier prediction. Preferably, a confusion matrix is also used to identify classes that are hard to classify correctly, and predictions of those classes are verified further, by correcting the entity class using the classes of the neighboring pages that the entity page links to and by using the affix information of the entity page.
The technical solution provided by the invention is as follows:
An entity classification method for linked data, in which the linked data consists of multiple entity pages, each containing a text description and an infobox. The method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, with the following steps:
1) In the preprocessing stage, the text description in the entity page is segmented into words; the features of the entity page consist of the infobox attribute names and these words;
2) In the statistical classification stage, statistical classification models are trained on the entity page features at several segmentation granularities and used to classify the entity page, giving a preliminary prediction of the entity class;
3) In the post-processing stage, the preliminary prediction is corrected to obtain the corrected entity class. The correction comprises the following steps:
31) using multi-granularity model fusion, the preliminary predictions of the statistical classification models trained at different segmentation granularities are fused to obtain the fused entity class;
32) a class attribute database is built, and the class-associated attribute information in it is used to correct the fused entity class, giving the attribute-corrected entity class;
33) the structure of the first sentence of the text description is analyzed with a syntactic parser, and this deep understanding of the first sentence is used to correct the attribute-corrected entity class obtained in step 32), giving the entity class corrected by first-sentence understanding.
Further, in the above method, the segmentation methods of step 1) include front-to-back (forward) maximum matching, backward maximum matching, and statistical sequence labeling.
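As an illustration of one of the segmentation methods named above, the following is a minimal sketch of forward maximum matching. The function name, the toy vocabularies, and the `max_len` parameter are assumptions made for this example; the embodiment described later uses the Stanford CoreNLP toolkit rather than a hand-written matcher.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest vocabulary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabularies: the second one also lists a named entity as one word.
plain_vocab = {"纽约", "时代", "广场"}
ner_vocab = plain_vocab | {"纽约时代广场"}

print(forward_max_match("纽约时代广场", plain_vocab, max_len=6))
print(forward_max_match("纽约时代广场", ner_vocab, max_len=6))
```

With the plain vocabulary the phrase is split into three words, while the vocabulary that contains the full named entity keeps it as one token, which mirrors the two granularities the method relies on.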
Further, in the above method, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.
Further, in the above method, the statistical classification model is a maximum entropy model. In step 31), the multi-granularity model fusion computes the fused probability distribution of the classifiers at different segmentation granularities by Formula 1, thereby fusing the entity class results obtained by classifying the entity page with the maximum entropy models trained at the several granularities:
$$P_{multi}(y \mid x) = \lambda P_w(y \mid x) + (1-\lambda) P_n(y \mid x) \qquad \text{(Formula 1)}$$
In Formula 1, $P_{multi}(y \mid x)$ is the fused probability distribution of the classifiers at different segmentation granularities; $P_w(y \mid x)$ is the distribution predicted for sample $x$ by the maximum entropy classifier that uses only segmented words as features; $y$ is the class and $x$ is the sample; $P_n(y \mid x)$ is the distribution predicted by the maximum entropy classifier that adds named entity labels to the segmented words as features; $\lambda$ is the weight of the linear interpolation.
Further, in the above method, the syntactic analysis of step 33), which yields the entity class corrected by first-sentence understanding, comprises the following steps:
331) dependency parsing is applied to the first sentence of the entity description to determine whether the object of the first sentence is the object of a judgment (copular) sentence;
332) Chinese word vectors are trained on a large-scale unannotated corpus, a word similarity is defined, and the similarity between the judgment-sentence object and the candidate word vectors is computed to find the most similar one;
333) cosine similarity is used with a preset threshold: when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar class exceeds the threshold, the class of the entity is corrected to that most similar class.
Further, in the above method, after the preliminary prediction has been corrected in the post-processing stage, a confusion matrix is used to identify difficult entity classes, and for those classes the entity class result is verified by link analysis and affix analysis. The confusion matrix criterion is: on the validation set, when the prediction precision of the statistical classification model for an entity class $y_i$ does not reach 90%, class $y_i$ is considered a difficult entity class.
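The difficult-class criterion can be sketched as follows. This is a minimal illustration that assumes "prediction precision" means per-class precision (correct predictions of a class divided by all predictions of that class); the function name and the toy validation labels are hypothetical.

```python
from collections import Counter


def difficult_classes(gold, predicted, threshold=0.9):
    """Return the classes whose precision on the validation set is below the threshold."""
    confusion = Counter(zip(gold, predicted))  # (gold, predicted) -> count
    predicted_totals = Counter(predicted)      # column sums of the confusion matrix
    hard = set()
    for cls, total in predicted_totals.items():
        precision = confusion[(cls, cls)] / total
        if precision < threshold:
            hard.add(cls)
    return hard


# Hypothetical validation labels.
gold = ["game", "game", "city", "anime", "anime"]
pred = ["game", "game", "city", "game", "anime"]
print(difficult_classes(gold, pred))
```

Classes that the model never predicts have no defined precision and are not reported by this sketch; whether they should count as difficult is a design choice the text does not settle.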
Further, the link analysis method is as follows: let y′ be the class that the classifier predicts for entity page e, and let N(e) be the set of entity pages that e links to. Among the pages in N(e) that carry a class label, find the most frequent class, denoted y*. When y* differs from the prediction y′, y* is used to correct y′, and the class of entity page e becomes y*.
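A minimal sketch of this link-analysis rule is given below. The function name is hypothetical, and two behaviors are assumptions not stated in the text: the prediction is kept when no linked page has a label, and ties between equally frequent classes are broken by first occurrence.

```python
from collections import Counter


def correct_by_links(predicted, neighbor_labels):
    """Replace the prediction y' with the majority class y* of the labeled linked pages."""
    if not neighbor_labels:
        # Assumption: with no labeled neighbors there is nothing to correct against.
        return predicted
    majority, _ = Counter(neighbor_labels).most_common(1)[0]
    # If the majority equals the prediction, this returns the prediction unchanged.
    return majority


print(correct_by_links("city", ["game", "game", "anime"]))
```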
Further, in the above method, the affix analysis method is as follows: for entity classes whose entity names end with fixed Chinese characters, the affix information associated with each entity type is learned from large-scale unlabeled data. Frequency statistics over the affixes of the words most similar to each class yield the affixes associated with the difficult entity classes, and the class of the entity is obtained by analyzing its affix.
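The affix statistics can be sketched as below. This is only an illustration under several assumptions: the affix is taken to be the final character of the word, the list of words similar to a class is supplied by hand rather than learned from unlabeled data, and the function names, class names, and `top_k` parameter are hypothetical.

```python
from collections import Counter


def learn_suffixes(similar_words, top_k=2):
    """Count the final character of each word and keep the top_k most frequent ones."""
    counts = Counter(word[-1] for word in similar_words if word)
    return [suffix for suffix, _ in counts.most_common(top_k)]


def class_from_suffix(name, class_suffixes):
    """Return the first class whose learned suffixes contain the last character of the name."""
    for cls, suffixes in class_suffixes.items():
        if name and name[-1] in suffixes:
            return cls
    return None


# Hypothetical words assumed to be close to a "river" class.
river_suffixes = learn_suffixes(["黄河", "淮河", "长江", "珠江", "海河"])
suffix_table = {"river": river_suffixes, "mountain": ["山"]}
print(river_suffixes)
print(class_from_suffix("松花江", suffix_table))
```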
The invention also provides an entity classification system for linked data that implements the above method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description in the entity page and extracts the infobox attribute names and the segmented words as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy algorithm and identifies the entity class from the description of the entity in the entity page. The post-processing module corrects the entity class produced by the statistical classification module using multi-granularity model fusion, class-associated attributes, and first-sentence understanding, giving the corrected entity class.
In the above system, the word segmentation tool is the Stanford CoreNLP toolkit, and the classification models are trained with the maximum entropy classifier package Maxent.
Compared with the prior art, the beneficial effects of the invention are:
The invention provides an entity classification method and system for linked data that achieve high-precision entity classification through a statistical classification process and a post-processing process. On top of basic text classification, the result of classifying the entity description text is corrected using the following methods:
(1) multi-granularity segmentation model fusion, which overcomes the weaknesses of a single segmentation granularity in text feature extraction;
(2) correction of the fused entity class with class-associated attribute information, which removes obvious errors;
(3) deep understanding of the first sentence, which reduces the effect of noise in the text;
(4) identification of difficult samples, whose predictions are verified by methods such as link analysis and affix analysis.
Existing entity classification methods apply no further processing to the predicted class and so cannot correct results that may be wrong, whereas the invention uses a post-processing workflow to correct the possible errors of the text-based statistical classification module. The proposed solution is easy to implement and debug, efficient, and accurate; it is well suited to enterprises performing knowledge management on linked data and can classify entities with high precision. In the JIST2015 entity classification evaluation, the solution of the invention reached an accuracy of 98.6%, the highest among the classification schemes in that evaluation.
Brief description of the drawings
Fig. 1 is a flow diagram of the entity classification method for linked data provided by the invention.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention.
Fig. 3 is a flow diagram of the first-sentence understanding step of the method provided by the invention.
Detailed description of the embodiments
The invention is further described below through embodiments with reference to the drawings; the embodiments do not limit the scope of the invention in any way.
The invention provides an entity classification method and system for linked data that achieve high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies entities by modeling their text; the post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result. Fig. 1 is a flow diagram of the method. As shown in Fig. 1, the method comprises preprocessing, statistical classification, and post-processing. The entity page is first segmented and its features extracted, and the extracted features are then used to train the statistical classification models. For the classification result, multi-granularity model fusion first corrects the prediction errors of a single model; class-associated attribute information then corrects the fused entity class, removing some obviously wrong predictions; and the first sentence of the entity page's description is then analyzed in depth to determine its class. For samples of classes that are hard to classify correctly, the class can be corrected again by link analysis and affix analysis. The specific steps are:
1) Preprocess the entity page, including Chinese word segmentation (typical methods are front-to-back (forward) maximum matching, backward maximum matching, and statistical sequence labeling) and feature extraction (word features and infobox attribute-name features are extracted to represent the page), to obtain the entity page features;
2) Using the entity page features extracted in step 1), classify the entity page with maximum entropy models at several segmentation granularities to obtain a preliminary prediction of the entity class;
In this embodiment, two classifiers are trained with the maximum entropy model. The features of one classifier are the words segmented at the granularity with named entity recognition plus the infobox attributes; the features of the other are the words segmented without named entity recognition plus the infobox attributes.
3) Post-process the entity class obtained in step 2) to check whether the classification is reliable, with the following steps:
31) fuse the classification results of the classifiers trained on features from different segmentation granularities;
In this embodiment, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition;
32) build a class attribute database in advance, and use the class-associated attribute information in it to correct obvious prediction errors;
33) deeply understand the first sentence of the text description with a syntactic parser, analyzing the sentence structure to obtain entity class information and correct the earlier prediction;
34) use a confusion matrix to identify classes that are hard to classify correctly, and further verify predictions of those classes, including:
341) correcting the entity class using the classes of the neighboring pages that the entity page links to;
342) correcting the entity class using the affix information of the entity page.
Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention. The system comprises a preprocessing module, a statistical classification module, and a post-processing module. Each module is described further below:
Preprocessing module
An entity page in linked data generally contains a text description and an infobox.
In the preprocessing module, the Stanford CoreNLP toolkit is used to segment the text description in the entity page. This embodiment uses two segmentation granularities: with named entity recognition and without. For example, with named entity recognition, "纽约时代广场" (New York Times Square) is treated as a single word, whereas without it the phrase is split into three words: "纽约" (New York), "时代" (Times), and "广场" (Square).
After the Chinese text is segmented, the infobox attribute names and the segmented words are extracted together as the feature representation of the entity page.
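The feature representation described above can be sketched as a set of presence features. The `word=` and `infobox=` prefixes, the function name, and the sample page are assumptions made for this illustration; the text only states that words and infobox attribute names are used together.

```python
def page_features(tokens, infobox_attrs):
    """Combine bag-of-words features with infobox attribute-name features."""
    features = {f"word={token}" for token in tokens}
    # Prefixing keeps an attribute name distinct from an identical word in the text.
    features |= {f"infobox={attr}" for attr in infobox_attrs}
    return sorted(features)


# Hypothetical entity page: segmented description plus infobox attribute names.
tokens = ["纽约", "时代", "广场", "纽约"]
attrs = ["位置", "面积"]
print(page_features(tokens, attrs))
```

Building the same representation once from each segmentation granularity gives the two feature sets that the two classifiers are trained on.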
Statistical classification module
The invention mainly uses the description of the entity in the entity page as the basis for determining the entity class. It trains the classification models with the maximum entropy algorithm, a log-linear model commonly used in natural language processing. As mentioned for the preprocessing module, the features used in the statistical classification module are word features and infobox attribute features. The word features are the classical bag-of-words representation. The infobox attribute features play an important role in identifying the entity class; for example, "date of birth" is very likely to be associated with entities of the person type.
In the text classification module, words segmented at different granularities are used to train the text classification models, because in some cases a single segmentation granularity cannot meet the needs of classification. For example, if "New York Times Square" is treated as one named entity, it is less useful for classification than the split into "New York", "Times", and "Square", because the word "Square" has a decisive influence on the class. On the other hand, without named entity recognition a name such as "Zhang Yishan" (张一山) is split into "张", "一", and "山", which also harms the classification result. Therefore, in the statistical classification module, this embodiment trains two classification models with the maximum entropy classifier package Maxent (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html): one on the fine-grained segmentation with named entity recognition and one on simple coarse-grained word segmentation.
Post-processing module
The invention uses the post-processing module to correct the possible errors of the text-based statistical classification module. The post-processing module performs the following procedures:
31) Multi-granularity model fusion
Model fusion is widely used in machine learning, but most fusion methods combine different kinds of machine learning models. For natural language, and Chinese in particular, the segmentation granularity affects the performance of the whole model. Because each granularity has its own strengths and weaknesses, the invention uses model fusion to let the classification models obtained at the various granularities complement one another.
Define $P_w(y \mid x)$ as the probability distribution over class $y$ that the maximum entropy classifier predicts for sample $x$ using only segmented words as features, and $P_n(y \mid x)$ as the distribution predicted by the maximum entropy classifier that adds named entity labels to the segmented words as features. The results of the two classifiers are fused as follows:
$$P_{multi}(y \mid x) = \lambda P_w(y \mid x) + (1-\lambda) P_n(y \mid x) \qquad \text{(Formula 1)}$$
In Formula 1, $P_{multi}(y \mid x)$ is the fused probability distribution of the classifiers at different segmentation granularities; $P_w(y \mid x)$ is the distribution predicted for sample $x$ by the maximum entropy classifier that uses only segmented words as features; $y$ is the class and $x$ is the sample; $P_n(y \mid x)$ is the distribution predicted by the maximum entropy classifier that adds named entity labels to the segmented words as features; $\lambda$ is the weight of the linear interpolation, set to $\lambda = 0.5$ in this embodiment.
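The linear interpolation of Formula 1 can be sketched as follows, assuming each classifier returns its prediction as a dictionary from class name to probability. The function name and the toy probabilities are hypothetical; only the interpolation rule and the value 0.5 come from the text.

```python
def fuse(p_w, p_n, lam=0.5):
    """Formula 1: lam * P_w(y|x) + (1 - lam) * P_n(y|x) for every class y."""
    classes = set(p_w) | set(p_n)
    return {y: lam * p_w.get(y, 0.0) + (1 - lam) * p_n.get(y, 0.0) for y in classes}


# Hypothetical outputs of the two maximum entropy classifiers for one entity page.
p_w = {"game": 0.6, "anime": 0.4}  # word features only
p_n = {"game": 0.3, "anime": 0.7}  # word features plus named entity labels
fused = fuse(p_w, p_n)
best = max(fused, key=fused.get)
print(best)
```

Because both inputs sum to one and the weights sum to one, the fused values also form a probability distribution, so the class with the highest fused value can be taken directly as the fused prediction.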
32) Correcting predictions with class-associated attributes
This module uses class-associated attributes to correct some obviously wrong class predictions. It relies mainly on the class specificity of infobox attributes. As shown in Table 1, certain attributes cannot be associated with certain classes. For example, "gaming platform" cannot be associated with a city entity. This specificity can therefore be used to correct obvious prediction errors of the classifier. For the predefined entity types, the invention builds the class attribute database manually and uses it to correct predictions.
Table 1. Examples of class-associated attributes
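One possible way to apply such a database is sketched below. The text does not say which class replaces a rejected prediction, so falling back to the most probable class that is compatible with all infobox attributes is an assumption, as are the function name, the dictionary layout of the database, and the sample values.

```python
def correct_with_attributes(fused, infobox_attrs, incompatible):
    """Pick the most probable class that no infobox attribute rules out."""
    banned = set()
    for attr in infobox_attrs:
        banned |= incompatible.get(attr, set())
    allowed = {y: p for y, p in fused.items() if y not in banned}
    if not allowed:
        # Assumption: if every class is ruled out, keep the original distribution.
        allowed = fused
    return max(allowed, key=allowed.get)


# Hypothetical database entry: "gaming platform" cannot belong to a city or a person.
incompatible = {"gaming platform": {"city", "person"}}
fused = {"city": 0.5, "game": 0.4, "person": 0.1}
print(correct_with_attributes(fused, ["gaming platform"], incompatible))
```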
33) Deeply understanding the first sentence of the entity description with a dependency parser to identify the entity class more precisely
The first sentence of an entity page in linked data (for example Wikipedia or Baidu Baike) is usually a defining description of the entity (for example: "'Pounding Six' is a card game popular in Tianjin"). If the first sentence of the entity description can be understood deeply, it helps greatly in identifying the entity class precisely.
Fig. 3 is a flow diagram of the first-sentence understanding step of the method provided by the invention. The invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the entity page's text description, and then uses that object to analyze the class of the entity page. The steps are as follows:
331) Identifying the judgment-sentence object
The invention uses the Stanford dependency parser to parse the first sentence of the entity description and identify its subject, predicate, and object. If the object of the first sentence obtained from the dependency parse has a direct dependency relation with the copula "是" ("is"), the object is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".
If the object of the first sentence of the entity text is a judgment-sentence object, it can be used as a clue to determine the class of the entity and thereby verify whether the classifier's prediction is accurate. If the classifier's prediction contradicts the conclusion drawn from the judgment-sentence object, that conclusion is used to correct the classifier's prediction. If the first sentence contains no judgment-sentence object, this step is skipped and processing continues with 34).
For example, in the sentence "'Pounding Six' is a card game popular in Tianjin", dependency parsing finds that "game" is the object of the sentence and that "game" has a direct dependency relation with "是" ("is"), so "game" is the judgment-sentence object of this sentence. If "game" is a predefined entity class in the entity classification system, it is used as the class of this entity.
332) Correcting the class prediction with the judgment-sentence object
In some cases, even when a judgment-sentence object has been found, it cannot be used arbitrarily to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a class name. For example, in "野泽雅子 (Masako Nozawa) is a famous Japanese voice actor", dependency parsing identifies "voice actor" as the judgment-sentence object, but the predefined entity classes very likely do not include a "voice actor" class. The invention therefore defines a correction condition: it uses word similarity, that is, the cosine similarity between word vectors trained on a large-scale unannotated corpus, to find the class most similar to the judgment-sentence object, so that the class can be corrected reliably.
In natural language processing, cosine similarity between word vectors is commonly taken as the semantic similarity of words. Specifically, this embodiment first uses the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) to train Chinese word vectors on the Chinese Gigaword corpus (a publicly available dataset), and then uses the trained vectors to find the class name semantically most similar to the judgment-sentence object. If the cosine similarity between the word vector of the judgment-sentence object and that of its most similar class exceeds a preset threshold (0.9 in this embodiment), the class of the entity is corrected to that most similar class.
To this end, let $w_0$ be the judgment-sentence object of the first sentence of the entity page's text description, let $y \in Y$ be a class ($Y$ is the set of entity classes), and let $\operatorname{sim}(w_1, w_2)$ be the cosine similarity of the word vectors of words $w_1$ and $w_2$. The correction condition is given by Formula 2:
$$y^* = \arg\max_{y \in Y} \operatorname{sim}(w_0, y) \;\wedge\; \operatorname{sim}(w_0, y^*) > 0.9 \qquad \text{(Formula 2)}$$
In Formula 2, $\wedge$ denotes logical "and". The left-hand term states that $y^*$ is the class with the highest semantic similarity to $w_0$; the right-hand term requires the similarity between $y^*$ and $w_0$ to exceed 0.9. The correction is applied only when the condition in Formula 2 holds, that is, only when $y^*$ is the most similar class and its similarity to $w_0$ exceeds 0.9; in that case $y^*$ replaces the original class prediction.
In previous example (" wild damp refined son is that Japanese famous sound is excellent "), we can find out the class most like with " sound is excellent " Be not " performer " (as shown in table 2, table 2 be utilize that the term vector trained from Chinese gigaword calculates with classification phase As some vocabulary, wherein runic vocabulary shows that the similarity of these vocabulary and classification is more than 0.9), and find " performer " and " sound Excellent " semantic similarity more than 0.9, therefore, the classification of " wild damp refined son " this physical page is modified to " performer ".
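The correction condition of formula 2 can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation: the two-dimensional vectors are made-up toy values standing in for word vectors trained on Chinese Gigaword, and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correct_category(w0: str, predicted: str, categories: Iterable[str],
                     vectors: Dict[str, List[float]],
                     threshold: float = 0.9) -> str:
    """Apply formula 2: replace the prediction by the most similar category y*
    only if sim(w0, y*) exceeds the threshold; otherwise keep the prediction."""
    candidates = [y for y in categories if y in vectors]
    if w0 not in vectors or not candidates:
        return predicted
    y_star = max(candidates, key=lambda y: cosine(vectors[w0], vectors[y]))
    if cosine(vectors[w0], vectors[y_star]) > threshold:
        return y_star
    return predicted


# Toy vectors (assumed values for illustration only).
vectors = {"voice actor": [0.9, 0.1], "actor": [1.0, 0.0],
           "city": [0.0, 1.0], "lake": [0.6, 0.8]}
categories = ["actor", "city"]
print(correct_category("voice actor", "city", categories, vectors))
```

With these toy values "voice actor" is closest to "actor" with a similarity above 0.9, so the prediction is replaced; "lake" is closest to "city" with a similarity of only 0.8, so a prediction for it would be left unchanged.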
Table 2. Words most similar to each category
34) Identifying difficult samples using a confusion matrix
In practice, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, for entities of the two categories "city" and "scenic spot", the classifier often makes wrong predictions because the descriptions and message-box attributes of these two kinds of entities are very similar. To improve classification precision, the present invention uses a confusion matrix to find the sample categories on which the classifier is prone to error. Specifically, if on the validation set the precision of the statistical classification model for an entity category y_i does not reach 90%, category y_i is regarded as a difficult sample category. For example, if on the validation set the model predicts 18 entity pages as "city" but only 15 of those pages truly belong to "city", the model's precision on "city" is only 83.33% (15/18), and "city" is identified as a difficult sample category. Samples that the statistical classification model predicts to belong to a difficult sample category are called difficult samples.
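The per-category precision check can be sketched in Python as follows. The sketch assumes the validation set is given as two parallel lists of gold and predicted labels; the function name and the toy data (which reproduce the 15/18 example above) are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def difficult_categories(gold: List[str], predicted: List[str],
                         min_precision: float = 0.9) -> Dict[str, float]:
    """Return {category: precision} for categories whose precision on the
    validation set is below min_precision."""
    confusion = Counter(zip(gold, predicted))   # (gold, predicted) -> count
    predicted_totals = Counter(predicted)       # predicted -> count
    hard = {}
    for category, total in predicted_totals.items():
        precision = confusion[(category, category)] / total
        if precision < min_precision:
            hard[category] = precision
    return hard


# 18 pages predicted as "city", of which 15 are truly "city";
# 20 pages predicted as "scenic spot", all correct.
gold = ["city"] * 15 + ["scenic spot"] * 3 + ["scenic spot"] * 20
pred = ["city"] * 18 + ["scenic spot"] * 20
hard = difficult_categories(gold, pred)
print({c: round(p, 4) for c, p in hard.items()})
```

Any page that the classifier assigns to a category returned by this function would then be treated as a difficult sample and passed to the verification steps.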
For the difficult sample identified, we make use of following two method to verify result.
341) link analysis
For difficulty sample, depend merely on the content in physical page and may be not enough to make correct judgement, therefore, the present invention Have employed link analysis method and difficulty sample is carried out classification results checking.
In link data, physical page would generally be linked to other physical page relative.Generally come Saying, the classification of its other physical page being linked to is very likely to the identical of classification with itself.Therefore, one is utilized The classification of other physical page that physical page is linked to, can preferably judge the classification of this entity with help system.
Specifically, for certain physical page e, we analyze the physical page that e is linked, and its set is designated as N (e).N E () understands some page have classification markup information.The present invention finds out the page having classification to mark in N (e), and counts this Classification y* that a little pages are most, it is judged that the category is the most consistent with the class prediction y ' that e is made by grader.As result differs Cause, utilize y* to revise the result of y '.
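The link-analysis check can be sketched in Python as a majority vote over the annotated neighbours. This is an illustration only: the function name, the page identifiers, and the handling of ties (the first category to reach the top count wins) and of pages with no annotated neighbours (the prediction is kept) are assumptions not specified in the text.

```python
from collections import Counter
from typing import Dict, Iterable


def link_vote(predicted: str, linked_pages: Iterable[str],
              labels: Dict[str, str]) -> str:
    """Return the most frequent category y* among the annotated pages in N(e);
    fall back to the classifier's prediction y' if none are annotated."""
    votes = Counter(labels[p] for p in linked_pages if p in labels)
    if not votes:
        return predicted
    y_star, _count = votes.most_common(1)[0]
    return y_star


# Toy data: three of the four linked pages carry a category annotation.
labels = {"page_a": "scenic spot", "page_b": "scenic spot", "page_c": "city"}
print(link_vote("city", ["page_a", "page_b", "page_c", "page_d"], labels))
```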
342) Affix analysis
For some hard-to-distinguish samples, the present invention also uses affix analysis to verify the classification result. For some categories, entity names usually end with fixed Chinese characters. For example, "city" entities usually end with the characters for "city" or "county", and "scenic spot" entities usually end with the characters for "lake", "mountain", and so on. Table 3 lists examples of common entity affixes by category.
Table 3. Common entity affixes by category
Scenic spot: mountain, lake, city, road, scenery, river, ridge, cave
City: district, city, county
The present invention is the first to propose learning the affix information associated with entity types from large-scale unannotated data. Specifically, word vectors are trained on the Chinese Gigaword dataset with the word2vec toolkit, and cosine similarity is then used to find the words semantically closest to each category (words whose word-vector cosine similarity with the category exceeds 0.7). Frequency statistics are then computed over the affixes of the closest words of each of the two categories, which yields the affixes associated with each difficult sample category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the affix s of an entity page occurs in a category y1 with a frequency significantly higher (more than 2 times) than its frequency in another category y2, y1 is taken as the category of the entity and replaces the original prediction. For example, for the entity page "Mount Lushan Fairy Cave", the affix "cave" occurs in the "scenic spot" category far more frequently than in the "city" category, so the predicted category of this entity is corrected to "scenic spot".
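The affix comparison can be sketched in Python as follows. This is a simplified illustration under several assumptions: the affix is taken to be the final character of a name, raw counts are used as the frequency, and the short word lists are made-up toy data standing in for the words that word2vec would return with similarity above 0.7. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def affix_counts(similar_words: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count the final character of each similar word, per category."""
    return {category: Counter(word[-1] for word in words if word)
            for category, words in similar_words.items()}


def correct_by_affix(name: str, predicted: str, counts: Dict[str, Counter],
                     y1: str, y2: str, ratio: float = 2.0) -> str:
    """Switch to y1 (or y2) if the name's affix is more than `ratio` times as
    frequent in that category as in the other; otherwise keep the prediction."""
    affix = name[-1]
    f1, f2 = counts[y1][affix], counts[y2][affix]
    if f1 > ratio * f2:
        return y1
    if f2 > ratio * f1:
        return y2
    return predicted


# Toy stand-ins for the words most similar to each category.
similar_words = {
    "scenic spot": ["西湖", "太湖", "泰山", "黄山", "溶洞"],
    "city": ["北京市", "天津市", "大兴区", "密云县"],
}
counts = affix_counts(similar_words)
print(correct_by_affix("庐山仙人洞", "city", counts, "scenic spot", "city"))
```

An affix that appears in neither list leaves the prediction unchanged, since neither frequency exceeds twice the other.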
It should be noted that the embodiments are disclosed to aid further understanding of the present invention; those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the embodiments disclose; the scope of protection claimed is defined by the claims.

Claims (10)

1. An entity classification method for linked data, the linked data being a plurality of entity pages, each entity page comprising a text description and a message box; the entity classification method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, and specifically comprises the following steps:
1) in the preprocessing stage, performing word segmentation on the text description in the entity page to obtain word information; the features of the entity page consist of the attribute names of the message box and the word information;
2) in the statistical classification stage, using the features of the entity pages to train statistical classification models at multiple segmentation granularities and to classify the entity pages, obtaining preliminary predictions of the entity category;
3) in the post-processing stage, correcting the preliminary predictions of the entity category to obtain the corrected entity category; the correction comprises the following steps:
31) fusing, by a multi-granularity model fusion method, the preliminary entity-category predictions obtained from the statistical classification models trained at the multiple segmentation granularities, obtaining the fused entity category result;
32) building a category attribute database, and using the category-associated attribute information in the category attribute database to correct the fused entity category, obtaining the entity category corrected by category-associated attributes;
33) analyzing sentence structure by syntactic parsing and, through deep understanding of the first sentence of the text description, correcting the entity category obtained in step 32), obtaining the entity category information corrected by deep understanding of the first sentence.
2. The entity classification method for linked data as claimed in claim 1, characterized in that the word segmentation method of step 1) includes the forward maximum matching method, the backward maximum matching method, and a method based on statistical sequence labeling.
3. The entity classification method for linked data as claimed in claim 1, characterized in that step 2) uses two segmentation granularities: a segmentation granularity with named entity recognition and a segmentation granularity without named entity recognition.
4. The entity classification method for linked data as claimed in claim 1, characterized in that the statistical classification model is a maximum entropy model; the multi-granularity model fusion method of step 31) uses formula 1 to fuse the probability distributions predicted by the classifiers of different segmentation granularities, thereby fusing the entity category results obtained by classifying the entity pages with the maximum entropy classification models trained at the multiple segmentation granularities:
$$P_{multi}(y \mid x) = \lambda P_{w}(y \mid x) + (1 - \lambda) P_{n}(y \mid x) \qquad \text{(formula 1)}$$
In formula 1, P_multi(y|x) is the fused probability distribution of the predictions of the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model whose features add named-entity annotations on top of word segmentation; λ is the parameter that adjusts the linear interpolation weight.
5. The entity classification method for linked data as claimed in claim 1, characterized in that step 33), in which sentence structure is analyzed by syntactic parsing to obtain the entity category information corrected by deep understanding of the first sentence, specifically comprises the following steps:
331) performing dependency parsing on the first sentence of the entity description and identifying whether the object of the first sentence is a judgment-sentence object;
332) training Chinese word vectors on a large-scale unannotated corpus, defining word similarity, and computing the word similarity between the word vectors and the judgment-sentence object to obtain the word vector with the highest word similarity;
333) computing cosine similarity and setting a cosine similarity threshold; when the cosine similarity between the word vectors of the judgment-sentence object and its most similar category exceeds the cosine similarity threshold, correcting the category of the entity to the most similar category.
6. The entity classification method for linked data as claimed in claim 1, characterized in that, after the preliminary predictions of the entity category are corrected in the post-processing stage and the corrected entity category is obtained, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category results are verified by a link analysis method and an affix analysis method; the confusion-matrix identification method is specifically: on the validation set, when the precision of the statistical classification model for an entity category y_i does not reach 90%, category y_i is regarded as a difficult entity category.
7. The entity classification method for linked data as claimed in claim 6, characterized in that the link analysis method is specifically: let y' be the category that the classifier predicts for entity page e, and let N(e) be the set of entity pages linked from entity page e; find the pages in N(e) that carry category annotations and determine the category that occurs most often among them, denoted y*; when category y* is inconsistent with the predicted category y', use y* to correct y', so that the category of entity page e is y*.
8. The entity classification method for linked data as claimed in claim 6, characterized in that the affix analysis method is specifically: for entity categories whose entity names end with fixed Chinese characters, the affix information associated with entity types is learned from large-scale unannotated data; frequency statistics are computed over the affixes of the semantically closest words of each category to obtain the affixes associated with the difficult entity categories; the category of the entity is obtained by analyzing its affix.
9. An entity classification system for linked data that implements the entity classification method for linked data of any one of claims 1 to 8, characterized by comprising a preprocessing module, a statistical classification module, and a post-processing module;
the preprocessing module performs word segmentation on the text description in the entity page and extracts the message-box attribute names and the words obtained by segmentation as features, which form the feature representation of the entity page;
the statistical classification module trains a classification model using the maximum entropy classification algorithm and uses the description information of the entity in the entity page to identify the entity category;
the post-processing module corrects the entity category obtained by the statistical classification module using multi-granularity model fusion, category-associated attributes, and deep understanding of the first sentence, obtaining the corrected entity category.
10. The entity classification system for linked data as claimed in claim 9, characterized in that the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier software package Maxent.
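As an illustration of the linear interpolation in formula 1 of claim 4, the fusion of the two classifiers' probability distributions can be sketched in Python as follows. The function names, the toy probabilities, and the value λ = 0.5 are assumptions made for this sketch; the claims do not fix a value for λ.

```python
from typing import Dict


def fuse(p_w: Dict[str, float], p_n: Dict[str, float],
         lam: float = 0.5) -> Dict[str, float]:
    """Formula 1: P_multi(y|x) = lam * P_w(y|x) + (1 - lam) * P_n(y|x)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    categories = set(p_w) | set(p_n)
    return {y: lam * p_w.get(y, 0.0) + (1.0 - lam) * p_n.get(y, 0.0)
            for y in categories}


def predict(p_w: Dict[str, float], p_n: Dict[str, float],
            lam: float = 0.5) -> str:
    """Return the category with the highest fused probability."""
    fused = fuse(p_w, p_n, lam)
    return max(fused, key=fused.get)


# Toy distributions for one sample x (assumed values).
p_w = {"city": 0.6, "scenic spot": 0.4}   # word-segmentation features only
p_n = {"city": 0.3, "scenic spot": 0.7}   # with named-entity annotations
print(predict(p_w, p_n, lam=0.5))
```

In practice λ would be tuned on held-out data; with λ = 1 the fused prediction reduces to the word-only model and with λ = 0 to the model with named-entity features.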
CN201610213411.8A 2016-04-07 2016-04-07 A kind of entity classification method and system towards link data Expired - Fee Related CN105912625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610213411.8A CN105912625B (en) 2016-04-07 2016-04-07 A kind of entity classification method and system towards link data

Publications (2)

Publication Number Publication Date
CN105912625A true CN105912625A (en) 2016-08-31
CN105912625B CN105912625B (en) 2019-05-14

Family

ID=56745356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610213411.8A Expired - Fee Related CN105912625B (en) 2016-04-07 2016-04-07 A kind of entity classification method and system towards link data

Country Status (1)

Country Link
CN (1) CN105912625B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645064A (en) * 2008-12-16 2010-02-10 中国科学院声学研究所 Superficial natural spoken language understanding system and method thereof
CN104408148A (en) * 2014-12-03 2015-03-11 复旦大学 Field encyclopedia establishment system based on general encyclopedia websites
CN104484461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Method and system based on encyclopedia data for classifying entities
US20150324454A1 (en) * 2014-05-12 2015-11-12 Diffeo, Inc. Entity-centric knowledge discovery
CN102436456B (en) * 2010-09-29 2016-03-30 国际商业机器公司 For the method and apparatus of classifying to named entity
US20160092476A1 (en) * 2014-09-26 2016-03-31 Oracle International Corporation Declarative external data source importation, exportation, and metadata reflection utilizing http and hdfs protocols

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU CHUNLIANG: ""Entity modeling and search in Text"", 《HTTPS://STATIC.CHUNLIANGLYU.COM/DOCS/THESIS.PDF》 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506362A (en) * 2016-11-23 2017-12-22 上海大学 Image classification based on customer group optimization imitates brain storage method
CN110168579A (en) * 2016-11-23 2019-08-23 启创互联公司 For using the system and method for the representation of knowledge using Machine learning classifiers
CN107506362B (en) * 2016-11-23 2021-02-23 上海大学 Image classification brain-imitation storage method based on user group optimization
CN108241650B (en) * 2016-12-23 2020-08-11 北京国双科技有限公司 Training method and device for training classification standard
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN106777275B (en) * 2016-12-29 2018-03-06 北京理工大学 Entity attribute and property value extracting method based on more granularity semantic chunks
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
WO2018166499A1 (en) * 2017-03-17 2018-09-20 腾讯科技(深圳)有限公司 Text classification method and device, and storage medium
CN108628873A (en) * 2017-03-17 2018-10-09 腾讯科技(北京)有限公司 A kind of file classification method, device and equipment
CN107515858B (en) * 2017-09-01 2020-10-20 鼎富智能科技有限公司 Text classification post-processing method, device and system
CN107515858A (en) * 2017-09-01 2017-12-26 北京神州泰岳软件股份有限公司 A kind of text classification post-processing approach, apparatus and system
CN107885719A (en) * 2017-09-20 2018-04-06 北京百度网讯科技有限公司 Vocabulary classification method for digging, device and storage medium based on artificial intelligence
CN107885719B (en) * 2017-09-20 2021-06-11 北京百度网讯科技有限公司 Vocabulary category mining method and device based on artificial intelligence and storage medium
CN108415902A (en) * 2018-02-10 2018-08-17 合肥工业大学 A kind of name entity link method based on search engine
CN108415902B (en) * 2018-02-10 2021-10-26 合肥工业大学 Named entity linking method based on search engine
CN110309255A (en) * 2018-03-07 2019-10-08 同济大学 A kind of entity search method for incorporating entity description distribution and indicating
CN108563722A (en) * 2018-04-03 2018-09-21 有米科技股份有限公司 Trade classification method, system, computer equipment and the storage medium of text message
CN108921213A (en) * 2018-06-28 2018-11-30 国信优易数据有限公司 A kind of entity classification model training method and device
CN108921213B (en) * 2018-06-28 2021-06-22 国信优易数据股份有限公司 Entity classification model training method and device
US11526663B2 (en) 2018-09-07 2022-12-13 Baidu Online Network Technology (Beijing) Co., Ltd. Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN109284374B (en) * 2018-09-07 2024-07-05 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable storage medium for determining entity class
CN109284374A (en) * 2018-09-07 2019-01-29 百度在线网络技术(北京)有限公司 For determining the method, apparatus, equipment and computer readable storage medium of entity class
CN113169890A (en) * 2018-09-28 2021-07-23 普缪蒂夫有限公司 Fast and efficient classification system
CN110020024A (en) * 2019-03-15 2019-07-16 叶宇铭 Classification method, system, the equipment of link resources in a kind of scientific and technical literature
CN110020024B (en) * 2019-03-15 2021-07-30 中国人民解放军军事科学院军事科学信息研究中心 Method, system and equipment for classifying link resources in scientific and technological literature
CN111798941A (en) * 2019-04-04 2020-10-20 Iqvia 有限公司 Predictive system for generating clinical queries
CN111798941B (en) * 2019-04-04 2023-10-13 Iqvia 有限公司 Predictive system for generating clinical queries
CN112115240A (en) * 2019-06-21 2020-12-22 百度在线网络技术(北京)有限公司 Classification processing method, classification processing device, server and storage medium
CN110457467A (en) * 2019-07-02 2019-11-15 厦门美域中央信息科技有限公司 A kind of information technology file classification method based on gauss hybrid models
CN111538813B (en) * 2020-04-26 2023-05-16 北京锐安科技有限公司 Classification detection method, device, equipment and storage medium
CN111538813A (en) * 2020-04-26 2020-08-14 北京锐安科技有限公司 Classification detection method, device, equipment and storage medium
CN111737416B (en) * 2020-06-29 2022-08-19 重庆紫光华山智安科技有限公司 Case processing model training method, case text processing method and related device
CN111737416A (en) * 2020-06-29 2020-10-02 重庆紫光华山智安科技有限公司 Case processing model training method, case text processing method and related device
CN112214572B (en) * 2020-10-20 2022-11-01 山东浪潮科学研究院有限公司 Method for secondarily extracting entities in resume analysis
CN112214572A (en) * 2020-10-20 2021-01-12 济南浪潮高新科技投资发展有限公司 Method for secondarily extracting entities in resume analysis
CN113094567A (en) * 2021-03-31 2021-07-09 四川新网银行股份有限公司 Malicious complaint identification method and system based on text clustering
CN113343697A (en) * 2021-06-15 2021-09-03 中国科学院软件研究所 Network protocol entity extraction method and system based on small sample learning
CN113297851A (en) * 2021-06-21 2021-08-24 北京富通东方科技有限公司 Recognition method for confusable sports injury entity words
CN113297851B (en) * 2021-06-21 2024-03-05 北京富通东方科技有限公司 Identification method for confusable sports injury entity words

Also Published As

Publication number Publication date
CN105912625B (en) 2019-05-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190514