CN105912625A - Linked data oriented entity classification method and system

Info

Publication number: CN105912625A
Application number: CN201610213411.8A
Authority: CN
Inventors: 葛涛; 穗志方
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-04-07
Filing date: 2016-04-07
Publication date: 2016-08-31
Anticipated expiration: 2036-04-07
Also published as: CN105912625B

Abstract

The invention discloses an entity classification method and system for linked data, addressing the problem of classifying entities in linked data. The method comprises preprocessing, statistical classification, and post-processing. Preprocessing: the text description in an entity page is segmented into words, and the infobox attribute names together with the resulting words form the features of the entity page. Statistical classification: statistical classification models are trained at multiple segmentation granularities to classify the entity page, yielding a preliminary prediction of the entity category. Post-processing: the statistical classification result is corrected; the fused entity category is corrected through model fusion, linguistic knowledge, link information, and category-associated attribute information. The method and system are easy to implement and debug, efficient, and accurate; they can be used for knowledge management of linked data and achieve high-precision entity classification.

Description

An entity classification method and system for linked data

Technical field

The invention belongs to the field of information processing and relates to the classification and retrieval of linked data, and in particular to a method and system for high-precision classification of entity pages in linked data.

Background art

We are now in the era of big data, and how to make the fullest use of data to help computers process information has become one of the most popular research topics in information processing. In recent years, with the arrival of the Web 2.0 era, linked data (such as the Semantic Web and knowledge graphs) has attracted wide attention because of its powerful ability to describe relations. Linked data refers to data organized in the manner of Baidu Baike or Wikipedia: each page corresponds to one entity, and entities link to one another, hence the name linked data. As the data keeps growing, managing linked data manually has become unrealistic, and efficient methods and systems for knowledge management of linked data are urgently needed.

Entity classification is an important technical problem in knowledge management of linked data. Classifying the entities in linked data makes it possible to organize the large number of entity pages effectively, thereby improving the user's search and reading experience.

At present, the common approach to entity classification is to classify the description text of the entity. In many cases, however, this simple approach cannot determine the entity's category accurately. Its shortcomings are mainly the following:

(1) Although judging an entity's category from a text description is easy for a person, it is unrealistic to expect current feature-based statistical classifiers to determine the category with high accuracy from the text alone. For example, the texts "X is an animation adapted from a famous game" and "A is a game made from a famous animation" are very similar at the lexical level, yet the former describes an animation entity and the latter a game entity, so the entity types are entirely different. Statistical classification based purely on text features therefore lacks precision and cannot obtain the entity category accurately.

(2) Many entity pages do not contain enough text description. In such cases, classifying entities using the text description alone inevitably leads to classification errors, and the entity category cannot be obtained from the text.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides an entity classification method and system for linked data that achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information. The post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result, including model fusion, linguistic knowledge, link information, and the use of category-associated attribute information to correct the fused entity category.

An entity page in linked data generally contains a text description and an infobox. The invention segments the text description and extracts the infobox attribute names together with the resulting words as the feature representation of the entity page. The entity page is then classified with maximum entropy models at multiple segmentation granularities, giving a preliminary prediction of the entity category. The predicted category is then post-processed to verify whether the classification result is reliable. Post-processing specifically includes: fusing the classification results of the classifiers trained on features of different segmentation granularities; correcting obvious prediction errors using the category-associated attribute information in a category attribute database; and deeply understanding the first sentence of the text description by analyzing its structure with methods such as syntactic parsing, obtaining category information that is used to correct the earlier prediction. Preferably, a confusion matrix is also used to identify categories that are hard to classify correctly, and predictions of these categories are verified further, by correcting the entity category using the categories of the neighboring pages that the entity page links to and using the affix information of the entity page.

The technical solution provided by the invention is as follows:

An entity classification method for linked data, wherein the linked data consists of multiple entity pages, each containing a text description and an infobox. The method comprises a preprocessing stage, a statistical classification stage, and a post-processing stage, and specifically includes the following steps:

1) In the preprocessing stage, the text description in the entity page is segmented into words; the infobox attribute names and the resulting words form the features of the entity page;

2) In the statistical classification stage, statistical classification models are trained on the entity page features at multiple segmentation granularities and used to classify the entity page, giving a preliminary prediction of the entity category;

3) In the post-processing stage, the preliminary prediction of the entity category is corrected to obtain the corrected entity category; the correction comprises the following steps:

31) Using multi-granularity model fusion, the preliminary predictions of the statistical classification models trained at different segmentation granularities are fused, giving the fused entity category;

32) A category attribute database is built, and the category-associated attribute information in it is used to correct the fused entity category, giving the attribute-corrected entity category;

33) The structure of the first sentence of the text description is analyzed with a syntactic parsing method; through this deep understanding of the first sentence, the attribute-corrected entity category obtained in step 32) is corrected, giving the entity category corrected by first-sentence understanding.

In the above entity classification method for linked data, further, the word segmentation methods of step 1) include forward maximum matching, backward maximum matching, and statistical sequence labeling.
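As an illustration of one of the segmentation methods named above, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the tiny vocabulary, and the `max_len` parameter are assumptions made for this example and are not taken from the patent; a real system would use a full dictionary or a statistical segmenter.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word (up to max_len characters); fall back to one character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary: 纽约 (New York), 时代 (Times), 广场 (Square).
vocab = {"纽约", "时代", "广场", "纽约时代广场"}
coarse = forward_max_match("纽约时代广场", vocab, max_len=6)
fine = forward_max_match("纽约时代广场", vocab, max_len=2)
```

In this sketch the maximum word length acts as a crude granularity control: with `max_len=6` the whole place name is kept as one token, while with `max_len=2` it is split into three words.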

In the above entity classification method for linked data, further, step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.

In the above entity classification method for linked data, further, the statistical classification model is a maximum entropy model. The multi-granularity model fusion of step 31) computes, by Formula 1, the fused probability distribution predicted by the classifiers of different segmentation granularities, thereby fusing the entity category results that the maximum entropy models trained at the different granularities produce for the entity page:

P_multi(y|x) = λ P_w(y|x) + (1 - λ) P_n(y|x)    (Formula 1)

In Formula 1, P_multi(y|x) is the fused probability distribution predicted by the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features; λ is the parameter that adjusts the linear interpolation weight.

In the above entity classification method for linked data, further, step 33), which analyzes sentence structure with a syntactic parsing method to obtain the entity category corrected by first-sentence understanding, specifically includes the following steps:

331) Perform dependency parsing on the first sentence of the entity description and identify whether the object of the first sentence is a judgment-sentence object;

332) Train Chinese word vectors on a large-scale unannotated corpus, define word similarity, compute the similarity between the word vector of each category and that of the judgment-sentence object, and obtain the word vector with the highest similarity;

333) Using cosine similarity with a preset threshold, when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds the threshold, the category of the entity is corrected to the most similar category.

In the above entity classification method for linked data, further, after the preliminary prediction of the entity category has been corrected in the post-processing stage to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the identified difficult entity categories, the entity category result is verified by link analysis and affix analysis. The confusion-matrix identification is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i does not reach 90%, category y_i is regarded as a difficult entity category.

Further, the link analysis method is specifically: let y′ be the category the classifier predicts for entity page e, and let N(e) be the set of entity pages that e links to; find the pages in N(e) that carry category labels and determine the most frequent category among them, denoted y*; when y* differs from the prediction y′, y′ is corrected with y*, so that the category of entity page e is y*.
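The link analysis rule above amounts to a majority vote over the labeled neighbors of a page. The following is a minimal sketch under stated assumptions: the function name `link_vote`, the dictionary of known labels, and the behavior when no neighbor is labeled (keep the original prediction) are choices made for this example; the patent does not say how ties are broken, and here the category encountered first wins.

```python
from collections import Counter


def link_vote(predicted, neighbors, labels):
    """Return y*, the most frequent category among labeled pages in N(e);
    keep the classifier's prediction y' when no neighbor is labeled."""
    labeled = [labels[n] for n in neighbors if n in labels]
    if not labeled:
        return predicted
    y_star, _count = Counter(labeled).most_common(1)[0]
    return y_star  # equals the prediction when they already agree


# Hypothetical page identifiers and labels, for illustration only.
labels = {"page_a": "scenic spot", "page_b": "scenic spot", "page_c": "city"}
corrected = link_vote("city", ["page_a", "page_b", "page_c", "page_d"], labels)
```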

In the above entity classification method for linked data, further, the affix analysis method is specifically: for entity categories whose entity names end with fixed Chinese characters, affix information associated with the entity types is learned from large-scale unlabeled data; by counting the frequency of the affixes of the words most similar to each category, the affixes associated with the difficult entity categories are obtained, and the category of the entity is obtained by analyzing its affix.
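The frequency statistics over affixes can be sketched as follows. This is only an illustration under assumptions: the affix is taken to be the final character of a word, the function name `category_suffixes` is hypothetical, and the word list is invented for the example rather than taken from the patent's data.

```python
from collections import Counter


def category_suffixes(similar_words, top_k=1):
    """Count the final character of each word and return the top_k most
    frequent ones as candidate affixes for the category."""
    counts = Counter(word[-1] for word in similar_words if word)
    return [suffix for suffix, _count in counts.most_common(top_k)]


# Invented toy list of words assumed to be similar to a "scenic spot" category.
similar_to_scenic_spot = ["中山公园", "颐和园", "植物园", "灵隐寺"]
suffixes = category_suffixes(similar_to_scenic_spot)
```

An entity name ending in one of the returned characters could then be treated as evidence for the associated category.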

The invention also provides an entity classification system for linked data that implements the above method, comprising a preprocessing module, a statistical classification module, and a post-processing module. The preprocessing module segments the text description in the entity page and extracts the infobox attribute names and the resulting words as the feature representation of the entity page. The statistical classification module trains classification models with the maximum entropy classification algorithm and identifies the entity category from the description of the entity in the entity page. The post-processing module corrects the entity category produced by the statistical classification module using multi-granularity model fusion, category-associated attributes, and first-sentence understanding, giving the corrected entity category.

In the above entity classification system for linked data, the word segmentation tool is the Stanford CoreNLP toolkit, and the classification model uses the maximum entropy classifier package Maxent.

Compared with the prior art, the beneficial effects of the invention are:

The invention provides an entity classification method and system for linked data that achieves high-precision entity classification through a statistical classification process and a post-processing process. On top of a basic classification of the text, the result of classifying the entity description text is corrected using the following methods:

(1) Multi-granularity word segmentation model fusion, which overcomes the weaknesses of a single segmentation granularity in text feature extraction;

(2) Correction of the fused entity category using category-associated attribute information, in order to fix obvious errors;

(3) Deep understanding of the first sentence, which reduces the effect of text noise;

(4) Identification of difficult samples, with methods such as link analysis and affix analysis used to verify the recognition result.

Existing entity classification methods do not process the recognized category any further and cannot correct results that may be wrong. The invention, by contrast, uses a post-processing flow to correct cases where the text-based statistical classification module may be wrong. The proposed technical solution is easy to implement and debug, efficient, and accurate, and is well suited for enterprises that need knowledge management of linked data; it can classify entities with high precision. In the JIST2015 entity classification evaluation, the scheme of the invention reached an accuracy of 98.6%, the highest accuracy among the classification schemes in that evaluation.

Brief description of the drawings

Fig. 1 is a flow diagram of the entity classification method for linked data provided by the invention.

Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention.

Fig. 3 is a flow diagram of the first-sentence understanding step of the method provided by the invention.

Detailed description of the invention

The invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.

The invention provides an entity classification method and system for linked data that achieves high-precision entity classification through a statistical classification process and a post-processing process. The statistical classification process classifies by modeling the text information; the post-processing process uses rich resources (such as affix information and link data) to correct the statistical classification result. Fig. 1 is a flow diagram of the entity classification method for linked data provided by the invention. As shown in Fig. 1, the method includes a preprocessing process, a statistical classification process, and a post-processing process. The entity page is first segmented and its features extracted, and the extracted features are then used to train statistical classification models. For the classification result, multi-granularity model fusion is first used to correct single-model prediction errors; category-associated attribute information is then used to correct the fused entity category, fixing obviously wrong predictions; the first-sentence description of the entity page is then analyzed in depth to determine its category. For samples of categories that are hard to classify correctly, the invention can further correct the category through link analysis and affix analysis. The concrete steps are:

1) Preprocess the entity page, including Chinese word segmentation (typical methods are forward maximum matching, backward maximum matching, and statistical sequence labeling) and feature extraction (extracting word features and infobox attribute name features to represent the page), to obtain the entity page features;

2) Using the entity page features extracted in step 1), classify the entity page with maximum entropy models at multiple segmentation granularities, giving a preliminary prediction of the entity category;

In this embodiment, two classifiers are trained with maximum entropy models. One classifier's features are the words produced by segmentation at named-entity-recognition granularity plus the infobox attributes; the other's are the words produced by segmentation without named entity recognition plus the infobox attributes.

3) Post-process the entity category obtained in step 2) to verify whether the classification result is reliable, specifically including the following steps:

31) Fuse the classification results of the classifiers trained on features of different segmentation granularities;

In this embodiment, two segmentation granularities are used: segmentation with named entity recognition and segmentation without named entity recognition;

32) Build a category attribute database in advance, and use the category-associated attribute information in it to correct obvious prediction errors;

33) Deeply understand the first sentence of the text description with a parser, analyzing the sentence structure with methods such as syntactic parsing to obtain entity category information and correct the earlier prediction;

34) Use a confusion matrix to identify categories that are hard to classify correctly, and further verify predictions of those categories, including:

341) correcting the entity category using the categories of the neighboring pages that the entity page links to;

342) correcting the entity category using the affix information of the entity page.

Fig. 2 is a structural block diagram of the entity classification system for linked data provided by an embodiment of the invention. The system includes a preprocessing module, a statistical classification module, and a post-processing module, each of which is described further below:

Preprocessing module

An entity page in linked data generally contains a text description and an infobox.

In the preprocessing module, the Stanford CoreNLP toolkit is used to segment the text description in the entity page. This embodiment uses two different segmentation granularities: with named entity recognition and without named entity recognition. For example, when segmenting with named entity recognition, "New York Times Square" is treated as a single token, whereas without named entity recognition it is split into the three words "New York", "Times", and "Square".

After the Chinese text is segmented, the infobox attribute names together with the resulting words are extracted as the feature representation of the entity page.
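The feature representation described above can be sketched as a bag of features that mixes word tokens with infobox attribute names. The function name `page_features` and the `w=` / `attr=` prefixes are assumptions made for this example (the prefixes simply keep the two feature kinds apart); the tokens would come from a segmenter such as Stanford CoreNLP, which is not called here.

```python
from collections import Counter


def page_features(tokens, infobox_attrs):
    """Bag-of-features for one entity page: segmented words plus
    infobox attribute names, each counted by occurrence."""
    feats = Counter(f"w={token}" for token in tokens)
    feats.update(f"attr={name}" for name in infobox_attrs)
    return feats


# The same page at two granularities (tokens shown as English glosses).
with_ner = page_features(["New York Times Square", "is", "a", "landmark"], ["location"])
without_ner = page_features(["New York", "Times", "Square", "is", "a", "landmark"], ["location"])
```

Only the representation without named entity recognition contains the feature `w=Square`, which is the kind of difference the two classifiers are meant to capture.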

Statistical classification module

The invention mainly uses the description of the entity in the entity page as the basis for judging its category. It trains classification models with the maximum entropy classification algorithm, a log-linear model commonly used in natural language processing. As mentioned for the preprocessing module, the features used in the statistical classification module include word features and infobox attribute features. The word features are the classic bag-of-words representation. The infobox attribute features play a very important role in identifying the entity category; for example, "date of birth" is very likely to be associated with entities of the person type.
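To make the log-linear form concrete, the following minimal sketch computes the prediction of a maximum entropy classifier from already-trained weights. It does not use or reproduce the Maxent package; the function name `maxent_predict`, the weight layout keyed by (feature, category), and the weight values are assumptions for illustration only.

```python
import math


def maxent_predict(features, weights, categories):
    """P(y|x) proportional to exp(sum of weight[(feature, y)] * value),
    normalized over all categories."""
    scores = {
        y: sum(weights.get((f, y), 0.0) * v for f, v in features.items())
        for y in categories
    }
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}


# Hypothetical weight: the "date of birth" attribute favors the person category.
weights = {("attr=date of birth", "person"): 2.0}
probs = maxent_predict({"attr=date of birth": 1}, weights, ["person", "city"])
```

With these made-up weights the "person" category receives the larger probability, mirroring the "date of birth" example in the text.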

In the text classification module, word segmentation at different granularities is used to train the text classification models, because in some cases a single segmentation granularity cannot meet the needs of classification. For example, if "New York Times Square" is treated as one named entity, it is less useful for classification than if it is split into "New York", "Times", and "Square", because the word "Square" has a decisive influence on classification. On the other hand, without named entity recognition a name such as "Zhang Yishan" is split into the single characters "Zhang", "Yi", and "Shan", which also harms the classification result. Therefore, in the statistical classification module, this embodiment uses the maximum entropy classifier package Maxent (available at http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html) to train two classification models: one on fine-grained segmentation with named entity recognition, and one on plain coarse-grained word segmentation.

Post-processing module

For cases where the text-based statistical classification module may be wrong, the invention uses the post-processing module for correction. The post-processing module performs the following processes:

31) Multi-granularity model fusion

Model fusion is widely used in machine learning, but most fusion methods combine different kinds of machine learning models. For natural language (especially Chinese), differences in segmentation granularity affect the performance of the whole model. Given the respective strengths and weaknesses of different segmentation granularities, the invention proposes using model fusion so that the classification models obtained at the various granularities complement one another.

We define P_w(y|x) as the probability distribution over category y predicted for sample x by the maximum entropy model that uses only word segmentation as features, and P_n(y|x) as the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features. The results of the two classifiers are fused as follows:

P_multi(y|x) = λ P_w(y|x) + (1 - λ) P_n(y|x)    (Formula 1)

In Formula 1, P_multi(y|x) is the fused probability distribution predicted by the classifiers of different segmentation granularities; P_w(y|x) is the probability distribution predicted for sample x by the maximum entropy model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y|x) is the probability distribution predicted by the maximum entropy model that adds named entity annotation to word segmentation as features; λ is the parameter that adjusts the linear interpolation weight. In this embodiment, λ = 0.5.
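Formula 1 is a linear interpolation of two probability distributions. A minimal sketch, assuming each classifier's output is available as a dictionary from category to probability; the function name `fuse` and the probability values are made up for illustration.

```python
def fuse(p_w, p_n, lam=0.5):
    """Formula 1: P_multi(y|x) = lam * P_w(y|x) + (1 - lam) * P_n(y|x)."""
    categories = set(p_w) | set(p_n)
    return {
        y: lam * p_w.get(y, 0.0) + (1 - lam) * p_n.get(y, 0.0)
        for y in categories
    }


# Made-up predictions of the two classifiers for one entity page.
p_w = {"city": 0.6, "scenic spot": 0.4}   # word segmentation only
p_n = {"city": 0.3, "scenic spot": 0.7}   # with named entity annotation
p_multi = fuse(p_w, p_n, lam=0.5)
best = max(p_multi, key=p_multi.get)
```

Because both inputs sum to one and the weights λ and 1 - λ sum to one, the fused values again form a probability distribution; in this toy case the two classifiers disagree and the fused prediction follows the more confident one.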

32) Correcting predictions with category-associated attributes

This module uses category-associated attributes to correct some obviously wrong category predictions. It mainly exploits the category specificity of infobox attributes. As shown in Table 1, some attributes cannot be associated with certain categories; for example, "game platform" cannot be associated with a city entity. This specificity can therefore be used to correct obvious classifier errors. The invention manually builds a category attribute database for the predefined entity types and uses it to correct predictions.

Table 1. Examples of category-associated attributes
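The attribute-based correction can be sketched as a lookup against a small hand-built table. Everything below is an assumption for illustration: the table contents, the name `correct_by_attributes`, and in particular the fallback rule (take the next most probable category that is compatible with the page's attributes), which the text does not specify.

```python
# Hypothetical category attribute database: attributes a category cannot have.
INCOMPATIBLE = {
    "city": {"game platform"},
    "person": {"game platform"},
}


def correct_by_attributes(ranked_categories, infobox_attrs, incompatible=INCOMPATIBLE):
    """Return the highest-ranked category that does not conflict with any
    infobox attribute; keep the top prediction if every category conflicts."""
    attrs = set(infobox_attrs)
    for category in ranked_categories:
        if not (incompatible.get(category, set()) & attrs):
            return category
    return ranked_categories[0]


# Categories ordered by fused probability, most probable first.
corrected = correct_by_attributes(["city", "game"], ["game platform", "developer"])
```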

33) Deeply understanding the first sentence of the entity description with a dependency parser, to identify the entity category more precisely

The first sentence of the description of an entity page in linked data (such as Wikipedia or Baidu Baike) is usually a qualitative description of the entity (for example: "Pounding Six is a card game popular in Tianjin"). Deeply understanding this first sentence is therefore a great help in identifying the entity category precisely.

Fig. 3 is a flow diagram of the first-sentence understanding step of the method. The invention first uses a dependency parser to find the judgment-sentence object in the first sentence of the entity page's text description, and then uses this object to analyze the category of the entity page. This specifically includes the following steps:

331) Identifying the judgment-sentence object

The invention uses the Stanford dependency parser to perform dependency parsing on the first sentence of the entity description, identifying its subject, predicate, and object. If the object of the first sentence has a direct dependency relation with the copula "is" (是), the object is called a "judgment-sentence object"; otherwise it is called a "non-judgment-sentence object".

If the object of the first sentence is a judgment-sentence object, it can be used as a clue to determine the entity's category and thus verify whether the classifier's prediction is accurate. If the classifier's prediction conflicts with the conclusion drawn from the judgment-sentence object, that conclusion is used to correct the classifier's prediction. If the first sentence contains no judgment-sentence object, this step is skipped and the process continues with 34).

For example, for the sentence "Pounding Six is a card game popular in Tianjin", dependency parsing finds that "game" is the object of the sentence and that "game" has a direct dependency relation with "is", so "game" is the judgment-sentence object. If "game" is a predefined entity category in the classification system, it is used as the category of this entity.
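The check for a judgment-sentence object can be sketched on top of parser output. The sketch below does not call the Stanford parser; it assumes the parse is already available as a list of (head, relation, dependent) triples, which is a hypothetical format chosen for this example, and the relation labels shown are only illustrative. Words are given as English glosses.

```python
def judgment_object(arcs, obj, copula="是"):
    """Return obj if it shares a direct dependency arc with the copula
    (in either direction); otherwise return None."""
    for head, _relation, dependent in arcs:
        if {head, dependent} == {obj, copula}:
            return obj
    return None


# Hand-written arcs for "Pounding Six is a card game popular in Tianjin".
arcs = [
    ("game", "cop", "is"),
    ("game", "nsubj", "Pounding Six"),
    ("game", "compound", "card"),
]
found = judgment_object(arcs, "game", copula="is")
```

When the function returns a word that is also a predefined category name, that word can be used directly as the entity's category; otherwise the similarity-based correction of step 332) applies.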

332) Correcting the category prediction with the judgment-sentence object

In some cases, even when a judgment-sentence object has been found, it cannot be used indiscriminately to correct the prediction, because doing so may introduce unnecessary errors. Moreover, in many cases the judgment-sentence object does not exactly match a category name. For example, in "Masako Nozawa is a famous Japanese voice actor", dependency parsing identifies "voice actor" as the judgment-sentence object, but the predefined entity categories may well not include a "voice actor" category. For this, the invention defines a correction condition: using word similarity, that is, the cosine similarity between word vectors trained on a large-scale unannotated corpus, it finds the category most similar to the judgment-sentence object, so that the category can be corrected reliably.

In natural language processing, the cosine similarity of word vectors is commonly regarded as the semantic similarity of words. Specifically, this embodiment first uses the word2vec toolkit (https://word2vec.googlecode.com/svn/trunk/) to train Chinese word vectors on the Chinese Gigaword corpus (a publicly available dataset), and then uses the trained vectors to find the category name semantically most similar to the judgment-sentence object. If the cosine similarity between the word vectors of the judgment-sentence object and its most similar category exceeds a preset threshold (0.9 in this embodiment), the category of the entity is corrected to the most similar category.

To this end, let w₀ be the judgment-sentence object of the first sentence of the entity page's text description, let y ∈ Y be a category name (Y is the set of entity categories), and let sim(w₁, w₂) be the cosine similarity of the word vectors of words w₁ and w₂. The correction condition is then given by Formula 2:

y* = argmax_{y∈Y} sim(w₀, y)  ∧  sim(w₀, y*) > 0.9    (Formula 2)

In Formula 2, ∧ denotes logical "and". The part before ∧ (the left term) states that y* is the category with the highest semantic similarity; the part after ∧ (the right term) states that the similarity between y* and w₀ must exceed 0.9. The correction is made only when the condition of Formula 2 holds; that is, only when y* is the most semantically similar category and its similarity to w₀ exceeds 0.9 is the original category prediction replaced by y*.
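The correction condition of Formula 2 can be sketched with plain Python lists standing in for word vectors. The two-dimensional vectors below are made up for illustration; in the embodiment they would come from word2vec trained on Chinese Gigaword. The function names `cosine` and `correct_category` are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correct_category(object_vec, category_vecs, predicted, threshold=0.9):
    """Formula 2: pick y* = argmax sim(w0, y); replace the prediction
    only if sim(w0, y*) exceeds the threshold."""
    y_star = max(category_vecs, key=lambda y: cosine(object_vec, category_vecs[y]))
    if cosine(object_vec, category_vecs[y_star]) > threshold:
        return y_star
    return predicted


# Toy vectors: the object is close to "actor" and far from "city".
category_vecs = {"actor": [1.0, 0.1], "city": [0.0, 1.0]}
corrected = correct_category([0.9, 0.2], category_vecs, predicted="city")
unchanged = correct_category([1.0, 1.0], category_vecs, predicted="city")
```

In the second call the best similarity is below 0.9, so the classifier's original prediction is kept, which is the safeguard the threshold is meant to provide.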

In previous example (" wild damp refined son is that Japanese famous sound is excellent "), we can find out the class most like with " sound is excellent " Be not " performer " (as shown in table 2, table 2 be utilize that the term vector trained from Chinese gigaword calculates with classification phase As some vocabulary, wherein runic vocabulary shows that the similarity of these vocabulary and classification is more than 0.9), and find " performer " and " sound Excellent " semantic similarity more than 0.9, therefore, the classification of " wild damp refined son " this physical page is modified to " performer ".

Table 2. Words most similar to each category

34) Identifying difficult samples using a confusion matrix

In practical applications, samples of certain categories are often hard to tell apart; such samples are called difficult samples. For example, for entities of the two categories "city" and "scenic spot", the classifier often makes wrong predictions, because the descriptions and infobox attributes of these two kinds of entities are very similar. To improve classification precision, the present invention uses a confusion matrix to find the sample categories on which the classifier is error-prone. Specifically, if on the validation set the prediction precision of the statistical classification model for an entity category y_i is below 90%, then category y_i is regarded as a difficult sample category. For example, if on the validation set the statistical classification model predicts 18 entity pages as "city" but only 15 of those pages really belong to "city", then the model's prediction precision for "city" is only 83.33% (15/18), and "city" is identified as a difficult sample category. Samples that the statistical classification model predicts as belonging to a difficult sample category are called difficult samples.
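The identification of difficult sample categories can be sketched as follows. This is an illustrative sketch only: the function name `difficult_categories` is hypothetical, and apart from the 15-out-of-18 "city" figures taken from the text, the validation labels below are made up.

```python
from collections import Counter

def difficult_categories(gold, predicted, min_precision=0.9):
    """Return {category: precision} for categories whose precision is below the cutoff.

    gold, predicted -- parallel lists of true and predicted labels on a validation set.
    The precision of a category is the diagonal cell of the confusion matrix
    divided by the number of samples predicted as that category.
    """
    confusion = Counter(zip(gold, predicted))  # (true, predicted) -> count
    predicted_totals = Counter(predicted)      # samples predicted per category
    hard = {}
    for category, total in predicted_totals.items():
        precision = confusion[(category, category)] / total
        if precision < min_precision:
            hard[category] = precision
    return hard

# 18 pages predicted as "city", of which 15 are truly "city" (figures from the text);
# the 20 correctly predicted "scenic spot" pages are invented for the example.
gold = ["city"] * 15 + ["scenic spot"] * 3 + ["scenic spot"] * 20
pred = ["city"] * 18 + ["scenic spot"] * 20
hard = difficult_categories(gold, pred)
```

Here only "city" is flagged, with precision 15/18; every sample later predicted as "city" would then be treated as a difficult sample and passed to the verification steps.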

For the difficult samples identified, we use the following two methods to verify the classification results.

341) Link analysis

For difficult samples, the content of the entity page alone may be insufficient to make a correct judgment. The present invention therefore uses link analysis to verify the classification results of difficult samples.

In linked data, an entity page usually links to other entity pages related to it. Generally speaking, the categories of the entity pages it links to are very likely to be the same as its own category. The categories of the other entity pages that an entity page links to can therefore help the system better determine the category of that entity.

Specifically, for an entity page e, we analyze the entity pages that e links to and denote their set as N(e). Some pages in N(e) carry category annotations. The present invention finds the pages in N(e) that have category annotations, determines the most frequent category y* among them, and checks whether this category agrees with the category y′ that the classifier predicted for e. If the two disagree, y* is used to correct y′.
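The link analysis step amounts to a majority vote over the labeled neighbors of e, which can be sketched as follows. The function name `correct_by_links`, the `labels` dictionary, and the example pages are hypothetical; the text does not specify how ties are broken, so this sketch simply keeps whichever tied category was counted first.

```python
from collections import Counter

def correct_by_links(predicted, linked_pages, labels):
    """Return the majority category y* among labeled pages in N(e), or the prediction.

    predicted    -- y', the classifier's category for page e
    linked_pages -- N(e), identifiers of the pages that e links to
    labels       -- hypothetical dict: page identifier -> annotated category
    """
    votes = Counter(labels[p] for p in linked_pages if p in labels)
    if not votes:
        return predicted  # no labeled neighbor: keep y'
    y_star, _ = votes.most_common(1)[0]
    return y_star  # equals y' when they agree, replaces it otherwise

# Made-up example: two labeled neighbors vote "scenic spot", one votes "city".
labels = {"West Lake": "scenic spot", "Leifeng Pagoda": "scenic spot", "Hangzhou": "city"}
neighbors = ["West Lake", "Leifeng Pagoda", "Hangzhou", "unlabeled page"]
result = correct_by_links("city", neighbors, labels)
```

Unlabeled neighbors are ignored, and a page with no labeled neighbors keeps the classifier's prediction.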

342) Affix analysis

For some samples that are hard to distinguish, the present invention also uses affix analysis to verify their classification results. For some categories, entity names usually end with fixed Chinese characters. For example, "city" entities usually end with "city" or "county", and "scenic spot" entities usually end with "lake", "mountain" and the like. Table 3 lists examples of common affixes of entity categories.

Table 3. Common affixes of entity categories

Category	Common affixes
Scenic spot	mountain, lake, city, road, scenery, river, ridge, cave
City	district, city, county

The present invention is the first to propose learning the affixes associated with entity types from large-scale unlabeled data. Specifically, we use the word-vector toolkit word2vec to train word vectors on the Chinese Gigaword data set and then, by computing cosine similarity, find the words semantically closest to each category (words whose word-vector cosine similarity exceeds 0.7). Then, by counting the frequencies of the affixes of the closest words of each of the two categories, we obtain the affixes associated with each difficult sample category, so that the category of an entity can be determined by analyzing its affix. Specifically, if the frequency of the affix s of an entity page in a category y₁ is significantly higher (more than 2 times) than its frequency in another category y₂, we take y₁ as the category of the entity and correct the original prediction. For example, for the "Mount Lushan Fairy Cave" entity page, the affix "cave" occurs in the "scenic spot" category far more often than in the "city" category, so the predicted category of this entity is corrected to "scenic spot".
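The affix rule can be sketched as follows. Several points are assumptions made for illustration: the affix is taken to be the final character of the name, frequencies are raw counts rather than normalized rates, and the function names and short word lists are hypothetical stand-ins for the words whose similarity to each category exceeds 0.7.

```python
from collections import Counter

def affix_counts(similar_words):
    """Count the final-character affixes of the words closest to one category."""
    return Counter(word[-1] for word in similar_words if word)

def correct_by_affix(name, predicted, counts_by_category, ratio=2.0):
    """Return the category whose affix count is more than `ratio` times every other's.

    counts_by_category -- dict: category -> Counter of affix frequencies
    """
    affix = name[-1]
    freq = {c: counts[affix] for c, counts in counts_by_category.items()}
    for y1, f1 in freq.items():
        if f1 > 0 and all(f1 > ratio * f2 for y2, f2 in freq.items() if y2 != y1):
            return y1  # affix is clearly associated with y1
    return predicted   # no clear winner: keep the original prediction

# Made-up word lists standing in for the nearest neighbors of each category.
scenic_words = ["黄山", "泰山", "西湖", "溶洞", "水帘洞"]
city_words = ["北京市", "上海市", "朝阳区", "长兴县"]
counts = {"scenic spot": affix_counts(scenic_words), "city": affix_counts(city_words)}
result = correct_by_affix("庐山仙人洞", "city", counts)
```

In this toy data the affix "洞" (cave) appears only among the "scenic spot" neighbors, so a page named "庐山仙人洞" that was predicted as "city" is corrected to "scenic spot"; a name whose affix is unseen in every category keeps its prediction.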

It should be noted that the embodiments are disclosed to aid further understanding of the present invention, and those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. The present invention should therefore not be limited to what is disclosed in the embodiments, and the scope of protection claimed is defined by the claims.

Claims

1. An entity classification method for linked data, the linked data being a plurality of entity pages, each entity page comprising a text description and an infobox; the entity classification method comprises a preprocessing stage, a statistical classification stage and a post-processing stage, and specifically comprises the following steps:

1) in the preprocessing stage, performing word segmentation on the text description information in the entity page to obtain word information; the features of the entity page are formed from the attribute names of the infobox and the word information;

2) in the statistical classification stage, using the features of the entity page, training statistical classification models with multiple segmentation granularities to classify the entity page, obtaining a preliminary prediction result for the entity category;

3) in the post-processing stage, correcting the preliminary prediction result of the entity category to obtain a corrected entity category; the correction comprises the following steps:

31) by a multi-granularity model fusion method, fusing the preliminary entity category predictions obtained by the statistical classification models trained with the multiple segmentation granularities, obtaining a fused entity category result;

32) building a category attribute database and using the category-associated attribute information in the category attribute database to correct the fused entity category, obtaining the entity category corrected by category-associated attributes;

33) parsing the sentence structure with a syntactic analysis method and, through deep understanding of the first sentence of the text description, correcting the entity category corrected by category-associated attributes obtained in step 32), obtaining the entity category corrected by first-sentence deep understanding.

2. The entity classification method for linked data according to claim 1, characterized in that the word segmentation methods of step 1) include the forward maximum matching method, the backward maximum matching method, and statistical sequence labeling methods.

3. The entity classification method for linked data according to claim 1, characterized in that step 2) uses two segmentation granularities: segmentation with named entity recognition and segmentation without named entity recognition.

4. The entity classification method for linked data according to claim 1, characterized in that the statistical classification model is a maximum entropy model; the multi-granularity model fusion method of step 31) specifically uses formula 1 to compute a fused probability distribution over the predictions of the classifiers of different segmentation granularities, thereby fusing the entity category results obtained by classifying the entity page with the maximum entropy classification models trained at the multiple segmentation granularities:

P_multi(y | x) = λ P_w(y | x) + (1 − λ) P_n(y | x) (formula 1)

In formula 1, P_multi(y | x) is the fused probability distribution of the predictions of the classifiers of different segmentation granularities; P_w(y | x) is the probability distribution predicted for sample x by the maximum entropy classification model that uses only word segmentation as features; y is the sample category and x is the sample; P_n(y | x) is the probability distribution predicted by the maximum entropy model that adds named entity labels as features on top of word segmentation; λ is the parameter that adjusts the linear interpolation weight.

5. The entity classification method for linked data according to claim 1, characterized in that parsing the sentence structure with syntactic analysis in step 33) to obtain the entity category corrected by first-sentence deep understanding specifically comprises the following steps:

331) performing dependency parsing on the first sentence of the entity description, and identifying whether the object of the first sentence is a judgment-sentence object;

332) training Chinese word vectors on a large-scale unannotated corpus, defining word similarity, computing the word similarity between the word vectors and the judgment-sentence object, and obtaining the word vector with the highest word similarity;

333) setting a cosine similarity threshold and, using the cosine similarity computation, when the cosine similarity between the word vector of the judgment-sentence object and that of its most similar category exceeds the cosine similarity threshold, correcting the category of the entity to the most similar category.

6. The entity classification method for linked data according to claim 1, characterized in that, in the post-processing stage, after the preliminary prediction result of the entity category has been corrected to obtain the corrected entity category, a confusion matrix is used to identify difficult entity categories; for the difficult entity categories identified, the entity category results are verified by a link analysis method and an affix analysis method; the confusion matrix identification method is specifically: on the validation set, when the prediction precision of the statistical classification model for an entity category y_i is below 90%, category y_i is regarded as a difficult entity category.

7. The entity classification method for linked data according to claim 6, characterized in that the link analysis method is specifically: let the category predicted by the classifier for entity page e be y′, and denote the set of entity pages linked by entity page e as N(e); find the pages in N(e) that have category annotations and determine the most frequent category among them, denoted y*; when category y* is inconsistent with the predicted category y′, use y* to correct y′, so that the category of entity page e is y*.

8. The entity classification method for linked data according to claim 6, characterized in that the affix analysis method is specifically: for entity categories whose entity names end with fixed Chinese characters, using affix information associated with entity types learned from large-scale unlabeled data, performing frequency statistics on the affixes of the closest words of each category to obtain the affixes associated with the difficult entity categories, and determining the category of the entity by analyzing its affix.

9. An entity classification system for linked data that implements the entity classification method for linked data according to any one of claims 1 to 8, characterized by comprising a preprocessing module, a statistical classification module and a post-processing module;

The preprocessing module is used to perform word segmentation on the text description information in the entity page and to extract the infobox attribute names and the word information obtained by segmentation as features, which serve as the feature representation of the entity page;

The statistical classification module trains classification models using the maximum entropy classification algorithm and identifies the entity category from the description information of the entity in the entity page;

The post-processing module is used to correct the entity category obtained by the statistical classification module using multi-granularity model fusion, category-associated attributes and first-sentence deep understanding, obtaining the corrected entity category.

10. The entity classification system for linked data according to claim 9, characterized in that the word segmentation tool is the Stanford CoreNLP toolkit; the classification model uses the maximum entropy classifier software package Maxent.