CN109815401A

CN109815401A - A name disambiguation method applied to Web people search

Info

Publication number: CN109815401A
Application number: CN201910061520.6A
Authority: CN
Inventors: 张军; 胡欣; 占梦来; 邹佩良; 王另
Original assignee: Sichuan Chengzhi Hearing Technology Co Ltd; University of Electronic Science and Technology of China
Current assignee: Sichuan Chengzhi Hearing Technology Co Ltd; University of Electronic Science and Technology of China
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2019-05-28

Abstract

The present invention discloses a name disambiguation method applied to Web people search, comprising: S1, extracting the HTML source code of web pages and removing noise unrelated to person information; S2, extracting a person web page feature set; S3, generating, from the feature set extracted in step S2, a combined feature vector representing a person-related web page; S4, performing hierarchical clustering with an agglomerative hierarchical clustering algorithm to obtain the person web page clustering result. By introducing the n-gram capitalization model, the method overcomes the limitation of traditional named entity recognition, which extracts only a limited set of entity types and cannot extract the many distinctive and proprietary terms in the text. By assigning different weights to the extracted features according to how important each is for characterizing a person, the method improves the accuracy of name disambiguation.

Description

A name disambiguation method applied to Web people search

Technical field

The invention belongs to the field of photoelectric transmission, and in particular relates to an all-fiber distributed acoustic wave sensing technology.

Background technique

With the arrival of the mobile Internet era, search engines have become an important tool for acquiring knowledge, and searching for information about people on the Internet is very common. According to statistics, about 5%-10% of search engine queries involve personal names, and fewer than 20% of users are willing to add extra information when searching for a name. At the same time, personal names are highly ambiguous: according to a report of the U.S. Census Bureau, 100 million people share only 90,000 different names. A name query on a search engine returns a mixture of web pages related to multiple people with the same name, and pages about "celebrities" tend to drown out those about "non-celebrities". For example, a Google search for "Michael Jordan" returns results involving several different person entities, such as a basketball star, a university professor, and a film actor; every page on which the name "Michael Jordan" appears may be shown, and the top results are all about the basketball star Jordan. Such results are unsatisfactory. The demand for people search is therefore widespread and urgent. The key problem of people search is to separate the web pages of same-named people by individual, and its core problem is name disambiguation, also known as namesake disambiguation.

In recent years, researchers have paid more attention to name disambiguation. It was initially studied only as an entity coreference problem, and is now defined as a clustering problem. Web name disambiguation mainly includes the following methods: classification methods based on network knowledge resources, clustering methods based on graph partitioning, and clustering methods based on the vector space model.

(1) Classification method based on network knowledge resources

The classification method based on network knowledge resources uses open resources available on the network to construct a dedicated classification system, establishes a correspondence between these categories and the highly discriminative social attributes of people in the real world, and then assigns people to different categories by social attribute, thereby achieving disambiguation.

A typical approach of this kind extracts a professional directory and collects documents related to a variety of occupational categories of moderate granularity as training data. Assuming that one occupation corresponds to one person, each document is classified into the real-world occupational classification system, and whether the people in two documents are the same is then judged by whether their occupations are the same.

(2) Clustering method based on graph partitioning

The clustering method based on graph partitioning constructs a graph with documents as vertices and connections between documents as edges, and then completes clustering by graph partitioning.

A typical example is name disambiguation based on social networks. It assumes that the namesakes sharing a name belong to different circles, or that the information about the namesakes on the Internet overlaps very little even if their circles overlap, while within the same circle the associations between people and information are very close. Under this assumption, such methods treat documents as nodes and treat link relationships between Internet documents, or name co-occurrences exceeding a threshold count, as edges. They construct a social network and cluster it with a graph partitioning method to obtain different circles. If several documents belong to the same circle, the same name appearing in them is considered to refer to the same real-world person.

(3) Clustering method based on the vector space model

The vector space model was first used in clustering to solve the cross-document personal name coreference resolution problem^[1]. The system first generates a coreference chain for each document, then extracts the sentences related to that chain to produce a summary of each document. Finally, the system computes the similarity between the document summaries; if the similarity of two documents exceeds a specific threshold determined by experiment, they are considered to refer to the same person entity and are placed in the same cluster.

Later methods based on the vector space model follow the same idea of extracting feature vectors and clustering with the standard vector space model; their work concentrates mainly on feature extraction, the use of network resources, and the clustering method.

Existing Web name disambiguation techniques have shortcomings. For example, the classification method based on network knowledge resources requires that same-named people not share the same occupation, and the clustering method based on graph partitioning requires that the social circles of same-named people overlap little. In addition, for clustering methods based on the vector space model, limitations in earlier feature selection and processing and in feature set fusion restrict their effectiveness for Web name disambiguation.

Summary of the invention

To solve the above technical problems, the present invention proposes a name disambiguation method applied to Web people search, which combines the n-gram capitalization model feature with a feature-combination name disambiguation method to achieve name disambiguation for Web search.

The technical solution adopted by the present invention is a name disambiguation method applied to Web people search, comprising:

S1, extracting the HTML source code of web pages and removing noise unrelated to person information. The noise unrelated to person information includes at least: tags and scripts in the HTML source code, navigation menus that are useless for name disambiguation, and advertisements that are useless for name disambiguation.

S2, extracting a person web page feature set. The person web page feature set comprises: the web page URL, the web page title and abstract, the web page text, named entities, and the n-gram capitalization model.

S3, generating, from the person web page feature set extracted in step S2, a combined feature vector representing a person-related web page;

S4, performing hierarchical clustering with an agglomerative hierarchical clustering algorithm to obtain the person web page clustering result.

Further, step S3 includes:

S31, modeling the person web page feature set extracted in step S2 with the vector space model to obtain the web page URL feature set, the web page title and abstract feature set, the feature set generated from the web page text, the feature set generated from named entities, and the feature set generated from the n-gram capitalization model;

S32, constructing the combined feature vector from the feature sets of step S31 by linear weighting.

The weighting coefficient of each feature set in the linear weighting is determined by how much that feature set contributes to indicating the correct name disambiguation.

Further, the method also includes re-optimizing the weight values of the combined feature vector of step S32 with the TF-IDF statistical method to obtain the final combined feature vector representing a person-related web page.

The expression of the final combined feature vector representing a person-related web page is as follows:

where w_i denotes the redefined weight, 1≤i≤m'; tf_i is the frequency of occurrence of keyword k_i in the combined feature vector; N_d is the total number of web documents of the person to be disambiguated; df_i is the number of documents in which keyword k_i appears; and m' is the number of feature words after the final combination.

The redefined weight w_i is calculated as follows:
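A form consistent with the symbol definitions above, assuming the standard logarithmic TF-IDF weighting (the vector name F' and the exact shape of the IDF term are assumptions, not taken from the source), would be:

```latex
\vec{F}' = \bigl( (k_1, w_1),\ (k_2, w_2),\ \ldots,\ (k_{m'}, w_{m'}) \bigr),
\qquad
w_i = tf_i \times \log \frac{N_d}{df_i}, \quad 1 \le i \le m'
```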

Beneficial effects of the present invention: the invention preprocesses the HTML and then performs feature extraction on the preprocessed text, extracting multiple features and adding the n-gram capitalization model as an important web page feature. It introduces a method of weighting feature vectors by feature importance, assigning different weights to features according to how well each characterizes a person. After the web page feature vectors are constructed, hierarchical clustering merges the person web pages that meet the similarity threshold into one class; the final result is the Web name disambiguation result, in which the web pages of same-named people are separated by entity. The invention has the following advantages:

(1) Introducing the n-gram capitalization model overcomes the limitation of traditional named entity recognition, which extracts only a limited set of entity types and cannot extract the many distinctive and proprietary terms in the text.

(2) Assigning different weights to the extracted features according to their importance for characterizing a person improves the accuracy of name disambiguation.

Brief description of the drawings

Fig. 1 is a flowchart of the solution of the present invention.

Specific embodiment

The prior art related to the present invention is first briefly described:

1. Term frequency-inverse document frequency algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for assessing how important a word is to a particular document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. In short, the more often a word appears in a document and the less often it appears in all other documents, the better it represents the content of that document.

Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.

TF is the term frequency, which refers to the number of times a given word appears in a document. The same word is likely to appear more often in a long document than in a short one regardless of its importance, so to prevent a bias toward long documents the count is normalized by dividing it by the total number of words in the document. The TF formula is as follows:

IDF is the inverse document frequency. Some common words appear in large numbers in every document, so the weight computed by the TF formula is certainly large, yet such words do not reflect the topic of a document. What is needed are words that appear often in one document and rarely in others, since only such words reflect the document's topic. TF alone cannot achieve this, but the inverse document frequency can. The fewer documents contain a term, the larger its IDF, which indicates that the term has good class discrimination ability. The IDF value of a term is calculated as follows:

1 is added to the denominator to prevent it from being 0. A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole document set, yields a high TF-IDF weight; TF-IDF therefore tends to filter out common words and retain important ones. TF-IDF = TF × IDF
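The description above can be sketched in Python as follows. This is a minimal illustration that assumes a natural logarithm and documents given as token lists; the function names are hypothetical, and only the normalization by document length and the "+1" in the denominator are taken from the text.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by the document length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency with 1 added to the denominator."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = TF x IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["jordan", "basketball", "nba"],
    ["jordan", "professor", "berkeley"],
    ["jordan", "actor"],
]
score = tf_idf("basketball", corpus[0], corpus)
```

With the "+1" smoothing, a word that occurs in every document (here "jordan") receives a slightly negative IDF, so it is ranked below words that occur in only one document.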

The TF-IDF algorithm rests on the following assumption: the words that best distinguish documents are those that appear frequently in one document and rarely in the other documents of the collection. If the feature space coordinates use the term frequency TF as the measure, they can therefore reflect the characteristics of texts of the same class. In addition, to account for the ability of a word to distinguish between classes, TF-IDF holds that the smaller the document frequency of a word, the greater its ability to distinguish texts of different classes. The concept of inverse document frequency IDF is therefore introduced, and the product of TF and IDF is used as the measure in the feature space coordinate system; it adjusts the TF weight so as to highlight important words and suppress secondary ones. In essence, however, IDF is a weighting that attempts to suppress noise, and it simply assumes that words with a small document frequency are more important and words with a large document frequency are less useful, which is clearly not entirely correct. The simple structure of IDF cannot effectively reflect the importance of words or the distribution of feature words, so it cannot adjust the weights well, and the accuracy of the TF-IDF method is not very high. Moreover, the TF-IDF algorithm does not take the position of a word into account. For Web documents, the weight calculation should reflect the structure of the HTML: feature words in different tags reflect the content of the page to different degrees, so their weights should be calculated differently. Feature words at different positions in a web page should therefore be assigned different coefficients, which are then multiplied by the term frequency of the feature word, to improve the quality of the text representation.

2. Agglomerative hierarchical clustering

Hierarchical clustering is a clustering algorithm that attempts to partition the data set at different levels, thereby forming a tree-shaped cluster structure. Its advantage is that the number of clusters does not need to be specified in advance. Agglomerative hierarchical clustering follows a "bottom-up" approach: it first treats each sample as a separate cluster and then repeatedly merges the closest pair of clusters until all samples belong to the same cluster.

Assuming that the name to be disambiguated in a given web document corresponds to only one real-world individual, name disambiguation can be regarded as a hard clustering problem whose clusters do not overlap. Because the number of same-named people is unknown and not fixed, the problem is also unsupervised, so the agglomerative hierarchical clustering algorithm is suitable. The similarity between two documents is expressed by the cosine of the angle between their feature vectors, and the similarity between classes uses the average distance method. The formula is as follows:
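A standard form of these two similarities, assumed here because it matches the description (the vector symbols v_i and the cluster names c_a, c_b are illustrative and not taken from the source), is:

```latex
\mathrm{sim}(d_i, d_j) = \cos\theta
  = \frac{\vec{v}_i \cdot \vec{v}_j}{\lVert \vec{v}_i \rVert \, \lVert \vec{v}_j \rVert},
\qquad
\mathrm{sim}(c_a, c_b) = \frac{1}{|c_a| \, |c_b|}
  \sum_{p \in c_a} \sum_{q \in c_b} \mathrm{sim}(p, q)
```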

At the start of clustering, each web page p_i in the web page set P = {p1, ..., pi, ..., pn} corresponding to a name is regarded as a class c_i = {p_i} with a single member, so that an initial clustering C = {c1, c2, ..., cn} of P is formed. The similarity between each pair of classes {c_i, c_j} is calculated from the feature vectors as above, and the two clusters with the greatest similarity are merged into a new class, c_k = c_i ∪ c_j, thus forming a new clustering C = {c1, c2, ..., cn-1} of P. These steps are repeated until the similarity between all clusters is below a given threshold or all pages have been merged into one cluster.

To help those skilled in the art understand the technical content of the present invention, the invention is further explained below with reference to the accompanying drawing.

Fig. 1 shows the flowchart of the solution of the present invention. The implementation comprises the following steps:

S1, preprocessing

To obtain a clean data set, the web page set of same-named people must be cleaned to remove noise unrelated to person information. The goal of preprocessing is to remove the tags and scripts in the HTML source code, as well as noise such as navigation menus and advertisements that are useless for name disambiguation, and to extract the body text of the web page.

The tags and scripts in the original HTML are first removed with the HTMLParser toolkit. Then, using a method similar to that in reference^[3], for block-level tags (such as <div> and <p>) only text blocks containing more than ten words are retained, which removes navigation menus, advertisements, and the like.
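The preprocessing step can be sketched as follows. This is an illustration only: it substitutes the `html.parser` module of the Python standard library for the toolkit named in the text, and the set of block-level tags, the class name, and the function name are assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "li", "td", "h1", "h2", "h3"}  # assumed block-level set
SKIP_TAGS = {"script", "style"}

class BodyTextExtractor(HTMLParser):
    """Collects the text of block-level elements that exceed a word count."""

    def __init__(self, min_words=10):
        super().__init__()
        self.min_words = min_words
        self.blocks = []        # retained text blocks
        self._buf = []          # text of the block currently being read
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def flush(self):
        text = " ".join(" ".join(self._buf).split())
        if len(text.split()) > self.min_words:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

def extract_body_text(html, min_words=10):
    """Return the text blocks with more than min_words words."""
    parser = BodyTextExtractor(min_words)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks
```

Short blocks such as a navigation bar ("Home | About | Contact") fall under the ten-word limit and are dropped, while a full sentence inside a `<p>` element is kept.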

S2, feature extraction

It is assumed that every individual has his or her own characteristics, and that same-named people can be told apart precisely because they differ in many features. The lexical information in a person's web page likewise reflects to whom the name belongs, and the clustering algorithm can use these features and the similarity between them to decide which individual each same-named page belongs to. Different information sources require different processing strategies to extract their textual feature information.

(1) Web page URL

The URL of a web page contains some indicative information. The URL is processed as follows: it is split into independent small string units at the URL separators (: /), and a self-built dictionary is used to remove common strings that are useless for disambiguation, such as http and www. Purely numeric strings and special characters are removed as well.
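A minimal sketch of this URL processing is shown below. The stop-token dictionary is a hypothetical stand-in for the self-built dictionary, and splitting on every non-alphanumeric character (which also splits at dots) is an assumption that is broader than the two separators named in the text.

```python
import re

# Hypothetical stand-in for the self-built dictionary of useless URL strings.
URL_STOP_TOKENS = {"http", "https", "www", "com", "html", "htm", "index"}

def url_features(url):
    """Split a URL into tokens and drop stop tokens and purely numeric strings."""
    tokens = re.split(r"[^A-Za-z0-9]+", url.lower())
    return [t for t in tokens
            if t and not t.isdigit() and t not in URL_STOP_TOKENS]

features = url_features("http://www.cs.berkeley.edu/~jordan/2019/index.html")
```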

(2) Web page title and abstract

The title is a text that summarizes the web page text, and the abstract is the summary information returned by the search engine, a high-level overview of the information related to the query term; both therefore indicate the correct disambiguation more accurately than the web page text. The title and abstract are processed with general text processing methods: tokenization, normalization, stop-word removal, and stemming, finally yielding a set of stems that may contain repeated words.

(3) Web page text

The purpose of processing the web page text is to convert its vocabulary into a more regular form. Because of grammatical features such as singular and plural forms, tense, and voice, English has rich inflection; the web page text is therefore processed in a way similar to the title and abstract, and stemming is required.
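The text processing chain for titles, abstracts, and web page text can be sketched as below. The stop-word list is a tiny illustrative sample, and `naive_stem` is a toy suffix stripper used only to keep the example self-contained; a real system would more likely use an established stemmer such as the Porter algorithm. All names here are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to"}  # sample only

def naive_stem(word):
    """Toy stemmer: strip one common English suffix if a stem of 3+ letters remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def text_features(text):
    """Tokenize, lowercase, remove stop words, and stem; repeated stems are kept."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

stems = text_features("The professor is teaching courses and teaches students")
```

Keeping repeated stems matters because the later vector space model uses the frequency of each keyword as its initial weight.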

(4) Named entities

Named entities are important features for distinguishing a person's identity: the place where a person lives, the person's occupation, and the name of the person's employer can all identify a person well. An NER tool is used to recognize named entities such as personal names, place names, and company or organization names appearing in the web page text. Since named entities are mostly proper nouns, they only need to be converted uniformly to lowercase and do not require stemming. Other special strings appearing in the text, such as e-mail addresses and numeric strings, are also identified with the help of regular expression rules.
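The regular expression rules are not given in the text. The sketch below shows one plausible pair of rules for e-mail addresses and numeric strings; both patterns, the four-digit minimum, and the function name are assumptions.

```python
import re

# Assumed patterns; the source does not specify its regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
NUMBER_RE = re.compile(r"\b\d{4,}\b")  # runs of four or more digits

def special_strings(text):
    """Return lowercased e-mail addresses followed by long numeric strings."""
    emails = [m.lower() for m in EMAIL_RE.findall(text)]
    remainder = EMAIL_RE.sub(" ", text)  # avoid matching digits inside addresses
    numbers = NUMBER_RE.findall(remainder)
    return emails + numbers

found = special_strings("Contact Jordan@CS.Berkeley.edu or call 5551234 since 1998.")
```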

(5) n-gram capitalization model

The n-gram capitalization model is an important feature of the text. Because the types recognized by named entity recognition are limited, every run of n consecutive words beginning with a capital letter is selected as a feature term; movie titles, award names, and the like all satisfy this requirement. The n-gram capitalization model is therefore extracted to form a word list, which is used as a person feature of the web page.

The n-gram model is an algorithm based on statistical language modeling. Its basic idea is to slide a window of size n over the content of the text, byte by byte, forming a sequence of byte fragments of length n. Each fragment is called a gram. The occurrence frequencies of all grams are counted and filtered by a preset threshold to form a list of key grams, which is the vector feature space of the text; each gram in the list is one dimension of the feature vector.
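One way to read the n-gram capitalization feature is sketched below: a window of n words slides over the text, and a window is kept when every word in it starts with a capital letter. Working at word level rather than byte level, stopping windows at punctuation, and the parameter names are all assumptions made for this illustration.

```python
import re
from collections import Counter

def capitalized_ngrams(text, n=2, min_count=1):
    """Count runs of n consecutive capitalized words; keep those seen min_count+ times."""
    counts = Counter()
    # Split at punctuation so that a window never spans two clauses.
    for segment in re.split(r"[.,;:!?()\n]+", text):
        words = segment.split()
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if all(w[0].isupper() for w in window):
                counts[" ".join(window).lower()] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

text = "He won the Academy Award. The Academy Award went to Space Jam."
grams = capitalized_ngrams(text, n=2)
frequent = capitalized_ngrams(text, n=2, min_count=2)
```

The frequency threshold plays the role of the preset filter described above; sentence-initial words such as "The" still produce some noisy grams, which a higher threshold tends to remove.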

S3, feature combination

The extracted person web page feature set must be modeled, that is, converted into a mathematical model that a computer can process, before it can be handled further.

First, the vector space model (VSM) is used to model each kind of extracted feature information separately: each feature set is represented as a multidimensional vector of weighted keywords, where the weight of a keyword is set to its frequency of occurrence in the feature set. The modeling is assumed to be as follows:

Webpage URL feature set f1:

Web page title and abstract feature set f2:

The feature set f3 that Web page text generates:

The feature set f4 for naming entity to generate:

N member capitalizes the feature set f5 that model generates:

where tf is the frequency of occurrence of keyword k in the feature set.

The five independent feature vectors are then fused by linear weighting to construct the combined feature vector. The weighting coefficients λ₁, λ₂, λ₃, λ₄, and λ₅ are set according to how much each feature set contributes to indicating the correct disambiguation; the specific values are chosen manually by comparing experimental results. For example, the named entities and the n-gram capitalization model features of a web page indicate the correct disambiguation more strongly than the web page text, so their weighting coefficients should be higher than that of the web page text.
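The linear weighting can be sketched as a weighted sum of keyword frequencies over the feature sets. The coefficient values below are hypothetical placeholders, since the text states only that they are chosen by experiment, and the function name is an assumption.

```python
from collections import Counter

def combine_features(feature_sets, weights):
    """Fuse per-feature keyword lists into one vector: sum of lambda * frequency."""
    combined = Counter()
    for name, keywords in feature_sets.items():
        lam = weights[name]
        for keyword, freq in Counter(keywords).items():
            combined[keyword] += lam * freq
    return dict(combined)

# Hypothetical coefficients: named entities (f4) weigh more than page text (f3).
weights = {"f1": 1.0, "f3": 0.5, "f4": 2.0}
feature_sets = {
    "f1": ["berkeley"],
    "f3": ["professor", "berkeley", "professor"],
    "f4": ["berkeley"],
}
vector = combine_features(feature_sets, weights)
```

A keyword that appears in several feature sets accumulates weight from each of them, which is how evidence from the URL, the text, and the named entities reinforces one another.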

For the fused feature vector, the weight values are re-optimized with the TF-IDF statistical method. The final combined feature vector representing a person-related web page is therefore:

where tf_i is the frequency of occurrence of keyword k_i in the fused feature vector, N_d is the total number of web documents of the person to be disambiguated, and df_i is the number of documents in which keyword k_i appears.

S4, hierarchical clustering

The agglomerative hierarchical clustering algorithm (HAC) is used. The similarity between two documents is expressed by the cosine of the angle between their feature vectors, and the similarity between classes uses the average distance method. At the start of clustering, each document in the document set D = {d₁, d₂, ..., d_n} corresponding to a name is regarded as a class with a single member, forming the initial clustering C = {c₁, c₂, ..., c_n}. The similarity between each pair of classes (c_i, c_j) is calculated, and the two classes with the greatest similarity are merged into a new class c_m, forming a new clustering C = {c₁, c₂, ..., c_{n-1}} of D. These steps are repeated until the similarity between all classes is below the given similarity threshold β or all documents have been merged into one class.
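The clustering loop described above can be sketched as follows, with documents represented as sparse keyword-to-weight dictionaries. The function names are hypothetical, and the quadratic search for the best pair is chosen for clarity rather than efficiency.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts of keyword -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_link(ca, cb, vectors):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in ca for j in cb)
    return total / (len(ca) * len(cb))

def hac(vectors, beta):
    """Merge the most similar pair of clusters until no pair reaches threshold beta."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = average_link(clusters[a], clusters[b], vectors)
                if s > best:
                    best, pair = s, (a, b)
        if best < beta:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

vectors = [
    {"basketball": 1, "nba": 1},
    {"basketball": 1, "bulls": 1},
    {"professor": 1, "berkeley": 1},
    {"professor": 1, "statistics": 1},
]
result = hac(vectors, beta=0.3)
```

Each returned cluster is a list of document indices that are taken to refer to the same person; lowering β merges more aggressively, and β = 0 collapses everything into a single cluster.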

The present invention can carry out name disambiguation effectively and accurately. Taking the WePS data set as an example, the effectiveness of name disambiguation was tested with the above method. When only features f1 and f2 are used, the purity P is good but the inverse purity IP is low: although the personal information in f1 and f2 is fairly accurate, the amount of information is small and the data are sparse, which lowers the inverse purity. Disambiguation with feature f3 raises the inverse purity but lowers the purity, mainly because the text contains a large amount of personal information but also introduces some noise. Disambiguation with the named entity feature f4 also gives relatively high purity and relatively low inverse purity, possibly because the recognition accuracy of the NER tool results in too few named entities. The n-gram capitalization model feature is therefore introduced as a supplement to the named entity feature: named entities are mostly groups of words with initial capitals, so selecting more groups of consecutive words with initial capitals improves performance.

Different features have different characteristics, and combining them compensates for the shortcomings of using any single type of feature alone. Disambiguation based on f1+f2+f3+f4+f5 maintains an appropriate purity of the clustering result while also raising the inverse purity. The clustering performance of the different feature combinations is shown in Table 1.

Table 1 Clustering performance of different feature combinations

Cluster feature	Purity P	Inverse purity IP
f1+f2	0.80	0.63
f3	0.74	0.70
f4	0.78	0.65
f1+f2+f3+f4+f5	0.88	0.73
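The text does not define purity and inverse purity. The sketch below uses the definitions commonly applied to clustering evaluation, assuming hard (non-overlapping) clusters; whether the reported figures were computed exactly this way is an assumption, and the function names are hypothetical.

```python
def purity(clusters, gold):
    """Fraction of items that fall in the best-matching gold class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(set(c) & set(g)) for g in gold) for c in clusters) / n

def inverse_purity(clusters, gold):
    """Purity with the roles of system clusters and gold classes swapped."""
    return purity(gold, clusters)

# Documents 0 and 1 are one person, 2 and 3 another; the system over-splits.
gold = [[0, 1], [2, 3]]
clusters = [[0], [1], [2, 3]]
p = purity(clusters, gold)
ip = inverse_purity(clusters, gold)
```

Over-splitting keeps purity high and lowers inverse purity, which matches the pattern described for sparse feature sets such as f1+f2.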

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the present invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A name disambiguation method applied to Web people search, characterized by comprising:

S1, extracting the HTML source code of web pages and removing noise unrelated to person information;

S2, extracting a person web page feature set;

S3, generating, from the person web page feature set extracted in step S2, a combined feature vector representing a person-related web page;

S4, performing hierarchical clustering with an agglomerative hierarchical clustering algorithm to obtain the person web page clustering result.
2. The name disambiguation method applied to Web people search according to claim 1, characterized in that the noise unrelated to person information in step S1 includes at least: tags and scripts in the HTML source code, navigation menus that are useless for name disambiguation, and advertisements that are useless for name disambiguation.
3. The name disambiguation method applied to Web people search according to claim 2, characterized in that the person web page feature set in step S2 comprises: the web page URL, the web page title and abstract, the web page text, named entities, and the n-gram capitalization model.
4. The name disambiguation method applied to Web people search according to claim 3, characterized in that step S3 includes:

S31, modeling the person web page feature set extracted in step S2 with the vector space model to obtain the web page URL feature set, the web page title and abstract feature set, the feature set generated from the web page text, the feature set generated from named entities, and the feature set generated from the n-gram capitalization model;

S32, constructing the combined feature vector from the feature sets of step S31 by linear weighting.
5. The name disambiguation method applied to Web people search according to claim 4, characterized in that the weighting coefficient of each feature set in the linear weighting of step S32 is determined by how much that feature set contributes to indicating the correct name disambiguation.
6. The name disambiguation method applied to Web people search according to claim 5, characterized by further comprising re-optimizing the weight values of the combined feature vector of step S32 with the TF-IDF statistical method to obtain the final combined feature vector representing a person-related web page.
7. The name disambiguation method applied to Web people search according to claim 6, characterized in that the expression of the final combined feature vector representing a person-related web page is as follows:

where w_i denotes the redefined weight, 1≤i≤m'; m' is the number of feature words after the final combination; tf_i is the frequency of occurrence of keyword k_i in the combined feature vector; N_d is the total number of web documents of the person to be disambiguated; and df_i is the number of documents in which keyword k_i appears.
8. The name disambiguation method applied to Web people search according to claim 7, characterized in that