CN105095400A - Method for finding personal homepage

Info

Publication number: CN105095400A
Application number: CN201510394587.3A
Authority: CN
Inventors: 唐杰; 刘德兵; 杨宏; 袁慧
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2015-07-07
Filing date: 2015-07-07
Publication date: 2015-11-25
Anticipated expiration: 2035-07-07
Also published as: CN105095400B

Abstract

The invention discloses a method for finding a personal homepage, comprising the following steps: entering key information into a search engine to obtain search results, and using the search results closest to the key information as a data set; extracting part of the data text from the data set and labeling it; dividing the labeled data text into a training set and a test set; extracting training set feature information from the training set; building a model on the training set feature information to obtain a first model; extracting test set feature information from the test set; analyzing the test set feature information with the first model to obtain a prediction result; judging the prediction result; selecting an optimal model through iteration with 10-fold cross-validation; and using the optimal model to judge whether a search result is the personal homepage of a target person. The method is highly practical: during actual use, training samples can be collected and organized, and the training set can be updated and expanded, which further improves the practicality and accuracy of the method.

Description

Method for finding a personal homepage

Technical field

The present invention relates to the fields of computer technology and web information technology, and in particular to a method for finding a personal homepage.

Background technology

Expert finding is a very important aspect of the information retrieval field ^[1]. Many applications, from the expert database of the National Natural Science Foundation of China to reviewer recommendation systems for international conferences, as well as the doctor recommendation functions of the medical websites that ordinary people use, require the support of a large expert database. In particular, the government recently issued the "Notice of the General Office of the Ministry of Science and Technology on Improving and Supplementing Expert Information in the National Science and Technology Expert Database", which shows that improving expert information and building expert databases is of great significance. However, in building an expert database, especially a large one with more than ten thousand people, maintaining and updating the experts' personal information is a very important but very time-consuming and labor-intensive task. The accuracy and completeness of the experts' personal information have an important impact on the service quality of the expert database. With the spread and development of the Internet, many researchers have set up personal homepages and keep their personal information up to date there, which makes these pages an important channel for quickly obtaining experts' personal information. This patent proposes a highly accurate and adaptable method for automatically finding personal homepages. The method combines automatic information extraction with manual labeling, and can greatly improve the efficiency of updating experts' personal information in an expert database, thereby improving the service quality of the expert database.

Finding a personal homepage means: given a person's name and employer, find, in the vast amount of information on the Internet, the page that contains that person's personal information. This page may be a web page the person set up, or an introduction page set up by the institution where the person works. Some similar research already exists. For example, Zuo Nan et al. ^[2] mention using SVM to find pages useful for building a social network. Although the method is similar, a personal homepage is more specific than a useful page and harder to mine. The work of Tang Jie et al. ^[3,4] does address personal homepages specifically, but it stops at page classification; because of the length limit on Google snippets and possible errors in manual labeling, the extraction results still leave much room for improvement.

In view of the characteristics of personal homepages and the shortcomings of previous work, this patent proposes a personal homepage finding method that combines rules with machine learning. The method first uses the Google search engine to obtain a high-quality data source that may contain personal homepages, and then manually labels part of the data. Because any given web page either is or is not the expected personal homepage, finding a personal homepage can be treated as a binary classification problem. The patent uses the support vector machine (SVM), a classic classification algorithm, to train on the labeled data and obtain a satisfactory model, and finally applies predefined rule-based filtering to find the expected personal homepage. The method effectively addresses the problem that classification accuracy is not high enough because Google search results reflect web page content only to a limited extent.

List of references:

[1] Liu Jian, Li Qi, Liu Baohong, Zhang Yun. Expert finding method based on topic model. Journal of National University of Defense Technology, Vol. 35, No. 2, 2013.

[2] Zuo Nan, Li Juanzi, Tang Jie. Portrait photo extraction based on SVM. 3rd National Academic Conference on Information Retrieval and Content Security, 2007.

[3] J. Tang, L. Yao, D. Zhang, and J. Zhang. A combination approach to web user profiling. ACM TKDD, 5(1):1–44, 2010.

[4] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. KDD, pages 990–998, 2008.

Summary of the invention

The present invention aims to solve at least one of the technical problems described above.

To this end, the object of the present invention is to propose a method for finding a personal homepage.

To achieve the above object, an embodiment of the first aspect of the present invention discloses a method for finding a personal homepage, comprising the following steps:

A: enter key information into a search engine to obtain search results, and use a first predetermined number of the search results closest to the key information as a data set;

B: extract part of the data text from the data set and label it manually, the labels distinguishing whether each item is the personal homepage of a target person;

C: divide the labeled data text into a training set of a second predetermined number and a test set of a third predetermined number;

D: extract training set feature information from the training set;

E: model the training set feature information with SVM to obtain a first model;

F: extract test set feature information from the test set;

G: analyze the test set feature information with the first model to obtain a prediction result;

H: judge the prediction result according to a preset personal homepage judgment rule;

I: iterate steps C to H using ten-fold cross-validation and select an optimal model;

J: use the optimal model to judge whether a search result is the personal homepage of the target person.

According to the method for finding a personal homepage of the embodiment of the present invention, a person's personal homepage can be found quickly and accurately from simple given information, and the person's detailed information, including contact details (email, telephone, address, etc.), biography, research interests, projects undertaken, and publication list, can then be extracted by automatic algorithms or manual labeling. This detailed information is a necessary condition for building talent databases such as expert think tanks and review expert databases, and its completeness has an important impact on the effectiveness of application services such as expert recommendation and reviewer recommendation. Many large talent databases exist today; the review expert database of the Natural Science Foundation, for example, has more than 140,000 people, and updating and maintaining the information of these experts is a very important but very time-consuming and labor-intensive undertaking. Applying the personal homepage finding method of the embodiment of the present invention together with automatic information extraction algorithms can greatly improve the efficiency of updating personal information in a talent database, which is significant for keeping the information current and improving the service quality of the talent database.

In addition, the method for finding a personal homepage according to the above embodiment of the present invention may also have the following additional technical features:

Further, in step A, the key information comprises: a first search phrase, which comprises the target person's name and the target person's employer; a second search phrase, which comprises the target person's name and "homepage"; and a third search phrase, which comprises the target person's name and "email".

Further, in step D, the training set feature information comprises the TFIDF value of each word in the training set, where TFIDF is computed as:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

where t is a word, d is the document in which the word occurs, D is the whole corpus, tf denotes term frequency, and idf denotes inverse document frequency;

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

\mathrm{idf}_{i} = \log \frac{|D|}{|\{ j : t_{i} \in d_{j} \}|}

where, for word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and the web page snippets of the search results are treated as two mutually independent corpora, and TFIDF values are computed separately in their respective word spaces.

Further, the training set feature information also comprises part of speech: a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.

Further, the training set feature information also comprises other features, which comprise: whether the URL contains a noise word; whether the target person's name appears in the title; and the position at which the target person's name appears in the web page snippet.

Further, in step E, SVM-light is used to build the first model, and in step G, SVM-light and the first model are used to analyze the test set feature information.

Further, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced. H1: the web page snippet contains year, month, and day information; H2: the target person's name appears more than three times in the web page snippet; H3: the target person's name appears in the second half of the web page snippet and appears only among paper co-authors.

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawing, in which:

Fig. 1 is a flow chart of homepage extraction according to one embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.

In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "installed", "connected with", and "connected" are to be interpreted broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary, or it may be an internal connection between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

These and other aspects of the embodiments of the present invention will become clear with reference to the following description and drawings. The description and drawings specifically disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.

The method for finding a personal homepage according to the embodiment of the present invention is described below with reference to the accompanying drawing.

Fig. 1 is a flow chart of homepage extraction according to one embodiment of the present invention; please refer to Fig. 1.

1. Obtaining a high-quality data set

With the development of the Internet, more and more information can be obtained from the network without leaving home. Statistics show that a considerable proportion of researchers have their own personal homepages online. The researcher information listed on a personal homepage is a necessary condition for building talent databases such as expert think tanks and review expert databases, and its completeness has an important impact on the effectiveness of application services such as expert recommendation and reviewer recommendation, so how to obtain personal homepage data is crucial. Thanks to the development of search engines, these data can be obtained through sensible keyword searches. Popular search engines at present include Baidu, Bing, and Google. Considering the international nature of researchers, this patent uses Google, the largest search engine in the world, as the tool for obtaining the data set. Using the Google Search API with specific phrases as search keywords yields search results that may contain a researcher's homepage.

The interface address of the Google Search API is as follows:

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&hl=zh-CN&rsz=large&q=

Three key search phrases are used in this patent:

Researcher name + space + researcher's employer

Researcher name + space + "Homepage"

Researcher name + space + "email"

For each key phrase, the first page of Google search results for that phrase is fetched (at most 10 results). Each result contains three items: title, web page link, and web page snippet (Title, URL, Snippet).
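The three key phrases above can be sketched as follows. This is only an illustration: the function names `build_queries` and `build_request_urls` are hypothetical, the base URL is the one quoted in the description, and no request is actually sent.

```python
from urllib.parse import quote

# Interface address as quoted in the description; used here only to show
# how a query string could be appended. Its availability is not checked.
BASE_URL = ("http://ajax.googleapis.com/ajax/services/search/web"
            "?v=1.0&hl=zh-CN&rsz=large&q=")


def build_queries(name, employer):
    """Return the three key search phrases for one researcher."""
    return [
        f"{name} {employer}",   # name + space + employer
        f"{name} Homepage",     # name + space + "Homepage"
        f"{name} email",        # name + space + "email"
    ]


def build_request_urls(name, employer):
    """Percent-encode each phrase and append it to the base URL."""
    return [BASE_URL + quote(q) for q in build_queries(name, employer)]
```

Each returned URL corresponds to one key phrase; the first page of results for each (Title, URL, Snippet) would then be collected into the data set.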

The data set used in this patent consists of 1,000 people selected from the 140,000 experts of the National Natural Science Foundation of China, and covers multiple disciplines, including medicine, computer science, and natural science.

2. Data labeling

The fetched data are stored in a text file, one result per line. By comparing each line of data, the annotators determine whether it is the researcher's personal homepage: it is labeled 1 if it is, and -1 if it is not.

3. Data set splitting

The data need to be split when modeling with SVM. The modeling process requires equal numbers of positive and negative examples, but in the labeled data the negative examples far outnumber the positive ones. Therefore all positive examples in the data set are taken, and an equal number of negative examples is drawn at random; together they form the data set needed for SVM modeling.
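The balancing step can be sketched as below, assuming each labeled item is a `(label, record)` pair with label 1 or -1. The function name `balance` and the fixed random seed are illustrative choices, not part of the described method.

```python
import random


def balance(samples, seed=0):
    """Keep all positives and draw an equal number of negatives at random.

    samples: list of (label, record) pairs, label being 1 or -1.
    """
    positives = [s for s in samples if s[0] == 1]
    negatives = [s for s in samples if s[0] == -1]
    if len(negatives) < len(positives):
        # The description assumes negatives far outnumber positives.
        raise ValueError("not enough negative examples to balance")
    rng = random.Random(seed)
    balanced = positives + rng.sample(negatives, len(positives))
    rng.shuffle(balanced)  # avoid having all positives first
    return balanced
```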

The experiment is divided into a training process and a test process. Ten-fold cross-validation is used: the labeled data set is randomly divided into ten equal parts, and each time 9 parts are used as the training set and 1 part as the test set.
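A minimal sketch of the ten-fold split follows. The name `ten_fold_splits` and the seed are assumptions made for illustration; when the data set size is not a multiple of ten, the parts differ in size by at most one item.

```python
import random


def ten_fold_splits(samples, folds=10, seed=0):
    """Yield (training_set, test_set) pairs, one pair per fold."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into `folds` parts of near-equal size.
    parts = [shuffled[i::folds] for i in range(folds)]
    for i in range(folds):
        test_set = parts[i]
        training_set = [s for j, part in enumerate(parts) if j != i
                        for s in part]
        yield training_set, test_set
```

Each item appears in exactly one test set, so every labeled example is used for testing once and for training nine times.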

4. Feature extraction

The quality of feature selection has a direct impact on the results of a classification algorithm. In addition to using the TFIDF value of each word as a feature, this patent adds other key features, including part of speech and URL features.

TFIDF feature:

TFIDF is a commonly used weighting technique in information retrieval and text mining, used to assess how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but at the same time decreases as its frequency across the corpus rises. TFIDF is computed as:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

where t is a word, d is the document in which the word occurs, and D is the set of all documents, that is, the whole corpus. tf denotes term frequency (Term Frequency), and idf denotes inverse document frequency (Inverse Document Frequency).

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

\mathrm{idf}_{i} = \log \frac{|D|}{|\{ j : t_{i} \in d_{j} \}|}

For word i in any document j, the term frequency tf of i is the number of times n_{i,j} that the word occurs in document j divided by the total number of words in the document. The idf value of i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word. In this patent, the Title and Snippet parts of the search results are treated as two mutually independent corpora, and TFIDF values are computed separately in their respective word spaces. For Chinese content, ICTCLAS is used for word segmentation. ICTCLAS is a word segmentation tool released by the Chinese Academy of Sciences after more than ten years of research; it has won several awards and has many users. The latest version can be found at http://ictclas.nlpir.org/.
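The TFIDF computation described above can be sketched as follows. The sketch assumes the text has already been segmented into tokens (for Chinese content, by a tool such as ICTCLAS, which is not called here), and it assumes the natural logarithm, since the description does not state the base. The function name `tfidf` is illustrative.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Compute TFIDF weights for one corpus.

    corpus: list of documents, each a list of tokens.
    Returns one {word: weight} dict per document.
    """
    n_docs = len(corpus)
    # Number of documents that contain each word.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

Following the description, the titles and the snippets would each be passed as their own corpus, so that a word receives separate weights in the two word spaces.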

Part-of-speech feature:

In text analysis, part-of-speech features also have an important impact on classification. In this patent, ICTCLAS is used to perform part-of-speech analysis on the Title of each search result and to count the number of occurrences of each part of speech; the count for each part of speech is used as one feature.
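As a sketch of the counting step only, assume the tagger output for one Title is available as a list of `(word, tag)` pairs. The tag labels and the function name `pos_count_features` below are illustrative and are not taken from the ICTCLAS documentation.

```python
from collections import Counter


def pos_count_features(tagged_title, tagset):
    """Return one count per tag in `tagset`, in that fixed order.

    tagged_title: list of (word, tag) pairs for one search-result Title.
    tagset: ordered list of tags; its order defines the feature order.
    """
    counts = Counter(tag for _, tag in tagged_title)
    return [counts.get(tag, 0) for tag in tagset]
```

Using a fixed tag order keeps the feature positions identical across all search results, which the SVM input requires.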

Other features:

Whether the URL contains a sensitive keyword: the feature value is 1 if one appears and 0 otherwise. Sensitive keywords include file formats that do not match the expected result, such as pdf, xls, and doc; news-type websites such as baidu, weibo.com, 163.com, qq.com, and sohu.com; and keywords that often appear in results that interfere with the correct one, such as 360doc, renren, sina, download, and news.

Whether the researcher's name appears in the Title: the feature value is 1 if it appears and 0 otherwise.

The position of the researcher's name in the Snippet, either the first half or the second half: the feature value is 1 if the name appears in the first half (including the middle) and 0 if it appears in the second half.
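The three features above can be sketched as follows. Several details are assumptions not fixed by the description: keywords are matched as case-insensitive substrings of the URL, the position is taken from the first occurrence of the name, and a name that is absent from the Snippet yields 0. The name `other_features` is hypothetical.

```python
# Sensitive keywords listed in the description.
NOISE_WORDS = ["pdf", "xls", "doc", "baidu", "weibo.com", "163.com",
               "qq.com", "sohu.com", "360doc", "renren", "sina",
               "download", "news"]


def other_features(name, title, url, snippet):
    """Return [url_has_noise, name_in_title, name_in_first_half]."""
    url_lower = url.lower()
    # Assumption: plain substring match against the lowercased URL.
    has_noise = int(any(word in url_lower for word in NOISE_WORDS))
    name_in_title = int(name in title)
    pos = snippet.find(name)  # first occurrence, -1 if absent
    # First half includes the middle; an absent name is treated as 0.
    in_first_half = int(pos != -1 and pos <= len(snippet) / 2)
    return [has_noise, name_in_title, in_first_half]
```

Substring matching is deliberately simple and may over-match (for example, "doc" also matches "docs"); a stricter matcher could be substituted without changing the feature layout.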

5. SVM training and testing

SVM-light is used for training and testing in this patent. SVM-light is an open-source SVM tool developed by Joachims; because of its speed and high accuracy, it is widely used in research and in practical applications. A detailed description of SVM-light and its usage can be found at http://www.cs.cornell.edu/People/tj/svm_light/.

In the experiment, the svm_learn command is first used to learn a model from the feature file generated from the training set, and the svm_classify command is then used to test the resulting model on the test set and obtain the results.
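The feature file mentioned above can be sketched as follows, assuming the usual SVM-light input layout of one example per line, `<label> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The helper names and the file names in the comments are illustrative.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM-light input line.

    label: 1 or -1.
    features: {index: value} with 1-based integer indices.
    Zero-valued features are omitted (sparse format).
    """
    parts = [f"{label:+d}"]
    for index in sorted(features):      # indices must be increasing
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:.6g}")  # 6 significant digits
    return " ".join(parts)


def write_svmlight_file(path, rows):
    """Write (label, features) rows to a feature file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, features in rows:
            handle.write(to_svmlight_line(label, features) + "\n")

# Typical command-line use once the files exist (file names assumed):
#   svm_learn train.dat model
#   svm_classify test.dat model predictions
```

After `svm_classify` runs, the output file holds one predicted value per test example, which is the score used in the screening step.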

6. Screening personal homepages by combining SVM predictions with rules

Through repeated training, a model with comparatively good results is obtained. When this model is used to predict the test data, each search result receives a numerical value. The closer this value is to 1, the more likely the result is a positive example; the closer it is to -1, the more likely it is a negative example. Because manual labeling may introduce bias and Google search results reflect web page content only to a limited extent, the classification results are not fully satisfactory. To address this problem, this patent introduces some rules to assist in screening personal homepages.

1) The Snippet contains detailed date and time information

2) The name appears more than three times in the Snippet

3) The name appears in the second half of the Snippet and only among paper co-authors

For an example predicted as positive, if any one of the above situations occurs, 0.3 is subtracted from its predicted score. After this processing, all scores of the classification results are sorted and a threshold (for example, 0.6) is set; web pages whose score exceeds the threshold are considered the researcher's personal homepage.
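The score adjustment and threshold can be sketched as below. The rule checks themselves are passed in as precomputed values, because the description does not specify how they are detected. Two further assumptions are made: a score above 0 counts as a predicted positive, and the 0.3 penalty is applied once even when several rules match. The function names are hypothetical.

```python
PENALTY = 0.3     # amount subtracted when any rule matches
THRESHOLD = 0.6   # example threshold given in the description


def adjust_score(score, has_detailed_date, name_count, only_in_coauthors):
    """Subtract the penalty from a predicted-positive score if a rule hits.

    has_detailed_date: rule 1, Snippet contains detailed date information.
    name_count: occurrences of the name in the Snippet (rule 2: more than 3).
    only_in_coauthors: rule 3, name is in the second half of the Snippet
        and only among paper co-authors.
    """
    rule_hit = has_detailed_date or name_count > 3 or only_in_coauthors
    if score > 0 and rule_hit:   # assumption: score > 0 means positive
        return score - PENALTY
    return score


def is_homepage(score, has_detailed_date, name_count, only_in_coauthors):
    """Return True if the adjusted score exceeds the threshold."""
    adjusted = adjust_score(score, has_detailed_date, name_count,
                            only_in_coauthors)
    return adjusted > THRESHOLD
```

For instance, a result scored 0.8 whose Snippet repeats the name five times drops to 0.5 and is rejected, while the same score with no rule hit is accepted.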

In addition, other components and functions of the method for finding a personal homepage of the embodiment of the present invention are known to those skilled in the art and, to avoid redundancy, are not described again.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims

1. A method for finding a personal homepage, characterized by comprising the following steps:

A: enter key information into a search engine to obtain search results, and use a first predetermined number of the search results closest to the key information as a data set;

B: extract part of the data text from the data set and label it manually, the labels distinguishing whether each item is the personal homepage of a target person;

C: divide the labeled data text into a training set of a second predetermined number and a test set of a third predetermined number;

D: extract training set feature information from the training set;

E: model the training set feature information with SVM to obtain a first model;

F: extract test set feature information from the test set;

G: analyze the test set feature information with the first model to obtain a prediction result;

H: judge the prediction result according to a preset personal homepage judgment rule;

I: iterate steps C to H using ten-fold cross-validation and select an optimal model;

J: use the optimal model to judge whether a search result is the personal homepage of the target person.

2. The method for finding a personal homepage according to claim 1, characterized in that, in step A, the key information comprises:

a first search phrase, which comprises the target person's name and the target person's employer;

a second search phrase, which comprises the target person's name and "homepage"; and

a third search phrase, which comprises the target person's name and "email".

3. The method for finding a personal homepage according to claim 1, characterized in that, in step D, the training set feature information comprises the TFIDF value of each word in the training set, where TFIDF is computed as:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

\mathrm{idf}_{i} = \log \frac{|D|}{|\{ j : t_{i} \in d_{j} \}|}

where, for word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and the web page snippets of the search results are treated as two mutually independent corpora, and TFIDF values are computed separately in their respective word spaces.

4. The method for finding a personal homepage according to claim 3, characterized in that the training set feature information also comprises part of speech: a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.

5. The method for finding a personal homepage according to claim 3 or 4, characterized in that the training set feature information also comprises other features, which comprise:

whether the URL contains a noise word;

whether the target person's name appears in the title; and

the position at which the target person's name appears in the web page snippet.

6. The method for finding a personal homepage according to claim 1, characterized in that, in step E, SVM-light is used to build the first model, and in step G, SVM-light and the first model are used to analyze the test set feature information.

7. The method for finding a personal homepage according to claim 3, characterized in that, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced:

H1: the web page snippet contains year, month, and day information;

H2: the target person's name appears more than three times in the web page snippet;

H3: the target person's name appears in the second half of the web page snippet and appears only among paper co-authors.