CN105095400B

CN105095400B - Method for finding personal homepages

Info

Publication number: CN105095400B
Application number: CN201510394587.3A
Authority: CN
Inventors: 唐杰; 刘德兵; 杨宏; 袁慧
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2015-07-07
Filing date: 2015-07-07
Publication date: 2019-02-05
Anticipated expiration: 2035-07-07
Also published as: CN105095400A

Abstract

The invention discloses a method for finding personal homepages, comprising the following steps: entering key information into a search engine to obtain search results, and taking the search results closest to the key information as a data set; extracting part of the data text from the data set for labeling; dividing the labeled data text into a training set and a test set; extracting training set feature information from the training set; modeling the training set feature information to obtain a first model; extracting test set feature information from the test set; analyzing the test set feature information with the first model to obtain a prediction result; judging the prediction result; iterating by ten-fold cross-validation and selecting an optimal model; and using the optimal model to judge whether a search result is the personal homepage of the target person. The invention has the advantage of strong adaptability: the training set can be updated and expanded by collecting training samples during practical use, which improves the applicability and search accuracy of the method.

Description

Method for finding personal homepages

Technical field

The present invention relates to the fields of computer technology and web information technology, and in particular to a method for finding personal homepages.

Background art

Expert finding is a very important aspect of the information retrieval field [1]. Many applications, from the expert database commissioned by the National Natural Science Foundation and the reviewer recommendation systems of international conferences to the doctor recommendation functions of medical websites that ordinary people use, require the support of a huge expert database. In particular, the government recently issued the "Notice of the General Office of the Ministry of Science and Technology on Improving and Supplementing Expert Information in the National Science and Technology Expert Database", which shows that improving expert information and building expert databases is of great significance. However, in building an expert database, especially a large one with more than ten thousand people, maintaining and updating the experts' personal information is a very time-consuming and laborious but very important task. The accuracy and completeness of the experts' personal information have an important influence on the service quality of the expert database. With the spread and development of the Internet, many researchers have set up personal homepages and keep their personal information up to date, which provides an important channel for quickly obtaining experts' personal information. This patent proposes an automatic personal homepage finding method with high accuracy and strong adaptability. The method combines automatic information extraction techniques with manual labeling, and can greatly improve the efficiency of updating experts' personal information in an expert database, thereby improving the service quality of the expert database.

Personal homepage finding means that, for a person whose name and employer are given, the page containing that person's information is found in the massive information on the Internet. The page can be a web page the person created, or an introduction page set up by the person's employer. Some similar research already exists. For example, Zuo Nan et al. mention in their research using SVM to find pages useful for building social networks [2]. Although the method is similar, a personal homepage is more specific than a useful page and is harder to mine. The research of Tang Jie et al. [3,4] reaches the level of personal homepages but stops at page classification; moreover, because the length of Google snippets is limited and manual labeling may contain errors, the extraction results still need improvement.

In view of the characteristics of personal homepages and the shortcomings of previous work, this patent proposes a personal homepage finding method that combines rules with machine learning. The method first uses the Google search engine to obtain a high-quality data source that may contain personal homepages, and part of the data is then labeled manually. Because any web page either is or is not the desired personal homepage, finding a personal homepage can be regarded as a binary classification problem. In this patent, the support vector machine (SVM), a classic classification algorithm, is trained on the labeled data to obtain a satisfactory model, which is finally combined with predefined rule-based filtering to find the desired personal homepage. The method effectively addresses the problem of insufficient classification accuracy caused by the limited extent to which Google search results reflect web page content.

References:

[1] Liu Jian, Li Qi, Liu Baohong, Zhang Yun. Expert finding method based on topic model. Journal of National University of Defense Technology, Vol. 35, No. 2, 2013.

[2] Zuo Nan, Li Juanzi, Tang Jie. SVM-based portrait photo extraction. The Third National Conference on Information Retrieval and Content Security, 2007.

[3] J. Tang, L. Yao, D. Zhang, and J. Zhang. A combination approach to web user profiling. ACM TKDD, 5(1):1–44, 2010.

[4] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. KDD, pages 990–998, 2008.

Summary of the invention

The present invention aims to solve at least one of the above technical problems.

To this end, an object of the present invention is to propose a method for finding personal homepages.

To achieve the above object, an embodiment of the first aspect of the present invention discloses a method for finding personal homepages, comprising the following steps: A: entering key information into a search engine to obtain search results, and taking a first preset number of the search results closest to the key information as a data set; B: extracting part of the data text from the data set for manual labeling, to distinguish whether it is the personal homepage of the target person; C: dividing the labeled data text into a training set of a second preset number and a test set of a third preset number; D: extracting training set feature information from the training set; E: modeling the training set feature information using SVM to obtain a first model; F: extracting test set feature information from the test set; G: analyzing the test set feature information with the first model to obtain a prediction result; H: judging the prediction result according to preset personal homepage judgment rules; I: iterating steps C to H using ten-fold cross-validation and selecting an optimal model; J: using the optimal model to judge whether a search result is the personal homepage of the target person.

The method for finding personal homepages according to embodiments of the present invention can quickly and accurately find a person's homepage from simple given information. The person's details can then be extracted by automatic algorithms or manual labeling, including contact information (e-mail, telephone, address, etc.), a personal profile, research interests, projects undertaken, and a list of papers. These details are a necessary condition for building talent databases such as expert think tanks and review expert databases, and the completeness of this information has a major influence on the effectiveness of application services such as expert recommendation and reviewer recommendation. Many large talent databases now exist; for example, the review expert database of the National Natural Science Foundation has more than 140,000 people, and updating and maintaining the information of these experts is a very time-consuming and laborious but very important undertaking. Combined with automated information retrieval algorithms, the method of the embodiments can greatly improve the efficiency of updating personal information in a talent database, which is significant for keeping the information current and improving the service quality of the talent database.

In addition, the method for finding personal homepages according to the above embodiment of the present invention may also have the following additional technical features:

Further, in step A, the key information includes: a first search phrase, which includes the target person's name and the target person's affiliation; a second search phrase, which includes the target person's name and "homepage"; and a third search phrase, which includes the target person's name and "email".

Further, in step D, the training set feature information includes the TFIDF value of each word in the training set, where the TFIDF is calculated as:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where t is a word, d is the document in which the word appears, D is the entire corpus, tf is the term frequency, and idf is the inverse document frequency;

where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and web page summaries of the search results are treated as two independent corpora, and TFIDF values are calculated in their respective word spaces.

Further, the training set feature information also includes part of speech: a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.

Further, the training set feature information also includes other features, which include: whether the URL contains a noise word; whether the target person's name appears in the title; and the position at which the target person's name appears in the web page summary.

Further, in step E, the first model is built using SVM-light; in step G, the test set feature information is analyzed using SVM-light and the first model.

Further, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced. H1: the web page summary contains year, month and day information; H2: the target person's name occurs more than three times in the web page summary; H3: the target person's name appears in the second half of the web page summary and appears only among paper co-authors.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:

Fig. 1 is a flowchart of homepage extraction according to one embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and should not be construed as limiting it.

In the description of the present invention, it should be understood that orientation or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientation or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "installed", "linked" and "connected" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.

These and other aspects of the embodiments of the present invention will become clear with reference to the following description and drawings. The description and drawings specifically disclose some particular implementations of the embodiments to show some ways of implementing the principles of the embodiments, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

The method for finding personal homepages according to an embodiment of the present invention is described below with reference to the drawings.

Fig. 1 is a flowchart of homepage extraction according to one embodiment of the present invention; please refer to Fig. 1.

1. Obtaining a high-quality data set

With the development of the Internet, more and more information can be obtained from the network without leaving home. Statistics show that a considerable proportion of researchers have their own personal homepages online. The researcher information listed on these homepages is a necessary condition for building talent databases such as expert think tanks and review expert databases, and the completeness of this information has a major influence on the effectiveness of application services such as expert recommendation and reviewer recommendation, so obtaining personal homepage data is critical. Thanks to the development of search engines, this data can be obtained through well-chosen keyword searches. The popular search engines at present are Baidu, Bing and Google. Considering the international character of researchers, this patent uses Google, the world's largest search engine, as the tool for obtaining the data set. Using the Google Search API with particular phrases as search keywords, search results that may contain researchers' homepages are obtained.

The interface address of the Google Search API is as follows:

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&hl=zh-CN&rsz=large&q=

Three key search phrases are used in this patent:

Researcher's name + space + researcher's affiliation

Researcher's name+space+" Homepage "

Researcher's name+space+" email "

For each key phrase, the first page of results returned by the Google search engine for that phrase (at most 10 results) is crawled. Each result contains three items: the title, the web page link and the web page summary (Title, URL, Snippet).

The data set used in this patent was selected from the 140,000 experts of the National Natural Science Foundation of China, 1,000 people in total, covering multiple disciplines such as medicine, computer science and the natural sciences.
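As an illustration of how the three search phrases could be turned into request URLs, the following is a minimal sketch. The function names `build_queries` and `build_urls` are hypothetical and not part of the patent, and the sketch only constructs strings; it does not call the service, which may no longer be available.

```python
from urllib.parse import quote

# Interface address quoted in the description; the query text is appended after "q=".
BASE_URL = ("http://ajax.googleapis.com/ajax/services/search/web"
            "?v=1.0&hl=zh-CN&rsz=large&q=")


def build_queries(name, affiliation):
    """Return the three search phrases for one researcher (hypothetical helper)."""
    return [
        f"{name} {affiliation}",  # name + space + affiliation
        f"{name} Homepage",       # name + space + "Homepage"
        f"{name} email",          # name + space + "email"
    ]


def build_urls(name, affiliation):
    """Return one request URL per phrase, with the query text percent-encoded."""
    return [BASE_URL + quote(q) for q in build_queries(name, affiliation)]
```

Each URL would then be fetched and the Title, URL and Snippet of at most 10 first-page results stored for labeling.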

2. Data labeling

The crawled data is stored in a text file, with each result on one line. Annotators examine each line of data and judge whether it is the researcher's personal homepage, labeling it 1 if it is and -1 if it is not.

3. Data set splitting

The data needs to be split when modeling with SVM. The modeling process requires equal numbers of positive and negative examples, but negative examples far outnumber positive examples in the labeled data. Therefore, all positive examples in the data set are taken, and an equal number of negative examples is then randomly sampled; together they form the data set required for SVM modeling.

The experiment is divided into a training process and a test process. Ten-fold cross-validation is used: the labeled data set is randomly divided into ten equal parts, and each time 9 parts are used as the training set and 1 part as the test set.
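The balancing and ten-fold splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: each labeled example is a tuple whose first element is the label (1 or -1), and the helper names `balance` and `ten_folds` are hypothetical.

```python
import random


def balance(examples, seed=0):
    """Keep all positives (label 1) and an equal number of randomly sampled negatives (label -1)."""
    rng = random.Random(seed)
    pos = [e for e in examples if e[0] == 1]
    neg = [e for e in examples if e[0] == -1]
    sampled = rng.sample(neg, min(len(pos), len(neg)))
    data = pos + sampled
    rng.shuffle(data)  # random order so the ten parts are random
    return data


def ten_folds(data, k=10):
    """Yield (train, test) pairs; each of the k parts serves once as the test set."""
    parts = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, part in enumerate(parts) if j != i for e in part]
        yield train, parts[i]
```

With 20 positives and 200 negatives, for example, `balance` returns 40 examples and `ten_folds` yields ten splits of 36 training and 4 test examples each.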

4. Feature extraction

The quality of feature selection directly affects the results of the classification algorithm. In addition to using the TFIDF value of each word as a feature, this patent also adds other key features, including part of speech and URL.

TFIDF feature:

TFIDF is a common weighting technique for information retrieval and text mining, used to assess how important a word is to a document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. The TFIDF is calculated as:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where t is a word, d is the document in which the word appears, and D is the collection of all documents, that is, the entire corpus. tf is the term frequency and idf is the inverse document frequency.

For a word i in any document j, the term frequency tf of i is the number of times n_{i,j} the word occurs in document j divided by the total number of words in the document. The idf value of i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word. In this patent, the Title and Snippet parts of the search results are treated as two independent corpora, and TFIDF values are calculated in their respective word spaces. ICTCLAS is used to segment Chinese content. ICTCLAS is a word segmentation tool released by the Chinese Academy of Sciences after more than ten years of research; it has won multiple awards and has many users. The latest version can be found at http://ictclas.nlpir.org/.
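The TFIDF calculation can be sketched in a few lines. This is an illustrative sketch only: it assumes each document has already been segmented into a list of tokens (for example by a word segmenter), it uses the natural logarithm because the description does not state the base, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(corpus):
    """corpus: list of documents, each a list of tokens.

    Returns one {word: weight} dict per document, where
    weight = (count in document / words in document) * log(documents / documents containing word).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

Because Title and Snippet are treated as two independent corpora, the function would be called once on the list of tokenized titles and once on the list of tokenized snippets.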

Part-of-speech features:

In text analysis, part-of-speech features also have an important influence on classification. In this patent, ICTCLAS is used to perform part-of-speech analysis on the Title of each search result and count the number of occurrences of each part of speech; the count for each part of speech is used as one feature.
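A sketch of the part-of-speech counting follows. It assumes the title has already been tagged by a lexical analyzer and is supplied as (word, tag) pairs; the tag names in the usage note are illustrative, and `pos_count_features` is a hypothetical helper rather than part of any tagger's API.

```python
from collections import Counter


def pos_count_features(tagged_title, tagset):
    """Return one count per tag in tagset, in tagset order.

    tagged_title: list of (word, tag) pairs produced by a part-of-speech tagger.
    tagset: fixed list of tags, so every title yields a vector of the same length.
    """
    counts = Counter(tag for _, tag in tagged_title)
    return [counts.get(tag, 0) for tag in tagset]
```

For instance, a title tagged as two person-name tokens and one noun, with a tag set of noun, person name and verb, would produce the vector [1, 2, 0].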

Other features:

Whether the URL contains a sensitive keyword: the feature value is 1 if one appears and 0 if not. Sensitive keywords include results in unsuitable formats such as pdf, xls and doc; news-type websites such as baidu, weibo.com, 163.com, qq.com and sohu.com; and keywords that frequently interfere with correct results, such as 360doc, renren, sina, download and news.

Whether the researcher's name appears in the Title: the feature value is 1 if it appears and 0 if not.

The position of the researcher's name in the Snippet, that is, whether it appears in the first half or the second half: the feature value is 1 if it appears in the first half (including the exact middle) and 0 if it appears in the second half.
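The three features above could be computed roughly as follows. This is a sketch under assumptions: matching is done by simple substring search, the position is taken from the first occurrence of the name, and a name that does not occur in the Snippet at all is given the value 0, a case the description does not cover. The helper name `other_features` is hypothetical.

```python
# Sensitive keywords listed in the description.
NOISE_WORDS = ["pdf", "xls", "doc", "baidu", "weibo.com", "163.com", "qq.com",
               "sohu.com", "360doc", "renren", "sina", "download", "news"]


def other_features(name, title, url, snippet):
    """Return [url_has_noise_word, name_in_title, name_in_first_half_of_snippet]."""
    url_lower = url.lower()
    has_noise = int(any(word in url_lower for word in NOISE_WORDS))
    name_in_title = int(name in title)
    pos = snippet.find(name)  # -1 when the name is absent
    # First half includes the exact middle; absence is treated as 0 (assumption).
    in_first_half = int(pos != -1 and pos <= len(snippet) / 2)
    return [has_noise, name_in_title, in_first_half]
```

Plain substring matching is deliberately crude: "doc" also matches "docs", so a real list would likely need tuning against labeled data.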

5. SVM training and testing

SVM-light is used for training and testing in this patent. SVM-light is an open-source SVM tool developed by Joachims. Because it is fast and accurate, it is widely used in research and practical applications. A detailed description of SVM-light and instructions for its use can be found at http://www.cs.cornell.edu/People/tj/svm_light/.

In the experiment, the svm_learn command is first used to learn a model from the feature file generated from the training set, and the svm_classify command is then used to test the resulting model on the test set and obtain the results.
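SVM-light reads one example per line in the form `<label> <index>:<value> ...`, with feature indices in ascending order. The sketch below shows one way to write such lines; the helper name `to_svmlight_line` and the file names in the comments are illustrative assumptions, not taken from the patent.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM-light input line.

    label: 1 or -1.
    features: {feature_index: value} with 1-based integer indices.
    Zero-valued features are omitted because the format is sparse.
    """
    pairs = " ".join(f"{index}:{value:g}"
                     for index, value in sorted(features.items())
                     if value != 0)
    return f"{label:+d} {pairs}".rstrip()


# Typical command-line use once the files are written (file names are examples):
#   svm_learn train.dat model
#   svm_classify test.dat model predictions
```

Here `svm_classify` writes one prediction value per test line to the predictions file, which is the score used in the rule-based screening step.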

6. Combining SVM predictions with rules to screen personal homepages

Through continuous training and learning, a model with relatively good results is obtained. When the model is used to predict the test data, each search result receives a numerical value. The closer the value is to 1, the more likely the result is a positive example; the closer it is to -1, the more likely it is a negative example. Because manual labeling may contain deviations, and because Google search results reflect web page content only to a limited extent, the classification results are not entirely satisfactory. To address this, this patent introduces several rules to assist in screening personal homepages.

1) The Snippet contains detailed date (year, month and day) information

2) The name occurs more than three times in the Snippet

3) The name appears in the second half of the Snippet and only among paper co-authors

For a predicted positive example, if any one of the above situations occurs, 0.3 is subtracted from its prediction score. After this processing, all scores of the classification results are sorted and a threshold (for example 0.6) is set; web pages whose score is greater than the threshold are considered the researcher's personal homepage.

In addition, other components and functions of the method for finding personal homepages of the embodiments of the present invention are known to those skilled in the art and, to reduce redundancy, are not described here.
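The rule-based adjustment and thresholding can be sketched as below. Several parts are assumptions made for illustration: the date pattern is one possible regular expression, "predicted positive" is taken to mean a score above 0, and whether the name occurs only among paper co-authors is passed in as a flag (`only_in_coauthors`) because the description does not say how it is detected. The function names are hypothetical.

```python
import re

# One possible pattern for year-month-day dates such as 2015-07-07 or 2015年7月7日 (assumption).
DATE_RE = re.compile(r"\d{4}\s*[-/.年]\s*\d{1,2}\s*[-/.月]\s*\d{1,2}")


def adjusted_score(score, name, snippet, only_in_coauthors=False, penalty=0.3):
    """Subtract the penalty once from a positive score if any of the three rules fires."""
    if score <= 0:  # rules apply only to predicted positives
        return score
    has_date = DATE_RE.search(snippet) is not None           # rule 1
    too_many = snippet.count(name) > 3                       # rule 2
    pos = snippet.find(name)
    late_coauthor = pos > len(snippet) / 2 and only_in_coauthors  # rule 3
    if has_date or too_many or late_coauthor:
        return score - penalty
    return score


def select_homepages(scored, threshold=0.6):
    """scored: list of (score, url) pairs. Return those above the threshold, best first."""
    return sorted((item for item in scored if item[0] > threshold), reverse=True)
```

The penalty is applied once even when several rules fire, following the wording "if any one of the above situations occurs".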

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for finding personal homepages, characterized by comprising the following steps:

A: entering key information into a search engine to obtain search results, and taking a first preset number of the search results closest to the key information as a data set, wherein the search results comprise the first page of the search results pages;

B: extracting part of the data text from the data set for manual labeling, to distinguish whether it is the personal homepage of the target person;

C: dividing the labeled data text into a training set of a second preset number and a test set of a third preset number;

D: extracting training set feature information from the training set, wherein the training set feature information includes the TFIDF value of each word in the training set, and the TFIDF is calculated as:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where t is a word, d is the document in which the word appears, D is the entire corpus, tf is the term frequency, and idf is the inverse document frequency;

where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and web page summaries of the search results in the training set are treated as two independent corpora, and TFIDF values are calculated in their respective word spaces; the training set feature information also includes part of speech, wherein a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result in the training set and the number of occurrences of each part of speech is counted; the training set feature information also includes other features, which include: whether the URL contains a noise word; whether the target person's name appears in the title; and the position at which the target person's name appears in the web page summary;

E: modeling the training set feature information using SVM to obtain a first model;

F: extracting test set feature information from the test set;

G: analyzing the test set feature information with the first model to obtain a prediction result;

H: judging the prediction result according to preset personal homepage judgment rules;

I: iterating steps C to H using ten-fold cross-validation and selecting an optimal model;

J: using the optimal model to judge whether a search result in the test set is the personal homepage of the target person.

2. The method for finding personal homepages according to claim 1, characterized in that, in step A, the key information includes:

a first search phrase, which includes the target person's name and the target person's affiliation;

a second search phrase, which includes the target person's name and "homepage"; and

a third search phrase, which includes the target person's name and "email".

3. The method for finding personal homepages according to claim 1, characterized in that, in step E, the first model is built using SVM-light, and in step G, the test set feature information is analyzed using SVM-light and the first model.

4. The method for finding personal homepages according to claim 1, characterized in that, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced,

H1: the web page summary contains year, month and day information;

H2: the target person's name occurs more than three times in the web page summary;

H3: the target person's name appears in the second half of the web page summary and appears only among paper co-authors.