CN105095400A - Method for finding personal homepage - Google Patents


Info

Publication number
CN105095400A
Authority
CN
China
Prior art keywords
personal homepage
target person
word
training set
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510394587.3A
Other languages
Chinese (zh)
Other versions
CN105095400B (en)
Inventor
唐杰
刘德兵
杨宏
袁慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201510394587.3A
Publication of CN105095400A
Application granted
Publication of CN105095400B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for finding a personal homepage, comprising the following steps: entering key information in a search engine to obtain search results, and using the search results closest to the key information as a data set; extracting part of the data text from the data set and annotating it; dividing the annotated data text into a training set and a test set; extracting training set feature information from the training set; building a model on the training set feature information to obtain a first model; extracting test set feature information from the test set; analyzing the test set feature information with the first model to obtain a prediction result; judging the prediction result; selecting an optimal model by iterating with 10-fold cross-validation; and using the optimal model to judge whether a search result is the personal homepage of a target person. The method is highly practical: during actual use, training samples can be collected and organized and the training set can be updated and expanded, which further improves the practicality and accuracy of the method.

Description

Method for finding a personal homepage
Technical field
The present invention relates to the fields of computer technology and web information technology, and in particular to a method for finding a personal homepage.
Background
Expert finding is an important topic in information retrieval [1]. Many applications depend on a large expert database, from the expert database of the National Natural Science Foundation of China and the reviewer recommendation systems of international conferences to the doctor recommendation features of medical websites used by the general public. The government has also recently issued the "Notice of the General Office of the Ministry of Science and Technology on Improving and Supplementing Expert Information in the National Science and Technology Expert Database", which shows the importance attached to improving expert information and building expert databases. However, when building an expert database, especially a large one covering more than ten thousand people, maintaining and updating the experts' personal information is an important task that costs a great deal of time and effort. The accuracy and completeness of that information directly affect the service quality of the database. With the spread and development of the Internet, many researchers have set up personal homepages and keep their personal information up to date there, which makes homepages an important channel for quickly obtaining experts' personal information. This patent proposes a highly accurate and adaptable method for automatically finding personal homepages. The method combines automatic information extraction with manual annotation and can greatly improve the efficiency of updating experts' personal information in an expert database, thereby improving its service quality.
Finding a personal homepage means the following: given a person's name and employer, locate, among the vast amount of information on the Internet, the page that contains that person's information. The page may be one the person built themselves, or a profile page set up by the institution where they work. Some similar research exists. For example, Zuo Nan et al. [2] use an SVM to find pages that are useful for building a social network. Although the method is similar, a personal homepage is more specific than a "useful page" and harder to mine. The work of Tang Jie et al. [3, 4] does address personal homepages specifically, but it stops at page classification, and because Google snippets are limited in length and manual annotation may contain errors, the extraction results still leave considerable room for improvement.
To address the characteristics of personal homepages and the shortcomings of earlier work, this patent proposes a personal homepage finding method that combines rules with machine learning. The method first uses the Google search engine to obtain a high-quality data source likely to contain personal homepages, and part of the data is then annotated manually. Since any given web page either is or is not the desired personal homepage, finding personal homepages can be treated as a binary classification problem. The patent trains a support vector machine (SVM), a classic classification algorithm, on the annotated data to obtain a reasonably good model, and then applies predefined rule-based filtering to the model's predictions to find the desired personal homepage. This effectively addresses the insufficient classification accuracy caused by the limited extent to which Google search results reflect the content of a web page.
References:
[1] Liu Jian, Li Qi, Liu Baohong, Zhang Yun. Expert finding method based on topic models. Journal of National University of Defense Technology, Vol. 35, No. 2, 2013.
[2] Zuo Nan, Li Juanzi, Tang Jie. SVM-based portrait photo extraction. 3rd National Conference on Information Retrieval and Content Security, 2007.
[3] J. Tang, L. Yao, D. Zhang, and J. Zhang. A combination approach to web user profiling. ACM TKDD, 5(1):1–44, 2010.
[4] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. KDD, pages 990–998, 2008.
Summary of the invention
The present invention aims to solve at least one of the technical problems described above.
To this end, the object of the present invention is to propose a method for finding a personal homepage.
To achieve this object, an embodiment of a first aspect of the present invention discloses a method for finding a personal homepage, comprising the following steps:
A: entering key information in a search engine to obtain search results, and taking a first predetermined number of the search results closest to the key information as a data set;
B: extracting part of the data text from the data set and annotating it manually to mark whether each item is the personal homepage of the target person;
C: dividing the annotated data text into a training set of a second predetermined number and a test set of a third predetermined number;
D: extracting training set feature information from the training set;
E: modelling the training set feature information with an SVM to obtain a first model;
F: extracting test set feature information from the test set;
G: analyzing the test set feature information with the first model to obtain a prediction result;
H: judging the prediction result according to a preset personal homepage judgment rule;
I: iterating steps C to H using ten-fold cross-validation and selecting an optimal model;
J: using the optimal model to judge whether a search result is the personal homepage of the target person.
With the method for finding a personal homepage according to embodiments of the present invention, a person's homepage can be found quickly and accurately from simple given information, and the person's details can then be extracted by an automatic algorithm or by manual annotation, including contact details (email, telephone, address, etc.), biography, research interests, projects undertaken and publication list. These details are essential for building talent databases such as expert think tanks and review expert databases, and their completeness has an important influence on application services such as expert recommendation and reviewer recommendation. Many large talent databases exist today; the review expert database of the Natural Science Foundation, for example, covers more than 140,000 people, and keeping these experts' information up to date is an important undertaking that costs a great deal of time and effort. Applying the method of the embodiments together with automatic information extraction algorithms can greatly improve the efficiency of updating personal information in a talent database, which is significant for keeping the information current and improving service quality.
In addition, the method for finding a personal homepage according to the above embodiment of the present invention may have the following additional technical features:
Further, in step A, the key information comprises: a first search phrase comprising the name of the target person and the target person's employer; a second search phrase comprising the name of the target person and "homepage"; and a third search phrase comprising the name of the target person and "email".
Further, in step D, the training set feature information comprises the TFIDF value of each word in the training set, where TFIDF is computed as:
tfidf(t,d,D)=tf(t,d)*idf(t,D)
where t is a word, d is the document in which the word occurs, D is the whole corpus, tf is the term frequency, and idf is the inverse document frequency;
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that the word occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word. The titles and the web page summaries of the search results are treated as two independent corpora, and TFIDF values are computed separately in their respective word spaces.
Further, the training set feature information also comprises part of speech: a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.
Further, the training set feature information also comprises other features, including: whether the URL contains a noise word; whether the name of the target person appears in the title; and the position at which the name of the target person appears in the web page summary.
Further, in step E, SVM-light is used to build the first model, and in step G, SVM-light and the first model are used to analyze the test set feature information.
Further, in step H, the personal homepage judgment rule is: if any of the following situations occurs, the weight of the prediction result is reduced. H1: the web page summary contains year, month and day information; H2: the name of the target person appears more than three times in the web page summary; H3: the name of the target person appears in the second half of the web page summary and only among the co-authors of a paper.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawing, in which:
Fig. 1 is a flowchart of homepage extraction according to one embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the accompanying drawing, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawing are exemplary, serve only to explain the invention, and are not to be construed as limiting it.
In the description of the invention, it should be understood that orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawing. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected" and "coupled" are to be understood broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the specific circumstances.
These and other aspects of the embodiments of the invention will become clear with reference to the following description and the drawing. The description and drawing disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
The method for finding a personal homepage according to embodiments of the invention is described below with reference to the accompanying drawing.
Please refer to Fig. 1, the flowchart of homepage extraction according to one embodiment of the invention.
One, obtaining a high-quality data set
As the Internet develops, more and more information can be obtained from the network without leaving home. Statistics show that a considerable share of researchers have their own personal homepages online. The researcher information listed there is essential for building talent databases such as expert think tanks and review expert databases, and its completeness strongly affects application services such as expert recommendation and reviewer recommendation, so obtaining personal homepage data is critical. Thanks to the development of search engines, this data can be obtained through well-chosen keyword queries. Popular search engines today include Baidu, Bing and Google; considering that researchers are international, this patent uses Google, the world's largest search engine, as the tool for obtaining the data set. Using the Google Search API with specific phrases as search keywords, search results that may contain a researcher's homepage are obtained.
The endpoint address of the Google Search API is as follows:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&hl=zh-CN&rsz=large&q=
Three key search phrases are used in the patent:
Researcher name + space + researcher's employer
Researcher name + space + "Homepage"
Researcher name + space + "email"
For each key phrase, the first page of Google results returned for that phrase is fetched (at most 10 results). Each result contains three items: title, web page link and web page summary (Title, URL, Snippet).
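The query construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `build_queries` and `build_request_urls` are hypothetical, and the endpoint string is simply the one quoted in the text, which may no longer be served.

```python
from urllib.parse import quote

# Endpoint exactly as quoted in the description; it may no longer be available.
SEARCH_ENDPOINT = (
    "http://ajax.googleapis.com/ajax/services/search/web"
    "?v=1.0&hl=zh-CN&rsz=large&q="
)


def build_queries(name: str, employer: str) -> list[str]:
    """Return the three key search phrases for one researcher."""
    return [
        f"{name} {employer}",   # name + space + employer
        f"{name} Homepage",     # name + space + "Homepage"
        f"{name} email",        # name + space + "email"
    ]


def build_request_urls(name: str, employer: str) -> list[str]:
    """URL-encode each phrase and append it to the endpoint's q= parameter."""
    return [SEARCH_ENDPOINT + quote(q) for q in build_queries(name, employer)]
```

Each returned URL corresponds to one request whose first page of results (Title, URL, Snippet) would then be stored for annotation.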
The data set used in this patent consists of 1,000 people selected from the 140,000 experts of the National Natural Science Foundation of China, covering multiple disciplines such as medicine, computer science and the natural sciences.
Two, data annotation
The fetched data is stored in a text file, one result per line. Annotators check each line and decide whether it is the personal homepage of the researcher in question, labelling it 1 if it is and -1 if it is not.
Three, data set splitting
The data must be split when modelling with SVM. The modelling process requires equal numbers of positive and negative examples, but in the annotated data negative examples far outnumber positive ones. Therefore all positive examples in the data set are taken, an equal number of negative examples is drawn at random, and the two are combined into the data set used for SVM modelling.
The experiment is divided into a training process and a test process. Ten-fold cross-validation is used: the annotated data set is randomly divided into ten equal parts, and each time 9 parts serve as the training set and 1 part as the test set.
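The balancing and ten-fold split described above could look roughly like this. It is a sketch under assumptions: each example is taken to be a `(item, label)` pair with label 1 or -1, and the function names and the fixed random seed are illustrative choices that do not come from the patent.

```python
import random


def balance(examples, seed=0):
    """Keep all positives (label 1) and an equal-sized random sample of negatives (label -1)."""
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == -1]
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)  # random order, so the folds below are random parts
    return balanced


def ten_folds(examples, k=10):
    """Yield (train, test) pairs; each of the k parts serves as the test set once."""
    parts = [examples[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, part in enumerate(parts) if j != i for e in part]
        yield train, parts[i]
```

When the number of examples is not a multiple of ten, the parts differ in size by at most one.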
Four, feature extraction
The quality of feature selection directly affects the results of a classification algorithm. Besides using the TFIDF value of each word as a feature, this patent also adds other key features such as part of speech and URL.
TFIDF feature:
TFIDF is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it occurs in the document, but decreases in inverse proportion to its frequency across the corpus. TFIDF is computed as:
tfidf(t,d,D)=tf(t,d)*idf(t,D)
where t is a word, d is the document in which the word occurs, and D is the set of all documents, i.e. the whole corpus. tf denotes term frequency and idf denotes inverse document frequency.
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
For a word i in any document j, the term frequency tf of i is the number of times n_{i,j} that the word occurs in document j divided by the total number of words in the document. The idf value of i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word. In this patent the Title and Snippet parts of the search results are treated as two independent corpora, and TFIDF values are computed separately in their respective word spaces. Chinese content is segmented with ICTCLAS, a word segmentation tool developed by the Chinese Academy of Sciences over more than ten years of research; it has won several awards and has many users. The latest version can be found at http://ictclas.nlpir.org/.
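The TFIDF computation defined above can be illustrated with a short sketch. It assumes each document has already been tokenized into a list of words (the patent uses ICTCLAS for Chinese text; any tokenizer would do here), and it uses the natural logarithm, since the text does not state the base. The function name `tfidf_corpus` is hypothetical.

```python
import math
from collections import Counter


def tfidf_corpus(docs):
    """docs: list of token lists. Returns one {term: tfidf} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = n_ij / sum_k n_kj ; idf = log(|D| / |{j : t_i in d_j}|)
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights
```

To follow the two-corpus design, the function would be called once on the list of tokenized titles and once on the list of tokenized snippets, so that the two sets of weights live in separate word spaces.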
Part-of-speech feature:
In text analysis, part-of-speech features also have an important influence on classification. This patent uses ICTCLAS to perform part-of-speech analysis on the Title of each search result and counts the occurrences of each part of speech; the count for each part of speech is used as one feature.
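As a rough illustration of the part-of-speech counts, the sketch below assumes the title has already been tagged by an external tool and arrives as `(word, tag)` pairs. The function name and the example tags are illustrative assumptions; the actual tag inventory depends on the tagger used.

```python
from collections import Counter


def pos_count_features(tagged_title, tagset):
    """Count how often each tag in `tagset` occurs in one tagged title.

    tagged_title: list of (word, tag) pairs produced by an external tagger.
    tagset: fixed, ordered list of tags, so every title yields a vector of equal length.
    """
    counts = Counter(tag for _word, tag in tagged_title)
    return [counts.get(tag, 0) for tag in tagset]
```

Fixing the order of `tagset` in advance keeps each part of speech at the same feature position across all search results.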
Other features:
Whether the URL contains a sensitive keyword: the feature value is 1 if one appears and 0 otherwise. Sensitive keywords include file formats that do not match the desired result, such as pdf, xls and doc; news and portal sites such as baidu, weibo.com, 163.com, qq.com and sohu.com; and 360doc, renren, sina, download and news, strings that frequently indicate results interfering with the correct one.
Whether the researcher's name appears in the Title: the feature value is 1 if it appears and 0 otherwise.
Whether the researcher's name appears in the first or second half of the Snippet: the feature value is 1 if it appears in the first half (including the middle) and 0 if it appears in the second half.
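The three features above might be computed as in the following sketch. Several details are assumptions rather than statements from the patent: keywords are matched as plain substrings of the lower-cased URL, the position feature uses the first occurrence of the name, and a snippet that does not contain the name at all is given the value 0. The function name `other_features` is hypothetical.

```python
# Sensitive keywords as listed in the description.
NOISE_WORDS = ("pdf", "xls", "doc", "baidu", "weibo.com", "163.com", "qq.com",
               "sohu.com", "360doc", "renren", "sina", "download", "news")


def other_features(name, title, url, snippet):
    """Return [url_has_noise_word, name_in_title, name_in_first_half_of_snippet]."""
    url_lower = url.lower()
    has_noise = int(any(word in url_lower for word in NOISE_WORDS))
    name_in_title = int(name in title)
    pos = snippet.find(name)  # index of the first occurrence, -1 if absent
    # 1 if the name starts in the first half (midpoint included), else 0.
    in_first_half = int(pos != -1 and pos <= len(snippet) / 2)
    return [has_noise, name_in_title, in_first_half]
```

Plain substring matching is deliberately crude; for instance, "doc" would also match a URL path containing "docs".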
Five, SVM training and testing
This patent uses SVM-light for training and testing. SVM-light is an open-source SVM tool developed by Joachims; because it is fast and accurate, it is widely used in research and in practice. A detailed description and usage instructions can be found at http://www.cs.cornell.edu/People/tj/svm_light/.
In the experiment, the svm_learn command first learns a model from the feature file generated from the training set; the svm_classify command then applies the learned model to the test set to obtain the results.
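A feature file for SVM-light can be produced along the following lines. The sketch assumes each example is a label (1 or -1) together with a dense list of feature values, and writes the sparse `<label> <index>:<value>` line format that SVM-light reads, with 1-based feature indices in increasing order. The helper names and the file names in the comments are placeholders.

```python
def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based indices, zeros omitted)."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)


def write_svmlight(path, examples):
    """Write (label, features) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for label, features in examples:
            fh.write(to_svmlight_line(label, features) + "\n")


# Training and prediction are then run from a shell (file names are placeholders):
#   svm_learn train.dat model.dat
#   svm_classify test.dat model.dat predictions.txt
```

The predictions file contains one real-valued score per test example, which is the value the rule-based screening step operates on.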
Six, combining SVM predictions with rules to screen personal homepages
Through repeated training, a model with reasonably good results is obtained. When this model predicts on the test data, each search result receives a numerical value. The closer the value is to 1, the more likely the result is a positive example; the closer it is to -1, the more likely it is a negative example. Because manual annotation may introduce bias and Google search results reflect web page content only to a limited extent, the classification results are still not fully satisfactory. To address this, the patent introduces some rules to assist in screening personal homepages:
1) The Snippet contains detailed date (hour and minute) information
2) The name appears more than three times in the Snippet
3) The name appears in the second half of the Snippet and only among the co-authors of a paper
For a predicted positive example, if any of the above situations occurs, 0.3 is subtracted from its prediction score. After this processing, all scores of the classification results are sorted and a threshold is set (for example 0.6); web pages whose score exceeds the threshold are regarded as the researcher's personal homepage.
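The rule-based adjustment can be sketched as below. The penalty of 0.3 and the example threshold of 0.6 come from the text; everything else is an assumption made for illustration: a "predicted positive" is taken to be a score above 0, the date rule is approximated by a simple year-month-day pattern, the penalty is applied once even if several rules fire, and rule 3 is passed in as a flag computed elsewhere because the text does not say how co-author lists are recognised.

```python
import re

# Assumed date pattern, e.g. 2015-07-07, 2015/7/7 or 2015年7月7日.
DATE_RE = re.compile(r"\d{4}\s*[-/年.]\s*\d{1,2}\s*[-/月.]\s*\d{1,2}")
PENALTY = 0.3    # value given in the description
THRESHOLD = 0.6  # example threshold given in the description


def adjusted_score(score, name, snippet, only_coauthor_in_second_half=False):
    """Subtract the penalty from a predicted positive if any rule fires."""
    if score <= 0:  # the rules apply to predicted positives only
        return score
    rule_hit = (
        DATE_RE.search(snippet) is not None      # rule 1: date information
        or snippet.count(name) > 3               # rule 2: name more than three times
        or only_coauthor_in_second_half          # rule 3: supplied by the caller
    )
    return score - PENALTY if rule_hit else score


def is_homepage(score, name, snippet, only_coauthor_in_second_half=False):
    """A result counts as a homepage if its adjusted score exceeds the threshold."""
    return adjusted_score(score, name, snippet, only_coauthor_in_second_half) > THRESHOLD
```

Scores that land exactly on the threshold after subtraction are subject to floating-point rounding, so a real implementation may want an explicit tolerance.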
In addition, other components and effects of the method for finding a personal homepage according to embodiments of the invention are known to those skilled in the art and, to reduce redundancy, are not described here.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (7)

1. A method for finding a personal homepage, characterized by comprising the following steps:
A: entering key information in a search engine to obtain search results, and taking a first predetermined number of the search results closest to the key information as a data set;
B: extracting part of the data text from the data set and annotating it manually to mark whether each item is the personal homepage of the target person;
C: dividing the annotated data text into a training set of a second predetermined number and a test set of a third predetermined number;
D: extracting training set feature information from the training set;
E: modelling the training set feature information with an SVM to obtain a first model;
F: extracting test set feature information from the test set;
G: analyzing the test set feature information with the first model to obtain a prediction result;
H: judging the prediction result according to a preset personal homepage judgment rule;
I: iterating steps C to H using ten-fold cross-validation and selecting an optimal model;
J: using the optimal model to judge whether a search result is the personal homepage of the target person.
2. The method for finding a personal homepage according to claim 1, characterized in that, in step A, the key information comprises:
a first search phrase comprising the name of the target person and the target person's employer;
a second search phrase comprising the name of the target person and "homepage"; and
a third search phrase comprising the name of the target person and "email".
3. The method for finding a personal homepage according to claim 1, characterized in that, in step D, the training set feature information comprises the TFIDF value of each word in the training set, where TFIDF is computed as:
tfidf(t,d,D)=tf(t,d)*idf(t,D)
where t is a word, d is the document in which the word occurs, D is the whole corpus, tf is the term frequency, and idf is the inverse document frequency;
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that the word occurs in document j divided by the total number of words in the document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and the web page summaries of the search results are treated as two independent corpora, and TFIDF values are computed separately in their respective word spaces.
4. The method for finding a personal homepage according to claim 3, characterized in that the training set feature information also comprises part of speech: a Chinese lexical analysis system is used to perform part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.
5. The method for finding a personal homepage according to claim 3 or 4, characterized in that the training set feature information also comprises other features, the other features comprising:
whether the URL contains a noise word;
whether the name of the target person appears in the title; and
the position at which the name of the target person appears in the web page summary.
6. The method for finding a personal homepage according to claim 1, characterized in that, in step E, SVM-light is used to build the first model, and in step G, SVM-light and the first model are used to analyze the test set feature information.
7. The method for finding a personal homepage according to claim 3, characterized in that, in step H, the personal homepage judgment rule is: if any of the following situations occurs, the weight of the prediction result is reduced,
H1: the web page summary contains year, month and day information;
H2: the name of the target person appears more than three times in the web page summary;
H3: the name of the target person appears in the second half of the web page summary and only among the co-authors of a paper.
CN201510394587.3A 2015-07-07 2015-07-07 The lookup method of personal homepage Active CN105095400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510394587.3A CN105095400B (en) 2015-07-07 2015-07-07 The lookup method of personal homepage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510394587.3A CN105095400B (en) 2015-07-07 2015-07-07 The lookup method of personal homepage

Publications (2)

Publication Number Publication Date
CN105095400A true CN105095400A (en) 2015-11-25
CN105095400B CN105095400B (en) 2019-02-05

Family

ID=54575837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510394587.3A Active CN105095400B (en) 2015-07-07 2015-07-07 Method for finding personal homepage

Country Status (1)

Country Link
CN (1) CN105095400B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1350245A (en) * 2001-12-03 2002-05-22 上海交通大学 Personal homepage content safety monitoring method
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXANDER HOLD ET AL: "ECIR-a Lightweight Approach for Entity-centric Information Retrieval", 《THE NINETEENTH TEXT RETRIEVAL CONFERENCE (TREC 2010) PROCEEDINGS》 *
李丽娜 等: "中文专家实体主页识别方法研究", 《广西师范大学学报:自然科学版》 *
李毅: "学术主页信息抽取系统的研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126618A (en) * 2016-06-22 2016-11-16 清华大学 Email address based on name recommends method and system
CN108733634A (en) * 2017-04-20 2018-11-02 北大方正集团有限公司 The recognition methods of bibliography and identification device
CN108090223A (en) * 2018-01-05 2018-05-29 牛海波 A kind of opening scholar portrait method based on internet information
CN112767022A (en) * 2021-01-13 2021-05-07 平安普惠企业管理有限公司 Mobile application function evolution trend prediction method and device and computer equipment
CN112767022B (en) * 2021-01-13 2024-02-27 湖南天添汇见企业管理咨询服务有限责任公司 Mobile application function evolution trend prediction method and device and computer equipment

Also Published As

Publication number Publication date
CN105095400B (en) 2019-02-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant