CN105095400B - The lookup method of personal homepage - Google Patents
The lookup method of personal homepage
- Publication number
- CN105095400B (application number CN201510394587.3A)
- Authority
- CN
- China
- Prior art keywords
- training set
- target person
- search
- characteristic information
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
Abstract
The invention discloses a method for finding personal homepages, comprising the following steps: entering key information into a search engine to obtain search results, and taking the search results closest to the key information as a data set; extracting part of the data text from the data set and labeling it; dividing the labeled data text into a training set and a test set; extracting training-set feature information from the training set; modeling the training-set feature information to obtain a first model; extracting test-set feature information from the test set; analyzing the test-set feature information with the first model to obtain a prediction result; judging the prediction result; iterating with ten-fold cross-validation and selecting the optimal model; and using the optimal model to judge whether a search result is the personal homepage of the target person. The invention has the following advantages: it is highly adaptable, and the training set can be updated and expanded by collecting training samples during actual use, which in turn improves the applicability and search accuracy of the method.
Description
Technical field
The present invention relates to the fields of computer technology and web information technology, and in particular to a method for finding personal homepages.
Background art
Expert finding is a very important aspect of the information retrieval field [1]. Many applications, from the expert database that the National Natural Science Foundation uses to commission reviewers, to reviewer recommendation systems for international conferences, to the doctor recommendation features of medical websites that ordinary people use, all need the support of a large expert database. In particular, the government recently issued the "Notice of the General Office of the Ministry of Science and Technology on Improving and Supplementing Expert Information in the National Science and Technology Expert Database", which shows that improving expert information and building expert databases is of great significance. However, in building an expert database, especially a large one with more than ten thousand people, maintaining and updating the experts' personal information is a very time-consuming and labor-intensive but very important task. The accuracy and completeness of that information have an important effect on the service quality of the expert database. With the popularization and development of the Internet, many researchers have set up personal homepages and keep their personal information up to date there, which makes homepages an important channel for quickly obtaining experts' personal information. This patent proposes an automatic personal homepage lookup method with high accuracy and strong adaptability. Combined with automatic information extraction techniques and manual labeling work, the method can greatly improve the efficiency of updating experts' personal information in an expert database, and thereby improve the database's service quality.
Personal homepage lookup means the following: given a person's name and employer, find, from the mass of information on the Internet, the page that contains that person's personal information. The page may be a web page the person built, or an introduction page set up by the person's employer. Some similar research already exists. For example, Zuo Nan et al. describe using an SVM to find pages that are useful for building a social network [2]. Although the method is similar, a personal homepage is more specific than a useful page and is therefore harder to mine. The research of Tang Jie et al. [3,4] does go down to the level of personal homepages, but it stops at page classification; and because of the length limit on Google snippets and possible errors in manual labeling, the extraction results still leave room for improvement.
In this patent, in view of the characteristics of personal homepages and the shortcomings of earlier work, we propose a personal homepage lookup method that combines rules with machine learning. The method first uses the Google search engine to obtain a high-quality data source that may contain personal homepages, and then labels part of the data manually. Since any given web page either is or is not the desired personal homepage, homepage lookup can be treated as a binary classification problem. The patent trains a support vector machine (SVM), a classic classification algorithm, on the labeled data to obtain a reasonably good model, and finally combines the model with predefined rule-based filtering to find the desired personal homepage. The method effectively addresses the problem of insufficient classification accuracy caused by the limited extent to which Google search results reflect web page content.
Bibliography:
[1] Liu Jian, Li Qi, Liu Baohong, Zhang Yun. Expert finding method based on topic model. Journal of National University of Defense Technology, Vol. 35, No. 2, 2013.
[2] Zuo Nan, Li Juanzi, Tang Jie. Portrait photo extraction based on SVM. Third National Academic Conference on Information Retrieval and Content Security, 2007.
[3] J. Tang, L. Yao, D. Zhang, and J. Zhang. A combination approach to web user profiling. ACM TKDD, 5(1):1–44, 2010.
[4] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. KDD, pages 990–998, 2008.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, an object of the invention is to propose a method for finding personal homepages.
To achieve the above object, an embodiment of the first aspect of the invention discloses a method for finding personal homepages, comprising the following steps:
A: entering key information into a search engine to obtain search results, and taking a first preset number of the search results closest to the key information as a data set;
B: extracting part of the data text from the data set and labeling it manually to indicate whether each item is the personal homepage of the target person;
C: dividing the labeled data text into a training set of a second preset number and a test set of a third preset number;
D: extracting training-set feature information from the training set;
E: modeling the training-set feature information with an SVM to obtain a first model;
F: extracting test-set feature information from the test set;
G: analyzing the test-set feature information with the first model to obtain a prediction result;
H: judging the prediction result according to preset personal homepage judgment rules;
I: iterating steps C to H with ten-fold cross-validation and selecting the optimal model;
J: using the optimal model to judge whether a search result is the personal homepage of the target person.
With the personal homepage lookup method according to the embodiment of the invention, a person's homepage can be found quickly and accurately from simple given information, and the person's details can then be extracted by automatic algorithms or manual labeling, including contact information (email, phone, address, etc.), a personal profile, research interests, projects undertaken and a list of papers. These details are a necessary condition for building talent databases such as expert think tanks and review expert databases, and the completeness of the information has a major effect on application services such as expert recommendation and reviewer recommendation. Many large talent databases exist today; the review expert database of the Natural Science Foundation, for example, contains more than 140,000 people, and updating and maintaining these experts' information is a very time-consuming and laborious but very important undertaking. Combined with automatic information retrieval algorithms, the personal homepage lookup method of the embodiment can greatly improve the efficiency of updating personal information in a talent database, which is important for keeping talent database information current and improving its service quality.
In addition, the personal homepage lookup method according to the above embodiment of the invention may also have the following additional technical features:
Further, in step A, the key information includes: a first search phrase, comprising the target person's name and the target person's employer; a second search phrase, comprising the target person's name and the word "homepage"; and a third search phrase, comprising the target person's name and the word "email".
Further, in step D, the training-set feature information includes the TFIDF value of each word in the training set, where TFIDF is calculated as:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
where t is a word, d is the document in which the word appears, D is the entire corpus, tf is the term frequency, and idf is the inverse document frequency;
where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in that document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; and the titles and the web page snippets of the search results are treated as two independent corpora, with TFIDF values calculated separately in their respective word spaces.
Further, the training-set feature information also includes part of speech: a Chinese lexical analysis system performs part-of-speech analysis on the title of each search result, and the number of occurrences of each part of speech is counted.
Further, the training-set feature information also includes other features, comprising: whether the URL contains a noise word; whether the target person's name appears in the title; and the position at which the target person's name appears in the web page snippet.
Further, in step E, the first model is built with SVM-light; in step G, the test-set feature information is analyzed with SVM-light and the first model.
Further, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced. H1: the web page snippet contains year, month and day information; H2: the target person's name occurs more than three times in the web page snippet; H3: the target person's name appears in the latter half of the web page snippet and only among paper co-authors.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, will in part become apparent from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawing, in which:
Fig. 1 is a flowchart of homepage extraction according to one embodiment of the invention.
Specific embodiment
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.
In the description of the invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they are therefore not to be understood as limiting the invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
In the description of the invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected" and "coupled" are to be understood broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the invention according to the specific circumstances.
These and other aspects of the embodiments of the invention will become clear with reference to the following description and drawings. The description and drawings specifically disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be carried out, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
The personal homepage lookup method according to an embodiment of the invention is described below with reference to the drawing.
Fig. 1 is a flowchart of homepage extraction according to one embodiment of the invention; please refer to Fig. 1.
One, obtaining a high-quality data set
With the development of the Internet, more and more information can be obtained from the network without leaving home. Statistics show that a considerable share of researchers have their own personal homepage online. The researcher information listed on a personal homepage is a necessary condition for building talent databases such as expert think tanks and review expert databases, and the completeness of this information has a major effect on application services such as expert recommendation and reviewer recommendation, so obtaining personal homepage data is critical. Thanks to the development of search engines, these data can be obtained through well-chosen keyword searches. The popular search engines at present are Baidu, Bing and Google; considering that researchers work internationally, this patent uses Google, the largest search engine in the world, as the tool for obtaining the data set. Using the Google Search API with specific phrases as search keywords, we obtain search results that may contain a researcher's homepage.
The interface address of the Google Search API is as follows:
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&hl=zh-CN&rsz=large&q=
Three key search phrases are used in the patent:
researcher's name + space + researcher's employer
researcher's name + space + "Homepage"
researcher's name + space + "email"
For each key phrase, we fetch the first page of results that the Google search engine returns for that phrase (at most 10 results). Each result contains three items: the title, the web page link and the web page snippet (Title, URL, Snippet).
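As an illustration only, the three key phrases and their request URLs could be assembled as in the following Python sketch. The function names build_queries and build_urls are hypothetical and not part of the patent. The base URL is the one quoted above; that endpoint may no longer be available, so the sketch only constructs the strings and sends no request.

```python
from urllib.parse import quote

# Base URL as quoted in the patent text; the service may no longer be available.
BASE_URL = ("http://ajax.googleapis.com/ajax/services/search/web"
            "?v=1.0&hl=zh-CN&rsz=large&q=")


def build_queries(name, employer):
    """Return the three key search phrases for one researcher."""
    return [
        f"{name} {employer}",  # name + space + employer
        f"{name} Homepage",    # name + space + "Homepage"
        f"{name} email",       # name + space + "email"
    ]


def build_urls(name, employer):
    """Percent-encode each phrase and append it to the base URL."""
    return [BASE_URL + quote(query) for query in build_queries(name, employer)]
```

Each URL would then be fetched once, and the Title, URL and Snippet of at most 10 results kept per phrase.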
The data set used in this patent was selected from the 140,000 experts of the National Natural Science Foundation of China, 1,000 people in total, covering multiple disciplines such as medicine, computer science and natural science.
Two, data labeling
The fetched data are stored in a text file, one result per line. Annotators inspect each line and decide whether it is the personal homepage of the researcher: if it is, the line is labeled 1; if not, it is labeled -1.
Three, data set splitting
The data must be split when modeling with an SVM. The modeling process needs equal numbers of positive and negative examples, but negative examples far outnumber positive ones in the labeled data. We therefore take all positive examples from the data and then randomly draw an equal number of negative examples; together they form the data set needed for SVM modeling.
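The balancing step can be sketched as random undersampling of the negative examples. This is a minimal sketch under assumptions: the function name balance, the (label, features) tuple layout and the fixed random seed are all illustrative choices, not details given in the patent.

```python
import random


def balance(samples, seed=0):
    """Keep all positive examples and an equal number of random negatives.

    samples: list of (label, features) pairs with label 1 or -1.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    positives = [s for s in samples if s[0] == 1]
    negatives = [s for s in samples if s[0] == -1]
    # Draw as many negatives as there are positives (or all of them if fewer).
    chosen = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + chosen
    rng.shuffle(balanced)
    return balanced
```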
The experiment is divided into a training process and a test process. We use ten-fold cross-validation: the labeled data set is randomly divided into ten equal parts, and each time nine parts are used as the training set and one part as the test set.
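The ten-fold split described above could look like the following sketch. The generator name ten_fold_splits and the seed are assumptions made for illustration; when the data size is not a multiple of ten, the folds in this sketch differ in size by at most one.

```python
import random


def ten_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs: each fold is the test set exactly once."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items into k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

The model with the best test result over the ten rounds would then be kept as the optimal model.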
Four, feature extraction
The quality of feature selection directly affects the result of the classification algorithm. Besides using the TFIDF value of each word as a feature, this patent also adds other key features, including part of speech and URL features.
TFIDF feature:
TFIDF is a weighting technique commonly used in information retrieval and text mining to assess how important a word is to a document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. TFIDF is calculated as:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
where t is a word, d is the document in which the word appears, and D is the set of all documents, that is, the entire corpus. tf stands for term frequency and idf for inverse document frequency.
For a word i in any document j, the term frequency tf of i is the number of times n_{i,j} that the word occurs in document j divided by the total number of words in the document. The idf value of i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word. In this patent, the Title and Snippet parts of the search results are treated as two independent corpora, and TFIDF values are calculated separately in their respective word spaces. ICTCLAS is used to segment the Chinese content. ICTCLAS is a word segmentation tool released by the Chinese Academy of Sciences after more than ten years of research; it has won multiple awards and has many users. The latest version can be found at http://ictclas.nlpir.org/.
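The TFIDF definition above can be written out directly. The sketch below assumes the documents have already been segmented into word lists (for example by ICTCLAS, which is not called here) and uses the natural logarithm, since the text does not state a base. The function name tfidf is illustrative.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Compute TFIDF weights for a corpus of tokenized documents.

    corpus: list of documents, each a list of words.
    Returns one dict per document mapping word -> tf * idf.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights
```

To follow the patent, the function would be called twice, once on all titles and once on all snippets, so that each corpus has its own word space.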
Part-of-speech feature:
In text analysis, part-of-speech features also have an important effect on classification. This patent uses ICTCLAS to perform part-of-speech analysis on the Title of each search result and counts the number of occurrences of each part of speech; the count for each part of speech is used as a feature.
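Counting parts of speech can be sketched as follows, assuming the title has already been tagged into (word, tag) pairs by a tagger such as ICTCLAS. The tag list POS_TAGS is a hypothetical example and not the tag set of any particular tool; the function name is likewise illustrative.

```python
from collections import Counter

# Hypothetical tag inventory; a real tagger defines its own tag set.
POS_TAGS = ["n", "nr", "v", "a", "d"]


def pos_count_features(tagged_title, tags=POS_TAGS):
    """Return one count per tag, in the fixed order given by `tags`."""
    counts = Counter(tag for _, tag in tagged_title)
    return [counts.get(tag, 0) for tag in tags]
```

Keeping the tag order fixed means every search result yields a vector of the same length, which is what the SVM input requires.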
Other features:
Whether the URL contains a sensitive keyword: the feature value is 1 if one occurs and 0 otherwise. Sensitive keywords include unsuitable result formats such as pdf, xls and doc; news-type websites such as baidu, weibo.com, 163.com, qq.com and sohu.com; and strings such as 360doc, renren, sina, download and news, which frequently appear in results that interfere with finding the correct one.
Whether the researcher's name appears in the Title: the feature value is 1 if it appears and 0 otherwise.
The position of the researcher's name in the Snippet, that is, whether it is in the first half or the latter half. The feature value is 1 if the name appears in the first half (including the exact middle) and 0 if it appears in the latter half.
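The three features above might be computed as in this sketch. The keyword list is taken from the text; the simple substring matching, the use of the first occurrence of the name, and the value 0 when the name is absent from the snippet are assumptions, since the patent does not specify them.

```python
NOISE_WORDS = ["pdf", "xls", "doc", "baidu", "weibo.com", "163.com", "qq.com",
               "sohu.com", "360doc", "renren", "sina", "download", "news"]


def other_features(name, title, url, snippet):
    """Return [url_has_noise_word, name_in_title, name_in_first_half]."""
    url_lower = url.lower()
    has_noise = int(any(word in url_lower for word in NOISE_WORDS))
    name_in_title = int(name in title)
    position = snippet.find(name)  # first occurrence, -1 if absent
    # 1 if the name starts in the first half (including the middle);
    # 0 for the latter half or when the name is absent (assumption).
    in_first_half = int(position != -1 and position <= len(snippet) / 2)
    return [has_noise, name_in_title, in_first_half]
```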
Five, SVM training and testing
This patent uses SVM-light for training and testing. SVM-light is an open-source SVM tool developed by Joachims; because it is fast and accurate, it is widely used in research and in practical applications. A detailed description of SVM-light and instructions for its use can be found at http://www.cs.cornell.edu/People/tj/svm_light/.
In the experiment, the svm_learn command first learns a model from the feature file generated from the training set, and the svm_classify command then applies that model to the test set to obtain the results.
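SVM-light reads one example per line in the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The sketch below formats one labeled feature vector in that way; the function name and the file names in the comments are hypothetical, and omitting zero-valued features is a common convention rather than something the patent states.

```python
def to_svmlight_line(label, features):
    """Format one example as an SVM-light input line.

    label: 1 or -1.
    features: dict mapping feature index (int, starting at 1) to value.
    """
    parts = [f"{index}:{value:g}"
             for index, value in sorted(features.items())
             if value != 0]  # sparse format: zero-valued features are omitted
    return " ".join([f"{label:+d}"] + parts)


# With one such line per example written to train.dat and test.dat
# (hypothetical file names), the two commands named above are run as:
#   svm_learn train.dat model
#   svm_classify test.dat model predictions
```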
Six, combining SVM predictions with rules to screen personal homepages
Through repeated training, we obtain a model with relatively good results. When the model is used to predict on the test data, each search result receives a numerical value. The closer the value is to 1, the more likely the result is a positive example; the closer it is to -1, the more likely it is a negative example. Because manual labeling may contain errors and Google search results reflect web page content only to a limited extent, the classification results are not entirely satisfactory. To address this, the patent introduces several rules to help screen personal homepages:
1) The Snippet contains detailed date (time) information
2) The name occurs more than three times in the Snippet
3) The name appears in the latter half of the Snippet and only among paper co-authors
For a result predicted as a positive example, if any of the above situations occurs, 0.3 is subtracted from its prediction score. After this processing, all the scores of the classification results are sorted and a threshold is set (for example 0.6); web pages whose score exceeds the threshold are considered the researcher's personal homepage.
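The rule-based adjustment can be sketched as below. The penalty 0.3 and the example threshold 0.6 come from the text; everything else is an assumption made for illustration: the date pattern, treating a score above 0 as a predicted positive, and the only_in_coauthors flag, which stands in for a co-author check that the patent does not spell out.

```python
import re

# Assumed date formats: "2014年5月3日" or "2014-05-03".
DATE_PATTERN = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}")


def adjust_score(score, name, snippet, only_in_coauthors=False, penalty=0.3):
    """Subtract the penalty once if any of the three rules fires."""
    if score <= 0:  # rules apply only to predicted positive examples
        return score
    has_date = DATE_PATTERN.search(snippet) is not None
    too_many = snippet.count(name) > 3
    in_latter_half = snippet.find(name) > len(snippet) / 2
    if has_date or too_many or (in_latter_half and only_in_coauthors):
        return score - penalty
    return score


def is_homepage(score, threshold=0.6):
    """A page counts as a personal homepage if its score exceeds the threshold."""
    return score > threshold
```

In use, every SVM score would first pass through adjust_score, and the adjusted scores would then be sorted and compared against the threshold.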
In addition, other components and functions of the personal homepage lookup method of the embodiment of the invention are known to those skilled in the art and, to reduce redundancy, are not described here.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the invention, the scope of which is defined by the claims and their equivalents.
Claims (4)
1. A method for finding personal homepages, characterized by comprising the following steps:
A: entering key information into a search engine to obtain search results, and taking a first preset number of the search results closest to the key information as a data set, the search results comprising the first page of the search results pages;
B: extracting part of the data text from the data set and labeling it manually to indicate whether each item is the personal homepage of the target person;
C: dividing the labeled data text into a training set of a second preset number and a test set of a third preset number;
D: extracting training-set feature information from the training set, wherein the training-set feature information includes the TFIDF value of each word in the training set, and TFIDF is calculated as:
tfidf(t, d, D) = tf(t, d) × idf(t, D)
where t is a word, d is the document in which the word appears, D is the entire corpus, tf is the term frequency, and idf is the inverse document frequency;
where, for a word i in any document j, the term frequency tf of word i is the number of times n_{i,j} that word i occurs in document j divided by the total number of words in that document; the idf value of word i is the logarithm of the number of documents in the corpus divided by the number of documents containing the word; the titles and the web page snippets of the search results in the training set are treated as two independent corpora, with TFIDF values calculated separately in their respective word spaces; the training-set feature information further includes part of speech, a Chinese lexical analysis system being used to perform part-of-speech analysis on the title of each search result in the training set, and the number of occurrences of each part of speech being counted; the training-set feature information further includes other features, comprising: whether the URL contains a noise word; whether the target person's name appears in the title; and the position at which the target person's name appears in the web page snippet;
E: modeling the training-set feature information with an SVM to obtain a first model;
F: extracting test-set feature information from the test set;
G: analyzing the test-set feature information with the first model to obtain a prediction result;
H: judging the prediction result according to preset personal homepage judgment rules;
I: iterating steps C to H with ten-fold cross-validation and selecting the optimal model;
J: using the optimal model to judge whether a search result in the test set is the personal homepage of the target person.
2. The method for finding personal homepages according to claim 1, characterized in that, in step A, the key information includes:
a first search phrase, comprising the target person's name and the target person's employer;
a second search phrase, comprising the target person's name and the word "homepage"; and
a third search phrase, comprising the target person's name and the word "email".
3. The method for finding personal homepages according to claim 1, characterized in that, in step E, the first model is built with SVM-light, and in step G, the test-set feature information is analyzed with SVM-light and the first model.
4. The method for finding personal homepages according to claim 1, characterized in that, in step H, the personal homepage judgment rule is: if any one of the following situations occurs, the weight of the prediction result is reduced,
H1: the web page snippet contains year, month and day information;
H2: the target person's name occurs more than three times in the web page snippet;
H3: the target person's name appears in the latter half of the web page snippet and only among paper co-authors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510394587.3A CN105095400B (en) | 2015-07-07 | 2015-07-07 | The lookup method of personal homepage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510394587.3A CN105095400B (en) | 2015-07-07 | 2015-07-07 | The lookup method of personal homepage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105095400A CN105095400A (en) | 2015-11-25 |
CN105095400B true CN105095400B (en) | 2019-02-05 |
Family
ID=54575837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510394587.3A Active CN105095400B (en) | 2015-07-07 | 2015-07-07 | The lookup method of personal homepage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105095400B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126618B (en) * | 2016-06-22 | 2019-08-09 | 清华大学 | Email address recommended method and system based on name |
CN108733634A (en) * | 2017-04-20 | 2018-11-02 | 北大方正集团有限公司 | The recognition methods of bibliography and identification device |
CN108090223B (en) * | 2018-01-05 | 2020-05-12 | 牛海波 | Openers portrait method based on internet information |
CN112767022B (en) * | 2021-01-13 | 2024-02-27 | 湖南天添汇见企业管理咨询服务有限责任公司 | Mobile application function evolution trend prediction method and device and computer equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1350245A (en) * | 2001-12-03 | 2002-05-22 | 上海交通大学 | Personal homepage content safety monitoring method |
CN102254014B (en) * | 2011-07-21 | 2013-06-05 | 华中科技大学 | Adaptive information extraction method for webpage characteristics |
-
2015
- 2015-07-07 CN CN201510394587.3A patent/CN105095400B/en active Active
Non-Patent Citations (3)
Title |
---|
ECIR-a Lightweight Approach for Entity-centric Information Retrieval;alexander hold et al;《the nineteenth text retrieval conference (TREC 2010) proceedings》;20101231;第1-9页 |
中文专家实体主页识别方法研究;李丽娜 等;《广西师范大学学报:自然科学版》;20110331;第29卷(第1期);第157-161页 |
学术主页信息抽取系统的研究;李毅;《中国优秀硕士学位论文全文数据库信息科技辑》;20120715(第07期);第I139-425页 |
Also Published As
Publication number | Publication date |
---|---|
CN105095400A (en) | 2015-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sharma et al. | A brief review on search engine optimization | |
CN104965905B (en) | A kind of method and apparatus of Web page classifying | |
CN103177090B (en) | A kind of topic detection method and device based on big data | |
CN103473280B (en) | Method for mining comparable network language materials | |
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system | |
CN105095400B (en) | The lookup method of personal homepage | |
CN106156372B (en) | A kind of classification method and device of internet site | |
CN110543595B (en) | In-station searching system and method | |
Chen et al. | Building and analyzing a global co-authorship network using google scholar data | |
Nikhil et al. | A survey on text mining and sentiment analysis for unstructured web data | |
Prajapati | A survey paper on hyperlink-induced topic search (HITS) algorithms for web mining | |
CN103116635A (en) | Field-oriented method and system for collecting invisible web resources | |
CN103914538B (en) | theme capturing method based on anchor text context and link analysis | |
CN110321471A (en) | A kind of internet techno-financial intelligent Matching method based on the convergence of policy resource | |
Dueñas-Fernández et al. | Detecting trends on the web: A multidisciplinary approach | |
Patil et al. | Search engine optimization technique importance | |
Li et al. | CoWS: An Internet-enriched and quality-aware Web services search engine | |
KR101401175B1 (en) | Method and system for text mining using weighted term frequency | |
Yan et al. | An improved PageRank method based on genetic algorithm for web search | |
CN103678601A (en) | Model essay retrieval request processing method and device | |
Kang et al. | Modeling web crawler wrappers to collect user reviews on shopping mall with various hierarchical tree structure | |
CN111709238A (en) | Web page geoscience correlation calculation method based on geoscience expert knowledge | |
Patil et al. | Implementation of enhanced web crawler for deep-web interfaces | |
CN103699602B (en) | A kind of method and apparatus for setting up model essay webpage database | |
JP2018206189A (en) | Information collection device and information collection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |