CN101261629A - Specific information searching method based on automatic classification technology - Google Patents

Publication number
CN101261629A
Authority
CN
China
Prior art keywords
webpage
field
full
training
relevant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100363692A
Other languages
Chinese (zh)
Inventor
孟浩华
曾雪强
李国正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CNA2008100363692A
Publication of CN101261629A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a specific information searching method based on automatic classification technology. First, a web spider collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant, and a machine learning algorithm is trained on the set to obtain an automatic web page classifier. Next, the web spider collects web pages on a large scale, the trained classifier judges whether each page is relevant to the domain, and the relevant pages are stored in a full-text index built on an inverted list. Finally, a retrieval interface lets users query the full-text index for domain-relevant pages. The method comprises three modules: a classifier training module, a web page collection and indexing module, and an information search module. Compared with general-purpose search methods, the method of the invention has a higher search hit rate, returns less duplicate information, and ranks relevant information nearer the top.

Description

Specific information searching method based on automatic classification technology
Technical field
The present invention is an information search method for a particular professional domain. It involves full-text retrieval based on an inverted list, automatic text classification, and related techniques.
Background technology
The explosive growth of the Internet has made the information resources available on the network ever more abundant, but it has also made them harder to obtain: people struggle to find the information they need in a vast and disordered ocean of information. Search methods emerged to solve this practical problem for network users. Generally speaking, a search method is an application system on the World Wide Web that receives the information need a user submits and tries to return, within a limited time, the information most relevant to that need.
Search methods commonly used on the Internet today include Yahoo, Google, and Baidu. The collection strategies of these general-purpose search engines, however, do not take into account a user's need for specific information, so they find it hard to be both precise and specialized. In general, the great majority of the results returned by a general-purpose search are pages irrelevant to the user's need. Because of this shortcoming, a growing number of specialized search products have emerged, such as music search, lyrics search, picture search, and search for video and other multimedia files.
Most specialized search methods, however, are built on a file type or on expert knowledge of some other particular domain. No technique has yet appeared for constructing a specific information searching method for any given domain.
Summary of the invention
The object of the present invention is to provide a specific information searching method, based on automatic classification technology, that can be applied to any particular domain. A comparatively mature text classification model identifies the web pages on the Internet that actually contain information about the domain, these pages are stored in a full-text index built on an inverted list, and users are given a specialized search interface based on full-text retrieval.
To achieve this object, the present invention adopts the following technical solution: a specific information searching method based on automatic classification technology, characterized in that a web spider first collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled (as domain-relevant or domain-irrelevant), and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier; next, the web spider collects web pages of the domain on a large scale (the classifier built earlier judges whether each page is domain-relevant), and a full-text index based on an inverted list is built to store the relevant pages; finally, a retrieval interface is provided so that users can conveniently query the full-text index for domain-relevant pages.
The operation comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.
The classifier training module produces an automatic classifier that can judge whether a web page is "domain-relevant". Its flow chart is shown in Fig. 1.
The steps are as follows:
A) Collect a number of representative training web pages with the web spider.
B) Label the pages manually. Personnel familiar with the domain divide the pages into two classes, "domain-relevant" and "domain-irrelevant" (because this is a simple two-class labeling task, it places no high demands on those personnel).
C) Preprocess the pages and build a training document matrix based on the vector space model. Processing consists of removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, stop-word removal, and construction of the document vectors.
D) Train the classifier. A support vector machine (SVM) classification model, which has comparatively high classification accuracy, is used (the SVM classifier is widely used in the machine learning field, and with sufficient training its classification accuracy can exceed 90%).
E) Save the classification model: the classifier parameters together with the information needed to build document vectors.
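The preprocessing in step C can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `TextExtractor`, `html_to_tokens`, `build_vocabulary`, and `to_vector` are hypothetical, raw term counts stand in for whatever term weighting the authors used, and a simple regular-expression tokenizer stands in for a real Chinese word segmenter.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_tokens(html, stop_words):
    """Strip HTML markup, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Assumption: a real system would call a Chinese word segmenter here.
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in stop_words]


def build_vocabulary(token_lists):
    """Map each distinct term to a column index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            vocab.setdefault(term, len(vocab))
    return vocab


def to_vector(tokens, vocab):
    """Term-count document vector; terms outside the vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in vocab)
    vector = [0] * len(vocab)
    for term, n in counts.items():
        vector[vocab[term]] = n
    return vector


STOP_WORDS = {"the", "a", "in"}
pages = [
    "<html><body><h1>Hotel rooms</h1><script>var x=1;</script>"
    "Book a hotel in Shanghai</body></html>",
    "<p>The weather in Shanghai</p>",
]
token_lists = [html_to_tokens(p, STOP_WORDS) for p in pages]
vocab = build_vocabulary(token_lists)
matrix = [to_vector(t, vocab) for t in token_lists]
```

Saving `vocab` alongside the classifier parameters corresponds to step E: the same term-to-column mapping is needed later to vectorize newly crawled pages.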
The web page collection and indexing module produces an inverted-list-based full-text index of domain-relevant web pages, which serves as the data source for user retrieval. Its flow chart is shown in Fig. 2.
The steps are as follows:
A) Collect web pages on a massive scale with the web spider. Some particular pages are set as the spider's start pages. The spider reads the content of these start pages, extracts the hyperlink addresses they contain, and follows those addresses to the next pages. The cycle continues until a termination condition is triggered and collection stops.
B) Preprocess the collected pages and build their document vectors under the vector space model. Processing consists of removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, stop-word removal, and construction of the document vectors (this requires the information saved during training for building the document vector matrix).
C) Classify the collected pages. The SVM classification model built by the classifier training module judges each document vector; pages judged "domain-irrelevant" are discarded, and only pages judged "domain-relevant" are kept.
D) Build the full-text index. A full-text index based on the inverted-list technique stores the "domain-relevant" pages. To cope with massive volumes of information, the open-source full-text indexing engine Lucene is used to build the index database.
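Steps C and D can be illustrated with a toy in-memory inverted list. This is only a sketch of the data structure: the patent uses Lucene for the real index, and the `InvertedIndex` class, the `index_relevant_pages` helper, and the keyword test standing in for the SVM classifier are all assumptions made for the example.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted list: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.documents = {}  # doc_id -> URL

    def add(self, url, tokens):
        doc_id = len(self.documents)
        self.documents[doc_id] = url
        for term in tokens:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        return doc_id

    def lookup(self, term):
        """Return the set of doc_ids containing the term."""
        return set(self.postings.get(term, {}))


def index_relevant_pages(pages, is_relevant, index):
    """Index only the pages the classifier accepts; return how many were kept."""
    kept = 0
    for url, tokens in pages:
        if is_relevant(tokens):
            index.add(url, tokens)
            kept += 1
    return kept


# Hypothetical crawled pages, already preprocessed into tokens.
crawled = [
    ("http://example.test/hotel", ["hotel", "room", "hotel"]),
    ("http://example.test/news", ["weather", "news"]),
    ("http://example.test/inn", ["inn", "hotel"]),
]
index = InvertedIndex()
# Assumption: a keyword test stands in for the trained SVM classifier.
kept = index_relevant_pages(crawled, lambda tokens: "hotel" in tokens, index)
```

Because irrelevant pages never enter the index, every later query is answered from domain-relevant pages only, which is the source of the higher hit rate the patent claims.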
The information search module gives the user an interface for retrieving information in the specific domain; it queries the full-text index for pages relevant to the domain and presents them to the user. Its flow chart is shown in Fig. 3.
The steps of this module are as follows:
A) The user supplies a query. The user enters the query in the web query interface that the system provides.
B) Analyze and process the query. This includes splitting the query into terms (word segmentation) and handling "NOT", "AND", and "OR" combinations, and yields the processed query.
C) Full-text retrieval. The full-text index is searched for pages that satisfy the query. Query modes include traditional keyword matching and semantic querying based on synonym expansion.
D) Present the results. The retrieved pages are sorted by relevance and presented to the user as a list on a web page.
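Steps B to D can be sketched as follows, assuming a postings dictionary of the form term -> {doc_id: term frequency}. The patent does not give its query grammar, synonym source, or relevance formula, so everything here is illustrative: operators are applied left to right without precedence, adjacent terms default to AND, the synonym table is hand-written, and the score is a plain sum of term frequencies.

```python
def expand(term, synonyms):
    """A term plus its synonyms (synonym expansion)."""
    return {term} | set(synonyms.get(term, ()))


def docs_for(term, postings, synonyms):
    """All documents containing the term or any of its synonyms."""
    docs = set()
    for t in expand(term, synonyms):
        docs |= set(postings.get(t, {}))
    return docs


def search(query, postings, synonyms=None):
    """Evaluate a flat AND/OR/NOT query and rank the hits by summed term frequency."""
    synonyms = synonyms or {}
    all_docs = {d for plist in postings.values() for d in plist}
    result, op, negate, positive_terms = None, "AND", False, []
    for token in query.split():
        if token in ("AND", "OR"):
            op = token
            continue
        if token == "NOT":
            negate = True
            continue
        term = token.lower()
        docs = docs_for(term, postings, synonyms)
        if negate:
            docs = all_docs - docs
        else:
            positive_terms.append(term)
        if result is None:
            result = docs
        elif op == "AND":
            result &= docs
        else:
            result |= docs
        op, negate = "AND", False
    result = result or set()

    def score(doc):
        return sum(postings.get(t, {}).get(doc, 0)
                   for term in positive_terms
                   for t in expand(term, synonyms))

    # Highest score first; ties are broken by document id.
    return sorted(result, key=lambda d: (-score(d), d))


postings = {
    "hotel": {0: 2, 1: 1},
    "inn": {2: 1},
    "shanghai": {0: 1, 2: 1},
    "weather": {1: 3},
}
synonyms = {"hotel": ["inn"]}
hits = search("hotel AND shanghai", postings, synonyms)
```

With the synonym table in place, the query for "hotel" also matches document 2, which only contains "inn"; without it, that page would be missed by plain keyword matching.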
Compared with the prior art, the present invention has the following evident substantive features and notable advantages: the specialized search method it provides uses a comparatively mature text classification model to identify the web pages on the Internet that actually contain information about a specific domain, stores these pages in a full-text index built on an inverted list, and gives users a specialized search interface based on full-text retrieval. Compared with general-purpose search methods, the method of the invention has a higher search hit rate, returns less duplicate information, and ranks relevant information nearer the top.
Description of drawings
Fig. 1 is a block diagram of the information acquisition and training module.
Fig. 2 is a block diagram of the information acquisition and storage module.
Fig. 3 is a block diagram of the information search module.
Fig. 4 shows the search page.
Fig. 5 shows the search result list of the embodiment of the invention.
Fig. 6 shows the Baidu search result list.
Fig. 7 shows the Google search result list.
Embodiment
A preferred embodiment of the present invention takes the hotel domain as its example: we have developed a hotel information search method. The project has been put into commercial operation as a submodule of the hotel search engine website (http://www.hotelgoogle.com.cn/) funded by GreenTree Inn Hotel Shanghai Co., Ltd. The implementation flow of the hotel information search method is described below.
In this specialized search method based on automatic classification technology, a web spider first collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant, and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier. Next, the web spider collects web pages of the domain on a large scale, the classifier built earlier judges whether each page is domain-relevant, and a full-text index based on an inverted list is built to store the relevant pages. Finally, a retrieval interface is provided so that users can conveniently query the full-text index for domain-relevant pages. The operation comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.
Referring to Fig. 1, the classifier training module:
A Google search is run with "hotel" as the keyword, and the web spider collects the pages found, more than 6,000 in total. After pages with garbled text or dead links are removed, more than 4,000 pages remain.
These 4,000-odd pages are judged manually and divided into a "hotel-relevant" class and a "hotel-irrelevant" class; pages in the "hotel-relevant" class are labeled 1 and pages in the "hotel-irrelevant" class are labeled -1. After labeling, the pages are preprocessed: HTML markup is removed, irrelevant information in the page is removed, Chinese word segmentation is performed, stop words are removed, and document vectors are built. The resulting training matrix serves as the training set. Finally, an SVM is trained on the training set, and the resulting classification model is saved.
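Training on the +1/-1 labels can be sketched as below. The patent does not say which SVM implementation, kernel, or parameters were used, so this is a stand-in: a linear SVM fitted by subgradient descent on the regularized hinge loss, with hypothetical function names, hyperparameters, and a four-document toy training matrix whose two columns count the words "hotel" and "weather".

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    X is a list of document vectors; y holds +1 (relevant) or -1 (irrelevant).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks the weights on every step.
            w = [(1.0 - lr * lam) * wj for wj in w]
            # The hinge loss contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 for hotel-relevant, -1 for hotel-irrelevant."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy training matrix: columns are counts of "hotel" and "weather".
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

The saved model in this sketch is just the pair `(w, b)`; a production system would more likely rely on an established SVM library than on a hand-written trainer.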
Referring to Fig. 2, the information acquisition and storage module:
Several well-known domestic hotel information websites, such as Ctrip and eLong, are selected, and their pages serve as the start pages. The web spider reads the content of these start pages, extracts the hyperlink addresses they contain, and follows those addresses to the next pages. The cycle continues until the maximum duration of a collection run or the maximum number of pages to fetch is reached.
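The crawl loop with its two termination conditions can be sketched as a breadth-first traversal. The patent does not describe the spider's internals, so the names `LinkExtractor` and `crawl`, the breadth-first order, and the default limits are assumptions; the `fetch` callable is injected so that the example runs against an in-memory site instead of the network.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100, max_seconds=60.0):
    """Breadth-first crawl that stops at a page limit or a time limit.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    """
    deadline = time.monotonic() + max_seconds
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages and time.monotonic() < deadline:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip dead links
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical four-page site used in place of real HTTP requests.
site = {
    "http://example.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a><a href="/c">C</a>',
    "http://example.test/b": "no links here",
    "http://example.test/c": "leaf page",
}
collected = crawl(["http://example.test/"], site.get)
```

The `seen` set keeps the spider from revisiting pages that link back to each other, such as the home link on `/a` in the example site.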
The collected pages are preprocessed to obtain document vectors, and the SVM classification model built by the classifier training module judges each vector. Pages judged "hotel-irrelevant" are discarded and only pages judged "hotel-relevant" are kept. The open-source full-text indexing tool Lucene builds the index database over the pages that remain.
Referring to Fig. 3, the information search module:
The user opens the search page and searches with "GreenTree Inn" as the keyword (see Fig. 4).
Clicking the search button displays the search result page (see Fig. 5).
For comparison, the same keyword, "GreenTree Inn", is searched on Baidu and Google. The Baidu and Google result pages are shown in Fig. 6 and Fig. 7 respectively.
From the standpoint of the user's need (the user wants specific information about the "GreenTree Inn" hotel), the present invention finds 8 web pages, Baidu finds 1,300,000, and Google finds 484,000. In view of users' search habits, and for ease of comparison, only the first page of results is compared here.
Tables 2, 3, and 4 list the search result statistics for the present invention, Baidu, and Google respectively. They show that the hit rate of the present invention is 75%, with 1 duplicate; the hit rate of Baidu is 30%, with no duplicates; and the hit rate of Google is 70%, with 2 duplicates. As for the positions of relevant results, the results related to hotel information appear at positions 1-5 and 8 for the present invention, at positions 4, 8, and 9 for Baidu, and at positions 3-9 for Google.
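The reported hit rates can be reproduced from the listed positions of relevant results. One assumption is needed that the text does not state: the first result page of Baidu and of Google is taken to show 10 results, which is consistent with the reported 30% and 70%. The mean rank of the relevant results is an extra illustrative measure, not a figure from the patent.

```python
def hit_rate(relevant_ranks, results_shown):
    """Fraction of the displayed results that are relevant."""
    return len(relevant_ranks) / results_shown


def mean_rank(relevant_ranks):
    """Average position of the relevant results (lower is better)."""
    ranks = list(relevant_ranks)
    return sum(ranks) / len(ranks)


invention_ranks = [1, 2, 3, 4, 5, 8]  # 8 results returned in total
baidu_ranks = [4, 8, 9]               # assumed 10 results on the first page
google_ranks = range(3, 10)           # positions 3 to 9, assumed 10 results

rates = {
    "invention": hit_rate(invention_ranks, 8),
    "baidu": hit_rate(baidu_ranks, 10),
    "google": hit_rate(google_ranks, 10),
}
mean_ranks = {
    "invention": mean_rank(invention_ranks),
    "baidu": mean_rank(baidu_ranks),
    "google": mean_rank(google_ranks),
}
```

Under these assumptions the relevant results of the invention sit, on average, nearer the top of the list than those of either general-purpose engine.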
As the above shows, compared with general-purpose search methods such as Baidu and Google, the present invention has a higher search hit rate, less duplicate information, and relevant information ranked nearer the top.

Claims (4)

1. A specific information searching method based on automatic classification technology, characterized in that: a web spider first collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant, and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier; next, the web spider collects web pages of the specific domain on a large scale, the classifier built earlier judges whether each page is domain-relevant, and a full-text index based on an inverted list is built to store the relevant pages; finally, a retrieval interface is provided so that users can conveniently query the full-text index for domain-relevant pages; the operation comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.
2. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the classifier training module is to obtain an automatic classifier that can judge whether a web page is "domain-relevant", with the following steps:
A) Collect a number of representative training web pages with the web spider;
B) Label the pages manually; personnel familiar with the domain divide the pages into two classes, "domain-relevant" and "domain-irrelevant";
C) Preprocess the pages and build a training document matrix based on the vector space model; processing consists of removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, stop-word removal, and construction of the document vectors;
D) Train the classifier; an SVM classification model, which has comparatively high classification accuracy, is used for the training;
E) Save the classification model: the classifier parameters together with the information needed to build document vectors.
3. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the web page collection and indexing module is to obtain an inverted-list-based full-text index of domain-relevant web pages, which serves as the data source for user retrieval, with the following steps:
A) Collect web pages on a massive scale with the web spider; some particular pages are set as the spider's start pages, the spider reads the content of these start pages, extracts the hyperlink addresses they contain, and follows those addresses to the next pages, and the cycle continues until a termination condition is triggered and collection stops;
B) Preprocess the collected pages and build their document vectors under the vector space model; processing consists of removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, stop-word removal, and construction of the document vectors;
C) Classify the collected pages; the SVM classification model built by the classifier training module judges each document vector, pages judged "domain-irrelevant" are discarded, and only pages judged "domain-relevant" are kept;
D) Build the full-text index; a full-text index based on the inverted-list technique stores the "domain-relevant" pages; to cope with massive volumes of information, the open-source full-text indexing engine Lucene is used to build the index database.
4. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the information search module is to give the user an interface for retrieving information in the specific domain, querying the full-text index for pages relevant to the domain and presenting them to the user, with the following steps:
A) The user supplies a query; the user enters the query in the web query interface that the system provides;
B) Analyze and process the query; this includes splitting the query into terms and handling "NOT", "AND", and "OR" combinations, and yields the processed query;
C) Full-text retrieval; the full-text index is searched for pages that satisfy the query; query modes include traditional keyword matching and semantic querying based on synonym expansion;
D) Present the results; the retrieved pages are sorted by relevance and presented to the user as a list on a web page.
CNA2008100363692A 2008-04-21 2008-04-21 Specific information searching method based on automatic classification technology Pending CN101261629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100363692A CN101261629A (en) 2008-04-21 2008-04-21 Specific information searching method based on automatic classification technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100363692A CN101261629A (en) 2008-04-21 2008-04-21 Specific information searching method based on automatic classification technology

Publications (1)

Publication Number Publication Date
CN101261629A 2008-09-10

Family

ID=39962089

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100363692A Pending CN101261629A (en) 2008-04-21 2008-04-21 Specific information searching method based on automatic classification technology

Country Status (1)

Country Link
CN (1) CN101261629A (en)



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080910