CN101261629A

CN101261629A - Specific information searching method based on automatic classification technology

Info

Publication number: CN101261629A
Application number: CNA2008100363692A
Authority: CN
Inventors: 孟浩华; 曾雪强; 李国正
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2008-04-21
Filing date: 2008-04-21
Publication date: 2008-09-10

Abstract

The invention relates to a specific information searching method based on automatic classification technology. The technical scheme is as follows: first, a web spider collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant, and a machine learning algorithm is trained on the set to obtain an automatic web page classifier; next, the web spider collects web pages on a large scale, the classifier judges whether each page is relevant to the domain, and a full-text index library based on an inverted list is built to store the relevant pages; finally, a retrieval interface lets users query domain-relevant pages from the full-text index library. The method comprises three modules: a classifier training module, a web page collection and indexing module, and an information search module. Compared with general-purpose search methods, the method of the invention has a higher search hit rate, less duplicate information, and ranks relevant information higher.

Description

Specific information searching method based on automatic classification technology

Technical field

The present invention is an information search method for a particular professional domain, and relates to technologies such as inverted-list-based full-text retrieval and automatic text classification.

Background technology

The explosive growth of the Internet has made the information resources on the network increasingly abundant, but it has also made those resources harder to obtain: people struggle to find the information they need in a vast and disorderly sea of information. Search methods emerged to solve this practical problem faced by network users. Generally speaking, a search method is an application system on the World Wide Web that receives the information need submitted by a user and tries to return, within a limited time, the information most relevant to that need.

Search methods commonly used on the Internet today include Yahoo, Google, and Baidu. However, the collection strategies of these general-purpose search engines do not consider users' specific needs for particular information, so they struggle to be precise and specialized. In general, the great majority of the results returned by a general-purpose search are pages irrelevant to the user's need. Because of this shortcoming, more and more specialized search products have emerged, such as music search, lyrics search, picture search, and search for multimedia files such as video.

However, most specialized search methods rely on file types or on expert knowledge of a particular domain; no technique has yet appeared for constructing a specific information searching method for any given domain.

Summary of the invention

The object of the present invention is to provide a specific information searching method, based on automatic classification technology, that can be applied to any given domain. A relatively mature text classification model identifies the pages on the Internet that actually contain domain-specific information, a full-text index library based on an inverted list stores these pages, and a specialized search interface based on full-text retrieval is provided to the user.

To achieve this object, the present invention adopts the following technical scheme. A specific information searching method based on automatic classification technology is characterized in that: first, a web spider collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled (as domain-relevant or domain-irrelevant pages), and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier; next, the web spider collects domain-relevant pages on a large scale (the previously built classifier judges whether each page is relevant to the domain), and a full-text index library based on an inverted list is built to store these relevant pages; finally, a retrieval interface is provided so that users can conveniently query domain-relevant pages from the full-text index library.

The method comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.

The function of the classifier training module is to obtain an automatic classification module that can determine whether a web page is "domain-relevant"; the corresponding flowchart is shown in Fig. 1.

The concrete steps are as follows:

A) Collect a number of representative training web pages with the web spider;

B) Label the web pages manually; personnel familiar with the domain divide the pages into two classes, "domain-relevant pages" and "domain-irrelevant pages" (since this is a simple two-class labeling task, it places no high demands on the labelers);

C) Preprocess the web pages and build a training document matrix based on the vector space model; the processing operations comprise removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, removing stop words, and building the document vectors;

D) Train the classifier; a Support Vector Machine (SVM) classification model, which offers high classification accuracy, is used for training (the SVM classifier is a classification model widely used in machine learning, and its classification accuracy can exceed 90% when it is sufficiently trained);

E) Save the classification model; save the classifier parameters together with the information required to build document vectors.
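Steps D) and E) can be illustrated with a minimal sketch. The text names SVM as the model but does not say how it is trained; the code below assumes a linear SVM without a bias term, fitted with Pegasos-style sub-gradient steps, and assumes JSON as the storage format. The function names, the tiny English training vectors, and the hyperparameters are all hypothetical.

```python
import json
import random


def score(weights, vector):
    """Dot product between the weight vector and a sparse document vector."""
    return sum(weights.get(term, 0.0) * value for term, value in vector.items())


def train_linear_svm(vectors, labels, lam=0.01, epochs=50, seed=0):
    """Fit a linear SVM (no bias term) with Pegasos-style sub-gradient steps.

    vectors: list of sparse document vectors, each a dict {term: weight}
    labels:  list of +1 (domain-relevant) / -1 (domain-irrelevant)
    """
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(vectors)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)              # decaying learning rate
            x, y = vectors[i], labels[i]
            margin = y * score(weights, x)
            for term in weights:                  # regularisation shrink
                weights[term] *= 1.0 - eta * lam
            if margin < 1.0:                      # hinge loss is active
                for term, value in x.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * value
    return weights


def predict(weights, vector):
    """Return +1 for domain-relevant, -1 for domain-irrelevant."""
    return 1 if score(weights, vector) > 0 else -1


def save_model(weights, vocabulary):
    """Step E: serialise classifier parameters plus vectorisation info."""
    return json.dumps({"weights": weights, "vocabulary": vocabulary})


def load_model(payload):
    return json.loads(payload)


# Hypothetical toy training set (real input would be segmented Chinese text).
TRAIN_VECTORS = [
    {"hotel": 1.0, "room": 1.0},
    {"hotel": 1.0, "booking": 1.0},
    {"football": 1.0, "score": 1.0},
    {"score": 1.0, "news": 1.0},
]
TRAIN_LABELS = [1, 1, -1, -1]

model = train_linear_svm(TRAIN_VECTORS, TRAIN_LABELS)
restored = load_model(save_model(model, vocabulary=sorted(model)))
```

The restored weights can then be passed to `predict` when newly collected pages are classified, which is why the vocabulary is stored alongside the parameters.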

The function of the web page collection and indexing module is to obtain an inverted-list-based full-text index library of domain-relevant web pages, which provides the data source for users' information retrieval; the corresponding flowchart is shown in Fig. 2.

The concrete steps are as follows:

A) Collect web pages on a massive scale with the web spider; a number of specific web pages are set as the spider's initial search pages, the spider reads the content of these start pages and extracts the hyperlink addresses they contain, then follows those addresses to find the next pages, and the cycle repeats until a termination condition is triggered and collection stops;

B) Preprocess the collected pages and build document vectors under the vector space model; the operations comprise removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, removing stop words, and building the document vectors (this requires the information saved during training for building the document vector matrix);

C) Classify the collected pages; the SVM classification model built by the classifier training module classifies the document vectors, pages judged "domain-irrelevant" are discarded, and only pages judged "domain-relevant" are kept;

D) Build the full-text index library; a full-text index library based on inverted-list technology is constructed to store the "domain-relevant" pages; to cope with storing massive amounts of information, the open-source full-text indexing engine Lucene is used to build the index database.
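Steps C) and D) can be sketched as a filter followed by an inverted-list build. The method itself relies on Lucene for the index; the code below is only a toy in-memory stand-in that shows the data structure, not Lucene's API. The page identifiers, the tokens, and the `is_relevant` callback (a placeholder for the trained SVM) are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its posting list: {term: {doc_id: term frequency}}.

    docs: {doc_id: [tokens]} for pages that have already been preprocessed.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def index_relevant_pages(pages, is_relevant):
    """Discard pages the classifier rejects, then index the rest."""
    kept = {doc_id: tokens for doc_id, tokens in pages.items() if is_relevant(tokens)}
    return build_inverted_index(kept)


# Hypothetical collected pages, already tokenised.
PAGES = {
    "u1": ["hotel", "room", "hotel"],
    "u2": ["football", "score"],
    "u3": ["hotel", "booking"],
}

# Placeholder for the SVM decision; a real system would score document vectors.
index = index_relevant_pages(PAGES, is_relevant=lambda tokens: "hotel" in tokens)
```

Because only pages judged relevant reach the index, every later query is answered from domain pages alone, which is the property the method depends on for its hit rate.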

The function of the information search module is to provide users with an interface for domain-specific information retrieval, which queries the domain-relevant pages from the full-text index library and presents them to the user; the corresponding flowchart is shown in Fig. 3.

The concrete steps of this module are as follows:

A) The user provides a query condition; the user enters the query condition in the web query interface provided by the system;

B) Analyze and process the query condition; this comprises splitting the query condition (word segmentation) and analyzing combined conditions using "NOT", "AND", and "OR", finally yielding a processed query condition;

C) Full-text retrieval; pages matching the given query condition are looked up in the full-text index library; the query modes include traditional keyword matching and semantic query based on synonym expansion;

D) Present the results; the retrieved pages are sorted by relevance and presented to the user as a list on a web page.
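Steps B) to D) can be sketched against a small inverted index. The query syntax (uppercase `AND`, `OR`, `NOT`, with adjacent terms implicitly ANDed), the synonym table, and the ranking by summed term frequency are assumptions made for illustration; the text does not specify the relevance formula, and Chinese query segmentation is left out.

```python
def parse_query(query):
    """Split a query into OR-groups of (required terms, excluded terms)."""
    groups = []
    for part in query.split(" OR "):
        required, excluded = [], []
        negate = False
        for token in part.split():
            if token == "NOT":
                negate = True
            elif token == "AND":
                continue
            else:
                (excluded if negate else required).append(token.lower())
                negate = False
        groups.append((required, excluded))
    return groups


def search(index, query, synonyms=None):
    """Return matching doc ids, highest summed term frequency first."""
    synonyms = synonyms or {}
    scores = {}
    for required, excluded in parse_query(query):
        if not required:
            continue
        candidates = None
        group_scores = {}
        for term in required:
            postings = {}
            # Synonym expansion: a document matches if it holds any variant.
            for variant in [term] + synonyms.get(term, []):
                for doc_id, tf in index.get(variant, {}).items():
                    postings[doc_id] = postings.get(doc_id, 0) + tf
            matched = set(postings)
            candidates = matched if candidates is None else candidates & matched
            for doc_id, tf in postings.items():
                group_scores[doc_id] = group_scores.get(doc_id, 0) + tf
        for term in excluded:
            candidates -= set(index.get(term, {}))
        for doc_id in candidates:
            scores[doc_id] = max(scores.get(doc_id, 0), group_scores[doc_id])
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical index: {term: {doc_id: term frequency}}.
INDEX = {
    "hotel": {"u1": 2, "u3": 1},
    "inn": {"u4": 1},
    "room": {"u1": 1},
    "booking": {"u3": 1},
}

plain = search(INDEX, "hotel")
excluded = search(INDEX, "hotel NOT booking")
expanded = search(INDEX, "hotel", synonyms={"hotel": ["inn"]})
```

Here `expanded` also returns the page that mentions only "inn", which is the effect the synonym-based semantic query is meant to achieve.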

Compared with the prior art, the present invention has the following evident substantive features and notable advantages: the specialized search method provided by the invention uses a relatively mature text classification model to identify the pages on the Internet that actually contain domain-specific information, builds a full-text index library based on an inverted list to store these pages, and provides the user with a specialized search interface based on full-text retrieval. Compared with general-purpose search methods, the method of the invention has a higher search hit rate, less duplicate information, and ranks relevant information higher.

Description of drawings

Fig. 1 is a block diagram of the information acquisition training module.

Fig. 2 is a block diagram of the information acquisition and storage module.

Fig. 3 is a block diagram of the information search module.

Fig. 4 is a view of the search page.

Fig. 5 is the search result list of the embodiment of the invention.

Fig. 6 is the Baidu search result list.

Fig. 7 is the Google search result list.

Embodiment

A preferred embodiment of the present invention takes the hotel domain as an example: we developed a hotel information search method. The project was put into commercial operation as a submodule of the hotel search engine website (http://www.hotelgoogle.com.cn/) funded by GreenTree Inn Hotel Shanghai Co., Ltd. The implementation flow of the hotel information search method is described below.

In this specialized search method based on automatic classification technology, a web spider first collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant, and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier; next, the web spider collects domain-relevant pages on a large scale, the previously built classifier judges whether each page is relevant to the domain, and a full-text index library based on an inverted list is built to store these relevant pages; finally, a retrieval interface is provided so that users can conveniently query domain-relevant pages from the full-text index library. The method comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.

Referring to Fig. 1, the classifier training module:

Google was searched with the keyword "hotel" and the result pages were collected by the web spider, yielding more than 6,000 pages in total; after removing pages with garbled text or dead links, more than 4,000 pages were finally selected.

These 4,000-plus pages were judged manually and divided into a "hotel-relevant" class and a "hotel-irrelevant" class; pages in the "hotel-relevant" class were labeled 1 and pages in the "hotel-irrelevant" class were labeled -1. After labeling, the pages were preprocessed: removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, removing stop words, building document vectors, and so on. This yields a training matrix that serves as the training set. Finally, an SVM was trained on the training set, and the resulting classification model was saved.
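The preprocessing chain described here can be sketched with the Python standard library. Several parts are assumptions: a regular-expression tokenizer stands in for Chinese word segmentation, the stop-word list is a hypothetical English one, and TF-IDF is chosen as the term weighting because the text only says that document vectors are built under the vector space model.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(page):
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.parts)


STOP_WORDS = {"the", "a", "of", "and", "in"}  # hypothetical stop-word list


def tokenize(text):
    """Stand-in for Chinese word segmentation: lowercase word tokens."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def build_training_matrix(pages, labels):
    """Return sparse TF-IDF vectors, the labels, and the IDF table to save."""
    docs = [tokenize(strip_html(page)) for page in pages]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, list(labels), idf


# Hypothetical labeled pages: 1 = hotel-relevant, -1 = hotel-irrelevant.
PAGES = [
    "<html><body><h1>Hotel rooms</h1><script>var x = 1;</script></body></html>",
    "<p>Football score of the day</p>",
]
vectors, labels, idf = build_training_matrix(PAGES, [1, -1])
```

The returned `idf` table is the kind of information that has to be kept with the model, since pages collected later must be vectorized with the same vocabulary and weights.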

Referring to Fig. 2, the information acquisition and storage module:

A number of relatively well-known domestic hotel information websites, such as Ctrip and eLong Travel, were selected, and their pages were used as the initial search pages. The web spider reads the content of these start pages and extracts the hyperlink addresses they contain, then follows those addresses to find the next pages; the cycle repeats until the maximum duration of a collection run or the maximum number of pages to fetch is reached.
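The collection loop with its two stop conditions can be sketched as a breadth-first traversal. To keep the example runnable without network access, pages come from a `fetch` callback backed by a hypothetical in-memory site; the link pattern, the parameter names, and the default limits are assumptions, and resolving relative URLs is omitted.

```python
import re
import time
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, max_pages=1000, max_seconds=60.0):
    """Follow links breadth-first from the seed pages.

    Stops when the queue is empty, when max_pages pages have been stored,
    or when max_seconds have elapsed, whichever happens first.
    """
    deadline = time.monotonic() + max_seconds
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages and time.monotonic() < deadline:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # dead link: skip it
            continue
        pages[url] = html
        for link in LINK_RE.findall(html):
            if link not in seen:  # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical four-page site used instead of real HTTP requests.
FAKE_WEB = {
    "a": '<a href="b">b</a><a href="c">c</a>',
    "b": '<a href="a">a</a><a href="d">d</a><a href="missing">x</a>',
    "c": "",
    "d": "",
}

limited = crawl(["a"], FAKE_WEB.get, max_pages=3)
complete = crawl(["a"], FAKE_WEB.get)
```

In a real deployment `fetch` would issue HTTP requests and decode the response, and the two limits would be set per collection run.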

The collected pages were preprocessed to obtain document vectors, and the SVM classification model built by the classifier training module classified those vectors; pages judged "hotel-irrelevant" were discarded, and only pages judged "hotel-relevant" were kept. The open-source full-text indexing tool Lucene was used to build an index database over the retained pages.

Referring to Fig. 3, the information search module:

The user opens the search page and searches with the keyword "GreenTree Inn"; see Fig. 4.

Clicking the search button brings up the search result page; see Fig. 5.

For comparison, the same keyword "GreenTree Inn" was searched on Baidu and Google. Their search result pages are shown in Fig. 6 and Fig. 7, respectively.

From the standpoint of the user's need (the user wants specific information about the "GreenTree Inn" hotel), the present invention found 8 pages, Baidu found 1,300,000 pages, and Google found 484,000 pages. Considering users' search habits and for ease of comparison, only the first page of results is compared here.

Tables 2, 3, and 4 list the search result statistics for the present invention, Baidu, and Google, respectively. They show that the hit rate of the present invention is 75%, with 1 duplicate result; the hit rate of Baidu is 30%, with no duplicates; and the hit rate of Google is 70%, with 2 duplicates. In terms of result order, the hotel-relevant results appear at positions 1-5 and 8 for the present invention, at positions 4, 8, and 9 for Baidu, and at positions 3-9 for Google.
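The reported hit rates can be checked against the listed result positions. The sketch assumes that the hit rate is the number of relevant results divided by the number of results on the first page, and that the first page holds 10 results for Baidu and Google; the 10 is inferred from the percentages and is not stated in the text, whereas the 8 results for the present invention are stated.

```python
def hit_rate(relevant_positions, results_on_page):
    """Fraction of first-page results that are relevant to the query."""
    return len(relevant_positions) / results_on_page


# (positions of relevant results, number of results on the first page)
FIRST_PAGE = {
    "present invention": (list(range(1, 6)) + [8], 8),  # positions 1-5 and 8
    "Baidu": ([4, 8, 9], 10),                           # 10 is an assumption
    "Google": (list(range(3, 10)), 10),                 # positions 3-9
}

rates = {name: hit_rate(pos, n) for name, (pos, n) in FIRST_PAGE.items()}
first_relevant = {name: min(pos) for name, (pos, _) in FIRST_PAGE.items()}
```

Under these assumptions the ratios are 6/8, 3/10, and 7/10, matching the stated 75%, 30%, and 70%, and the first relevant result sits at position 1, 4, and 3 respectively.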

As the above shows, compared with general-purpose search methods such as Baidu and Google, the present invention has a higher search hit rate, less duplicate information, and ranks relevant information higher.

Claims

1. A specific information searching method based on automatic classification technology, characterized in that: first, a web spider collects a number of typical web pages to form a training document set, the pages in the training set are manually labeled as domain-relevant or domain-irrelevant pages, and a machine learning algorithm is then trained on the training set to obtain an automatic web page classifier; next, the web spider collects pages relevant to the specific domain on a large scale, the previously built classifier judges whether each page is relevant to the domain, and a full-text index library based on an inverted list is built to store these relevant pages; finally, a retrieval interface is provided so that users can conveniently query domain-relevant pages from the full-text index library; the method comprises the following three modules: a classifier training module, a web page collection and indexing module, and an information search module.

2. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the classifier training module is to obtain an automatic classification module that can determine whether a web page is "domain-relevant", and the concrete steps are as follows:

A) Collect a number of representative training web pages with the web spider;

B) Label the web pages manually; personnel familiar with the domain divide the pages into two classes, "domain-relevant pages" and "domain-irrelevant pages";

C) Preprocess the web pages and build a training document matrix based on the vector space model; the processing operations comprise: removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, removing stop words, and building the document vectors;

D) Train the classifier; an SVM classification model, which offers high classification accuracy, is used for the classifier training;

3. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the web page collection and indexing module is to obtain an inverted-list-based full-text index library of domain-relevant web pages, which provides the data source for users' information retrieval; the concrete steps are as follows:

B) Preprocess the collected pages and build document vectors under the vector space model; the operations comprise removing HTML markup, removing irrelevant information from the page, Chinese word segmentation, removing stop words, and building the document vectors;

4. The specific information searching method based on automatic classification technology according to claim 1, characterized in that the function of the information search module is to provide users with an interface for domain-specific information retrieval, which queries the domain-relevant pages from the full-text index library and presents them to the user; the concrete steps are as follows:

B) Analyze and process the query condition; this comprises splitting the query condition and analyzing combined conditions using "NOT", "AND", and "OR", finally yielding a processed query condition;