CN103218443A

CN103218443A - Blogging webpage retrieval system and retrieval method

Info

Publication number: CN103218443A
Application number: CN2013101417845A
Authority: CN
Inventors: 罗笑南; 曾金龙; 林格
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2013-04-22
Filing date: 2013-04-22
Publication date: 2013-07-24

Abstract

The invention discloses a retrieval system and retrieval method for blog web pages. The system comprises an information extraction module, a data preparation module, an indexing module and a retrieval module. The information extraction module crawls web pages relevant to a blog topic; the data preparation module performs structured information extraction and de-duplication on the initial web pages acquired by the information extraction module; the indexing module builds an index over the data extracted by the data preparation module; and the retrieval module provides a retrieval interface for users, retrieves according to the index and sorts the retrieval results. Data storage is mapped across a plurality of servers by a hash-and-modulo mapping method, which helps keep the load balanced across the storage servers. A blog topic relevance measure is used so that web pages relevant to the blog topic searched by the user are returned, which can improve search accuracy and eliminate web pages irrelevant to the topic.

Description

A web page search system and method for blog web pages

Technical field

The present invention relates to the technical field of web search, and in particular to a web page search system and method for blog web pages.

Background technology

Over the past few years, internet search engines have achieved great success, and Google, a company built on its search engine, has reaped huge returns: its advertising revenue exceeds 100 million US dollars per day. The search engine market in China has also become lively because of the rivalry between 360 and Baidu, and more and more companies are entering the search engine contest because a search engine, like a browser, is an entry point to the internet. However, each company keeps its core technology strictly confidential, so its implementation cannot be known from outside. The performance of current search engines also varies, each having its strengths and weaknesses in different environments.

At present, traditional internet search engines do not serve mobile environments well, and in specialized fields, general-purpose search engines such as Baidu and Google are not the best choice; there is still considerable room to improve search accuracy. In particular, no search engine has so far been developed specifically for blog systems, and the retrieval and re-ranking of web pages related to a blog topic remain unsatisfactory.

Current search engines cannot retrieve and re-rank by related topic according to the characteristics of a blog system. Instead, as with ordinary web pages, they crawl pages by following URL links to a controlled depth, so some web pages unrelated to the topic are also presented to the user. Moreover, they measure page relevance only by term frequency or by the degree of match of single words, which does not truly reflect the blog topic.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a web page search system and method for blog web pages that can return web pages relevant to the blog topic searched by the user, effectively improve search accuracy, and remove web pages unrelated to the topic.

To solve the above problem, the present invention proposes a web page search system for blog web pages, the system comprising:

an information extraction module, configured to crawl web pages related to a blog topic;

a data preparation module, configured to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module;

an index module, configured to build an index over the data extracted by the data preparation module;

a retrieval module, configured to provide a user search interface, retrieve according to the index, and sort the retrieval results.

Preferably, the system further comprises a web page database for storing downloaded web pages and the processed data.

Preferably, the system further comprises a system interface, wherein the system interface comprises an internet web page interface and a user search entry.

Preferably, the data stored in the web page database comprises web page data and dictionary index data; wherein the web page data comprises: web page number, uniform resource locator (URL), title, summary and web page size; and the dictionary index data comprises: the words in the Chinese lexicon, English words, and the queue of web page numbers corresponding to each word.

Correspondingly, an embodiment of the present invention also discloses a web page search method for blog web pages, the method comprising:

crawling web pages related to a blog topic;

performing structured information extraction and web page de-duplication on the crawled initial web pages;

building an index over the extracted data;

retrieving according to the index, and sorting the retrieval results.

Preferably, the step of building an index over the extracted data comprises:

calling the addDocument method of IndexWriter;

creating a Document object;

adding and naming each field Segment in the created Document object;

calling the addDocument method of DocumentWriter to add the document to the index;

saving the Segment information.

Preferably, after the step of crawling web pages related to a blog topic, the method further comprises: filtering the crawled web pages.

Preferably, the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog topic; and deleting the web pages with lower relevance.

Preferably, the step of evaluating the relevance of the crawled web pages to the blog topic comprises:

extracting and weighting keywords from a subset of pages that describe the topic, to obtain the topic feature vector and the weights corresponding to that vector;

performing word segmentation on the body text of the page, removing stop words and keeping the keywords that are needed;

performing word segmentation on the page title, merging the resulting keywords with the keywords from the page body, and applying additional weight to the title keywords;

adjusting and expanding the keywords of the page according to the topic feature vector;

calculating the similarity sim(D, Di) between the page and the topic, where D is the topic and Di is the page to be compared;

comparing the value of sim(D, Di) with a threshold d: if sim(D, Di) is greater than or equal to d, the page is relevant to the topic and is kept in the topic page store; otherwise the page is deleted.

Preferably, the step of retrieving according to the index and sorting the retrieval results comprises:

performing word segmentation on the search terms;

finding the web page ID queue of each word by hash;

applying the logical operations "AND", "OR" and "NOT" to the web page ID queues that were found, to obtain the final queue of search result web page IDs;

sorting the results by weight in descending order;

for keywords that are similar to the segmented keywords and have higher weight, displaying the similar keywords of the first segmentation result;

completing pagination, calculating the web page ID queue that will finally be displayed, finding the stored contents of the relevant web page structures by these web page IDs, and displaying the search results to the user.

In the embodiments of the present invention, data storage is mapped across a plurality of servers by a hash-and-modulo mapping method, which helps keep the load balanced across the storage servers. The blog topic relevance measure adopted can return web pages relevant to the blog topic searched by the user, which can effectively improve search accuracy and remove web pages unrelated to the topic.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of the web page search system for blog web pages according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of distributed mapped storage of the index by the hash-and-modulo method in an embodiment of the present invention;

Fig. 3 is a schematic flowchart of the web page search method for blog web pages according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the index construction principle in an embodiment of the present invention;

Fig. 5 is a schematic diagram of index construction using Lucene in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the topic relevance determination flow for a page in an embodiment of the present invention;

Fig. 7 is a schematic diagram of the retrieval processing flow in an embodiment of the present invention.

Embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Fig. 1 is a schematic structural diagram of the web page search system for blog web pages according to an embodiment of the present invention. As shown in Fig. 1, the system comprises:

an information extraction module 1, configured to crawl web pages related to a blog topic;

a data preparation module 2, configured to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module 1;

an index module 3, configured to build an index over the data extracted by the data preparation module 2;

a retrieval module 4, configured to provide a user search interface, retrieve according to the index, and sort the retrieval results.

The system also comprises a system interface, wherein the system interface comprises an internet web page interface 5 and a user search entry 6.

The system also comprises a web page database (not shown) for storing downloaded web pages and the processed data. The web page database stores web page data and dictionary index data. The web page data comprises: web page number, uniform resource locator (Uniform Resource Locator, URL), title, summary and web page size, as shown in Table 1. The dictionary index data comprises: the words in the Chinese lexicon, English words, and the queue of web page numbers corresponding to each word, as shown in Table 2. The web page number directly identifies a web page; it is unique and must not be repeated. At query time, web page numbers are obtained through the dictionary index, and the related data of each web page is then obtained from the web page data. In search engines this index structure is commonly called an inverted index.

To obtain information about web pages quickly, a fast search mechanism is needed. Indexing is therefore performed with a widely used indexing technology, Lucene, and the web pages on the content server are indexed and saved one by one. The web content server holds the index information describing the web pages; when needed, a search request is sent, the server searches the index, and the location of the web page file can be found quickly. The defined web page information structure is shown in Table 1. This structure is developed on the basis of Lucene, and the table is stored as a document, much like a table in a database. From the field lengths it follows that each web page record occupies 592 bytes (16 + 256 + 56 + 256 + 8).

Table 1 Web page information structure

Field	Description	Length
Index	Web page number	Char16
URL	Storage address on the corresponding machine	Char256
Title	Title	Char56
Abstract	Summary	Char256
Size	Web page size	Char8
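
As an illustration of how the field lengths in Table 1 add up to the stated record size, the following is a minimal Python sketch of a fixed-width record. The use of the `struct` module, the function name `pack_record` and the sample values are assumptions made for illustration; the patent does not specify a serialization format.

```python
import struct

# Field widths in bytes, taken from Table 1: Index, URL, Title, Abstract, Size.
RECORD_FORMAT = "16s256s56s256s8s"


def pack_record(index, url, title, abstract, size):
    """Pack one web page record into a fixed-width byte string (hypothetical layout)."""
    fields = (index, url, title, abstract, size)
    # struct pads short values with NUL bytes and truncates values that are too long.
    return struct.pack(RECORD_FORMAT, *(f.encode("utf-8") for f in fields))


record = pack_record("0000000000000001", "http://example.com/post/1",
                     "Sample post", "A short summary.", "2048")
print(struct.calcsize(RECORD_FORMAT), len(record))  # 592 592
```

A fixed-width layout of this kind makes the byte offset of record n simply n × 592, which is one plausible reason for fixing the field lengths.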

Once the web page information has been obtained, the web pages need to be indexed. Each web page's content is processed with the stored word segmentation algorithm; segmentation uses the maximum-splitting approach so that an index can be built for every relevant word. All web page contents are arranged in descending order according to the ranking algorithm, so the web page index queue of each word is also arranged in descending order. All words in the dictionary are arranged by hash, so that after the query is segmented, the web page result ID queue of each word in the dictionary can be found quickly. The index structure adopted is shown in Table 2.

Table 2 Index structure

Field	Description	Indexed
Key	Keyword after word segmentation	Yes
Index	Number of the web content server	No
Key_Weight	Weight within the same keyword	Yes
Title	Web page title	No
URL	Actual address of the web page	No
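
The dictionary index described above (each word mapped to a queue of web page IDs kept in descending order of weight) can be sketched in Python as follows. This is only an illustrative sketch: the function name `build_inverted_index`, the input shape and the sample weights are assumptions, and a Python dict stands in for the hash-arranged dictionary.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each keyword to a list of (page_id, weight), highest weight first.

    pages: {page_id: {keyword: weight}} (hypothetical input shape).
    """
    index = defaultdict(list)
    for page_id, keywords in pages.items():
        for key, weight in keywords.items():
            index[key].append((page_id, weight))
    # Keep every posting queue in descending order of weight.
    for postings in index.values():
        postings.sort(key=lambda entry: entry[1], reverse=True)
    return dict(index)


pages = {
    1: {"blog": 0.9, "search": 0.4},
    2: {"blog": 0.5},
    3: {"search": 0.8, "index": 0.7},
}
index = build_inverted_index(pages)
print(index["blog"])    # [(1, 0.9), (2, 0.5)]
print(index["search"])  # [(3, 0.8), (1, 0.4)]
```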

With the storage structure of Table 2 and the efficiency of Lucene, sorting and searching can be implemented conveniently. In this system, after the web crawler fetches a web page, the resource parser parses the page content in real time. After the page title, page body and so on are extracted, part of the content undergoes Chinese word segmentation, and the data is then recorded in a Document in the form of Field objects. An index can consist of multiple Documents, and a Document can contain several Fields, which only need to be given different FieldNames. When a user sends a retrieval request, the retrieval is in fact performed on specific Field objects contained in specific Document objects in the index.

The index storage scheme adopted by the present invention is distributed: the index is stored across N machines, on the one hand for scalability and on the other to improve retrieval efficiency. The specific embodiment adopts horizontal database partitioning, which means that data of the same type is distributed to different servers. The partitioning rule is based on the search key: a hash function is applied to the key value to obtain an integer, the integer is taken modulo the number of servers, and the index is assigned according to the remainder. An example is shown in Fig. 2. With two servers, a value is obtained through Hash(key) and then reduced modulo 2, i.e. Hash(key) % 2, so that the index can be distributed evenly over the two servers.
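
The hash-and-modulo rule above can be sketched in a few lines of Python. The patent does not say which hash function is used, so `zlib.crc32` is chosen here purely as a stand-in that gives the same value on every run; the function name `shard_for` and the sample keys are likewise assumptions.

```python
import zlib
from collections import Counter


def shard_for(key, server_count):
    """Return the server number for a search key, i.e. Hash(key) % server_count."""
    # crc32 is an assumed stand-in for the unspecified Hash(key).
    return zlib.crc32(key.encode("utf-8")) % server_count


# Distribute 1000 hypothetical keywords over two servers and count the load.
keys = [f"keyword{i}" for i in range(1000)]
load = Counter(shard_for(k, 2) for k in keys)
print(load)
```

One consequence of this design is that changing the number of servers changes the remainder for most keys, so adding a machine generally requires redistributing the index.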

Fig. 3 is a schematic flowchart of the web page search method for blog web pages according to an embodiment of the present invention. As shown in Fig. 3, the method comprises:

S301: crawling web pages related to a blog topic;

S302: performing structured information extraction and web page de-duplication on the crawled initial web pages;

S303: building an index over the extracted data;

S304: retrieving according to the index, and sorting the retrieval results.

The present invention uses Lucene to build the index (S303); the principle and process are shown in Fig. 4 and Fig. 5. The details are as follows:

Step 1: start by calling the addDocument method of IndexWriter. The main responsibility of IndexWriter is to add documents to the index, and it provides the main external interface for building an index; go to step 2;

Step 2: create a Document object.

Step 3: add and name each field Segment in the created Document object; go to step 4;

Step 4: call the addDocument method of DocumentWriter to add the document to the index;

Step 5: save the Segment information. If there are multiple Segments, consider whether they should be merged: merge them if necessary, otherwise save them directly.
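
The flow of these steps (build a document from named fields, add it through a writer, save segments and merge them when there are several) can be illustrated with a toy Python analogue. This is not Lucene's API: the classes `Document` and `ToyIndexWriter`, their parameters and the merge rule are all hypothetical and exist only to show the sequence of operations.

```python
class Document:
    """A bag of named fields, loosely mirroring the Document described in the text."""

    def __init__(self):
        self.fields = {}

    def add(self, name, value):
        self.fields[name] = value


class ToyIndexWriter:
    """Buffers documents into small segments and merges them when too many pile up."""

    def __init__(self, docs_per_segment=2, max_segments=2):
        self.docs_per_segment = docs_per_segment
        self.max_segments = max_segments
        self.segments = []  # saved segments, each a list of Documents
        self._buffer = []   # documents not yet written to a segment

    def add_document(self, doc):
        self._buffer.append(doc)
        if len(self._buffer) >= self.docs_per_segment:
            self._flush()

    def _flush(self):
        # Save the buffered documents as a new segment.
        self.segments.append(self._buffer)
        self._buffer = []
        # Assumed merge rule: collapse everything once the segment count is exceeded.
        if len(self.segments) > self.max_segments:
            self.segments = [[d for seg in self.segments for d in seg]]

    def close(self):
        if self._buffer:
            self._flush()


writer = ToyIndexWriter()
for i in range(5):
    doc = Document()
    doc.add("Title", f"Post {i}")
    doc.add("URL", f"http://example.com/post/{i}")
    writer.add_document(doc)
writer.close()
print(len(writer.segments), len(writer.segments[0]))  # 1 5
```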

After S301, the method further comprises:

filtering the crawled web pages.

Further, the step of filtering the crawled web pages comprises:

evaluating the relevance of the crawled web pages to the blog topic;

deleting the web pages with lower relevance.

To improve the accuracy of the collected pages, the relevance of each collected page to the blog topic needs to be evaluated; this is the page filtering technique. Pages whose evaluation result is low (that is, below the set threshold) are deleted, which improves the accuracy of the collected topic pages. The filtering method adopted by this system is a keyword-based vector space model algorithm. The similarity between a page and the topic can be measured by the angle between their vectors: the smaller the angle, the higher the similarity. The topic relevance determination flow for a page is shown in Fig. 6 and proceeds as follows:

Step 1: preprocessing stage. Before collection, keywords are extracted and weighted from a subset of pages that describe the topic, which yields the topic feature vector and the weights corresponding to that vector; go to step 2;

Step 2: perform word segmentation on the body text of the page, remove stop words and keep the keywords that are needed. For keywords that express topic concepts in the topic feature vector, compute their frequency according to the different positions at which they occur in the article, weighting the frequency as needed; go to step 3;

Step 3: perform word segmentation on the page title, merge the resulting keywords with the keywords from the page body, and apply additional weight to the title keywords; go to step 4;

Step 4: adjust and expand the keywords of the page according to the topic feature vector; go to step 5;

Step 5: calculate the similarity sim(D, Di) between the page and the topic, where D is the topic and Di is the page to be compared; go to step 6;

Step 6: compare the value of sim(D, Di) with a threshold d. If sim(D, Di) is greater than or equal to d, the page is relevant to the topic and is kept in the topic page store; otherwise the page is deleted.

The above algorithm yields a double-precision value that represents the topic relevance of a web page. A filtering parameter, also a double-precision value, is set in the system configuration file. If the relevance value of a collected web page is greater than this parameter, the page is considered relevant to the topic and can be indexed and stored.
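
The text measures similarity by the angle between the topic vector and the page vector but does not give a formula. A common choice consistent with that description is cosine similarity, sim(D, Di) = (D · Di) / (|D| |Di|); the Python sketch below assumes that choice. The function names, the keyword weights and the threshold value 0.4 are illustrative assumptions, not values from the patent.

```python
import math


def cosine_similarity(topic, page):
    """Cosine of the angle between two keyword-weight vectors given as dicts."""
    dot = sum(weight * page.get(key, 0.0) for key, weight in topic.items())
    norm_topic = math.sqrt(sum(w * w for w in topic.values()))
    norm_page = math.sqrt(sum(w * w for w in page.values()))
    if norm_topic == 0 or norm_page == 0:
        return 0.0
    return dot / (norm_topic * norm_page)


def filter_pages(topic, pages, d):
    """Keep only the pages whose similarity to the topic is at least the threshold d."""
    return {pid: vec for pid, vec in pages.items()
            if cosine_similarity(topic, vec) >= d}


topic = {"blog": 1.0, "search": 1.0}
pages = {
    1: {"blog": 1.0, "search": 1.0},   # same direction as the topic
    2: {"cooking": 1.0},               # no shared keywords
    3: {"blog": 1.0, "cooking": 1.0},  # partly related
}
kept = filter_pages(topic, pages, d=0.4)
print(sorted(kept))  # [1, 3]
```

Because the weights are non-negative, the similarity lies between 0 and 1, so the threshold d can be read directly as a minimum degree of overlap with the topic.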

The retrieval flow of the present invention is described next. When the user enters a query, the input is segmented into words, matched against the index database, and the corresponding web page data located through the index is returned to the user. As shown in Fig. 7, the flow is as follows:

Step 1: perform word segmentation on the search terms; go to step 2;

Step 2: find the web page ID queue of each word by hash; go to step 3;

Step 3: apply the logical operations "AND", "OR" and "NOT" to the web page ID queues that were found, to obtain the final queue of search result web page IDs; go to step 4;

Step 4: sort the results by weight in descending order; go to step 5;

Step 5: for keywords that are similar to the segmented keywords and have higher weight, display the similar keywords of the first segmentation result; unrelated keywords are not displayed; go to step 6;

Step 6: complete pagination and calculate the web page ID queue that will finally be displayed (internet search pages generally show 10 results per page, so this number is at most 10). Using these web page IDs, find the stored contents of the relevant web page structures and display the search results to the user.
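
Steps 1 to 4 and step 6 of this flow can be sketched in Python as follows. Several simplifications are assumed: whitespace splitting stands in for Chinese word segmentation, a dict stands in for the hash lookup, a page's score is taken as the sum of its weights for the query words, and the similar-keyword display of step 5 is omitted. The names `segment` and `search` and the sample index are hypothetical.

```python
def segment(query):
    """Stand-in for word segmentation: lower-case and split on whitespace."""
    return query.lower().split()


def search(index, all_of=(), any_of=(), none_of=(), page=1, page_size=10):
    """Return one page of page IDs from index = {word: {page_id: weight}}."""
    def ids(word):
        return set(index.get(word, {}))

    # "AND": pages containing every word in all_of.
    hits = set.intersection(*(ids(w) for w in all_of)) if all_of else set()
    # "OR": add pages containing any word in any_of.
    for word in any_of:
        hits |= ids(word)
    # "NOT": drop pages containing any word in none_of.
    for word in none_of:
        hits -= ids(word)

    def score(page_id):
        return sum(index.get(w, {}).get(page_id, 0.0) for w in (*all_of, *any_of))

    # Sort by weight, highest first; ties are broken by page ID.
    ranked = sorted(hits, key=lambda pid: (-score(pid), pid))
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


index = {
    "blog": {1: 0.9, 2: 0.5, 4: 0.3},
    "search": {1: 0.4, 3: 0.8, 4: 0.6},
    "spam": {4: 1.0},
}
print(search(index, all_of=segment("blog search"), none_of=["spam"]))   # [1]
print(search(index, any_of=segment("blog search"), page_size=2))        # [1, 4]
print(search(index, any_of=segment("blog search"), page=2, page_size=2))  # [3, 2]
```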

Those of ordinary skill in the art will understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, or the like.

The web page search system and method for blog web pages provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. For those of ordinary skill in the art, changes may be made to the specific implementation and scope of application in accordance with the idea of the present invention. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A web page search system for blog web pages, characterized in that the system comprises:

2. The web page search system for blog web pages according to claim 1, characterized in that the system further comprises a web page database for storing downloaded web pages and the processed data.

3. The web page search system for blog web pages according to claim 2, characterized in that the system further comprises a system interface, wherein the system interface comprises an internet web page interface and a user search entry.

4. The web page search system for blog web pages according to claim 2, characterized in that the data stored in the web page database comprises web page data and dictionary index data; wherein the web page data comprises: web page number, uniform resource locator (URL), title, summary and web page size; and the dictionary index data comprises: the words in the Chinese lexicon, English words, and the queue of web page numbers corresponding to each word.

5. A web page search method for blog web pages, characterized in that the method comprises:

crawling web pages related to a blog topic;

building an index over the extracted data;

retrieving according to the index, and sorting the retrieval results.

6. The web page search method for blog web pages according to claim 5, characterized in that the step of building an index over the extracted data comprises:

calling the addDocument method of IndexWriter;

creating a Document object;

adding and naming each field Segment in the created Document object;

calling the addDocument method of DocumentWriter to add the document to the index;

saving the Segment information.

7. The web page search method for blog web pages according to claim 5, characterized in that, after the step of crawling web pages related to a blog topic, the method further comprises: filtering the crawled web pages.

8. The web page search method for blog web pages according to claim 7, characterized in that the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog topic; and deleting the web pages with lower relevance.

9. The web page search method for blog web pages according to claim 8, characterized in that the step of evaluating the relevance of the crawled web pages to the blog topic comprises:

10. The web page search method for blog web pages according to claim 8, characterized in that the step of retrieving according to the index and sorting the retrieval results comprises:

performing word segmentation on the search terms;

finding the web page ID queue of each word by hash;

sorting the results by weight in descending order.