CN103617174A

CN103617174A - Distributed searching method based on cloud computing

Info

Publication number: CN103617174A
Application number: CN201310536651.8A
Authority: CN
Inventors: 向阳; 陈佑雄; 张依杨; 平宇; 张波; 袁书寒
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-03-05

Abstract

The invention discloses a distributed search method based on cloud computing. The method comprises the following steps: network files of various formats are crawled by a distributed web crawler; the crawled files are parsed by distributed parallel extraction into a user-defined document table format; the extracted document content is stored in a distributed database to build a document table database; an index table, also in a user-defined format, is built from the document table database using parallel computing; the index files are imported into an index database to provide index data for the searcher; and the search results are ranked with PageRank and an optimized online ranking algorithm. The method takes advantage of distributed storage and computation; the improved and optimized ranking algorithm makes the search results more accurate, and semantic keyword expansion makes the search results richer.

Description

A distributed search method based on cloud computing

Technical field

The present invention relates to a distributed search method, and in particular to a cloud-computing-based distributed search method for fast retrieval over big data.

Background art

With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with so much information, browsing step by step through a browser is very inconvenient, and obtaining the needed information from the WWW quickly and accurately has become a vital problem. The emergence of search engines has greatly improved people's ability to gather information. However, existing search engines still have many problems in areas such as search efficiency, information maintenance, duplicated information, and network and website load.

At present, most search engines are architecturally centralized. Pages are fetched from the Internet, analyzed and processed, and all index information is then stored centrally at a single site, which users access to run queries. These search engines generally do not cooperate with one another; each searches and processes information independently, which causes a great deal of repeated work and serious bandwidth waste, and sometimes even network congestion. Such an architecture struggles to keep up with the ever-growing scale of the network, and the industry has therefore proposed strategies for building distributed search engines.

Traditional search engines, i.e. general-purpose search engines, can return a large number of results to the user, but in pursuing more results they find it hard to also ensure the accuracy and relevance of those results, which leads to problems such as low web page coverage and untimely information updates. Traditional search engines thus suffer from limited coverage, low precision and poor relevance to the user. Moreover, industry users have information needs that are relatively concentrated and require finer classification, and general-purpose search engines offer them insufficient guidance.

The lack of personalization in traditional search engines shows up specifically as follows:

(1) Massive network data: the volume of network information is large and its coverage wide, so computing over and storing these data consumes a great deal of time and storage space.

(2) User differences: users have different background knowledge and understand word meanings differently, so different users have different inclinations for the same search term.

(3) Retrieval and time: a user who issues the same query at different times or stages still gets identical results; the engine does not adapt to the user.

(4) Expression of search terms: because users lack domain knowledge and the query interface of the search engine is limited, the user's search intent cannot be captured accurately.

Therefore, how to let users conveniently obtain the information they need from massive search results has become a problem in urgent need of a solution.

Summary of the invention

The technical problem to be solved by the present invention is to provide a distributed search method based on cloud computing with more accurate retrieval results.

To solve the above technical problem, the invention provides a distributed search method based on cloud computing, comprising the following steps:

Step (1): crawl network files of multiple formats, including HTML, PPT, Excel and PDF documents, with a distributed web crawler;

Step (2): parse the files fetched by the crawler with distributed parallel extraction, extracting relevant information such as the body text, title and author into a user-defined document table format;

Specifically: URL + title + parse time + author + source + text + PR value + category + links.

Where: URL is the link of the web page; title is the web page title; parse time is the date on which the page was parsed; author is the web page author, with initial value "unknown"; source is the origin of the web document, with initial value "unknown"; text is the body content of the page after HTML tags are removed; PR value is the PageRank value, 1 by default; category is the classification of the page, 0 by default; links are the links the page points to, matched and filtered with a regular expression and joined with spaces.
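The document table record described above can be sketched as a small data structure. This is a minimal illustration only: the class name `DocumentRecord`, the function `extract_record`, and the regular expressions are hypothetical and not taken from the source, which states only the field list, the default values, and that links are matched by a regular expression and joined with spaces.

```python
import re
from dataclasses import dataclass, field
from datetime import date

# Hypothetical patterns; the source does not give the actual expressions.
LINK_RE = re.compile(r'href="(http[^"]+)"')
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


@dataclass
class DocumentRecord:
    url: str
    title: str
    text: str                      # body content with HTML tags removed
    links: str = ""                # outgoing links joined with spaces
    parse_time: str = field(default_factory=lambda: date.today().isoformat())
    author: str = "unknown"        # initial value stated in the source
    source: str = "unknown"        # initial value stated in the source
    pr: float = 1.0                # PageRank value, 1 by default
    category: int = 0              # page category, 0 by default


def extract_record(url: str, html: str) -> DocumentRecord:
    """Build one document table record from a fetched HTML page."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    links = " ".join(LINK_RE.findall(html))
    # Drop the title element, strip the remaining tags, collapse whitespace.
    body = TITLE_RE.sub(" ", html)
    text = " ".join(TAG_RE.sub(" ", body).split())
    return DocumentRecord(url=url, title=title, text=text, links=links)
```

Regex-based tag stripping is used here only because the source mentions regular expressions; a real extractor for PPT, Excel or PDF files would need format-specific parsers.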

Step (3): store the extracted document content in a distributed database to build the document table database;

Step (4): build the index table from the document table database, again using parallel computing; the index table format is also user-defined;

Specifically: keyword + "\007" + url + "\t" + word frequency + "\t" + pr + "\t" + type.

Where: keyword is the term of the inverted index; url is the link of the document; word frequency is the number of times the keyword occurs in the document; pr is the PageRank value of the document; time is the parse time; type is the document category.
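The index record layout can be illustrated with a short sketch that emits one line per keyword of a document and parses it back. It assumes that the separators in the format are the control character `\007` and the tab character, and it uses a plain whitespace tokenizer; the function names and the tokenizer are hypothetical, and the parallel execution mentioned in step (4) is not shown.

```python
from collections import Counter

KEY_SEP = "\x07"    # assumption: "007" in the format denotes the \007 control character
FIELD_SEP = "\t"    # assumption: "t" in the format denotes a tab


def build_index_lines(url, text, pr=1.0, doc_type=0):
    """Return one inverted-index line per distinct keyword in the document."""
    counts = Counter(text.lower().split())  # hypothetical whitespace tokenizer
    return [
        f"{kw}{KEY_SEP}{url}{FIELD_SEP}{tf}{FIELD_SEP}{pr}{FIELD_SEP}{doc_type}"
        for kw, tf in sorted(counts.items())
    ]


def parse_index_line(line):
    """Split an index line back into its fields."""
    keyword, rest = line.split(KEY_SEP, 1)
    url, tf, pr, doc_type = rest.split(FIELD_SEP)
    return {"keyword": keyword, "url": url, "tf": int(tf),
            "pr": float(pr), "type": int(doc_type)}
```

In a parallel setting, `build_index_lines` would play the role of the per-document map step, with lines for the same keyword grouped afterwards; that grouping is an assumption about how the parallel build could be organized, not something the source specifies.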

Step (5): import the index files into the index database to provide index data for the searcher;

Step (6): rank the retrieval results with PageRank and an optimized online ranking algorithm.

Crawling the network files in step (1) comprises the following sub-steps:

1. Set the initial URL to crawl. Because crawling web pages is a recursive process, the initial page URL is generally set to a navigation site so as to obtain better coverage of the whole web;

2. Fetch the page of the navigation site from sub-step 1 and parse it to obtain a large number of website home pages;

3. Parse these home pages to obtain further URLs, then repeat the process.
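The three sub-steps above amount to a frontier-based traversal that starts from a navigation page. The following single-process sketch shows the idea without touching the network: `fetch` is a caller-supplied function, and `FAKE_WEB`, the link pattern and the `max_pages` limit are all hypothetical. A distributed crawler would additionally partition the frontier across nodes, which is not shown.

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')  # hypothetical link pattern


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from a navigation page.

    fetch maps a URL to its HTML string, or returns None on failure.
    """
    seen = {seed_url}
    frontier = deque([seed_url])
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        # Newly discovered links are queued and parsed in turn (sub-steps 2 and 3).
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Toy in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://nav.example/": '<a href="http://site1.example/"></a><a href="http://site2.example/"></a>',
    "http://site1.example/": '<a href="http://site1.example/news"></a>',
    "http://site2.example/": '<a href="http://nav.example/"></a>',
    "http://site1.example/news": "",
}
pages = crawl("http://nav.example/", FAKE_WEB.get)
```

An explicit queue replaces literal recursion so that deep link chains cannot exhaust the call stack, and the `seen` set prevents refetching pages that link back to the navigation site.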

The PageRank value in step (4) is calculated as follows:

R'(u) denotes the similarity; c = 0.85 is the damping factor; B_v refers to the page under study; N_v is the number of outgoing links of page v; N refers to all pages; and E(u) is the probability that the user stops clicking and jumps to a new URL. The calculation method is as follows:
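The formula itself does not survive in this text. As a hedged illustration only, the sketch below uses the standard damped PageRank iteration, R(u) = c * sum of R(v) / N_v over pages v linking to u, plus (1 - c) * E(u), and assumes a uniform jump probability E(u) = 1/N. Both the exact form and the uniform E(u) are assumptions, and the patent's own formula may differ; the function name `pagerank` and its parameters are hypothetical.

```python
def pagerank(out_links, c=0.85, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v linking to u) + (1 - c) / N.

    out_links maps each page to the list of pages it links to.
    Assumes a uniform jump probability E(u) = 1 / N.
    """
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - c) / n for p in pages}
        for v in pages:
            targets = out_links.get(v, [])
            if targets:
                # Page v passes an equal share of its rank to each outgoing link.
                share = c * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = c * rank[v] / n
                for u in pages:
                    new_rank[u] += share
        rank = new_rank
    return rank
```

The ranks here are normalized to sum to 1, whereas the document table stores a default PR value of 1 per page; rescaling by N converts between the two conventions.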

The concept set is the set of words contained in the document table.

The property set is a set of parameters that describe the features of a word.

The property set comprises the word frequency, the PageRank value and the position at which the keyword occurs.

The example set is the set of search records that a given keyword contains.

The relation set is the set of keywords and index records.
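One way to picture the four sets is to derive them from a list of index records. The sketch below is an interpretation under stated assumptions: the tuple layout `(keyword, url, tf, pr, position)`, the function name `build_sets`, and the choice of containers are hypothetical, since the source defines the sets only in prose.

```python
from collections import defaultdict


def build_sets(index_records):
    """Derive the four sets from (keyword, url, tf, pr, position) tuples."""
    concepts = set()               # concept set: the words themselves
    properties = {}                # property set: parameters describing each word occurrence
    examples = defaultdict(list)   # example set: search records per keyword
    relations = set()              # relation set: (keyword, index record) pairs
    for keyword, url, tf, pr, position in index_records:
        concepts.add(keyword)
        properties[(keyword, url)] = {"tf": tf, "pr": pr, "position": position}
        examples[keyword].append(url)
        relations.add((keyword, url))
    return concepts, properties, dict(examples), relations
```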

Compared with the prior art, the present invention has the following advantages:

1. It makes good use of distributed storage and computation;

2. It adopts an improved and optimized ranking algorithm, so retrieval results are more accurate;

3. It adopts semantic keyword expansion, so query results are richer.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is the display page of retrieval results of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the invention provides a distributed search method based on cloud computing, comprising the following steps:

[Figure: image 2013105366518100002DEST_PATH_IMAGE002, not reproduced]

The concept set is the set of words contained in the document table.

The property set is a set of parameters that describe the features of a word.

The example set is the set of search records that a given keyword contains.

The relation set is the set of keywords and index records.

The technical solution of the present invention is further illustrated below, taking two retrieval examples as examples:

The web data in this case are gathered automatically from the Internet by the distributed web crawler; its customizable, highly scalable scheduling algorithm lets the searcher collect the largest possible amount of Internet information in a very short time. The finance cloud search has servers deployed in multiple locations, and its search scope covers part of the websites in Chinese-language regions such as mainland China, Hong Kong, Taiwan, Macao and Singapore, as well as in North America and Europe. The search engine covers a comprehensive range of financial web pages, currently totaling more than 2,000,000 pages and still growing every day.

(1) An example is provided for each output type. For instance, when the input query "Tongji University" is retrieved, the output results are presented as shown in Fig. 2:

The retrieval results show that this method uses distributed retrieval technology combined with semantic expansion of search keywords.

This concludes the case demonstration of the distributed search method based on cloud computing. The method can serve as a prototype system for subsequently customized vertical search engines, industry-specific search engines and other related search engines.

Claims

1. A distributed search method based on cloud computing, comprising the following steps:

Step (1): crawl network files of multiple formats with a distributed web crawler;

Step (2): parse the files fetched by the crawler with distributed parallel extraction, the extracted output being in a user-defined document table format;

2. The distributed search method based on cloud computing according to claim 1, characterized in that crawling the network files in step (1) comprises the following steps:

3. The distributed search method based on cloud computing according to claim 1, characterized in that the PageRank value in step (4) is calculated as follows:
