CN103617174A - Distributed searching method based on cloud computing - Google Patents

Info

Publication number
CN103617174A
Authority
CN
China
Prior art keywords
distributed
page
cloud computing
index
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310536651.8A
Other languages
Chinese (zh)
Inventor
向阳
陈佑雄
张依杨
平宇
张波
袁书寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201310536651.8A
Publication of CN103617174A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed search method based on cloud computing. The method comprises: crawling network files of various formats with a distributed web crawler; parsing the crawled files through distributed parallel extraction into a user-defined document table format; storing the extracted document content in a distributed database to build a document table database; building an index table from the document table database, again using parallel computing, with the index table format also user-defined; importing the index files into an index database to provide index data for the searcher; and applying PageRank and an optimized online ranking algorithm to the search results. The method takes advantage of distributed storage and computing; the improved and optimized ranking algorithm makes the search results more accurate, and semantic keyword expansion makes the search results richer.

Description

A distributed search method based on cloud computing
Technical field
The present invention relates to a distributed search method, and in particular to a cloud-computing-based distributed search method for fast retrieval over big data.
Background art
With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with such a large volume of information, browsing step by step in a browser is very inconvenient, so obtaining the required information from the WWW quickly and accurately has become a vital problem. The emergence of search engines has greatly improved people's ability to gather information. However, existing search engines still have many problems in areas such as search efficiency, information maintenance, duplicated information, and network and website load.
At present, most search engines are architecturally centralized. Pages are fetched from the Internet, analyzed and processed, and all index information is then stored centrally at a single site; users run queries by accessing that site. These engines usually do not cooperate with one another; each searches and processes information independently, which causes a large amount of duplicated work and serious waste of bandwidth, and sometimes even network congestion. This architecture has difficulty adapting to the ever-growing scale of the network, and the industry has successively proposed strategies for building distributed search engines.
A traditional search engine, that is, a general-purpose search engine, can provide users with a large number of search results. However, in pursuing a larger volume of returned information, general-purpose search engines have difficulty maintaining the accuracy and relevance of the results, which leads to problems such as low web page coverage and untimely information updates. Traditional search engines suffer from limited coverage, low precision, and poor relevance to the user. In addition, industry users have information needs that are relatively concentrated and require finer classification, for which general-purpose search engines provide insufficient guidance.
The limitations of traditional search engines, which lack personalization, are specifically as follows:
(1) Massive network data: the volume of network information is large and its coverage is wide, so computing over and storing these data consumes a great deal of time and storage space.
(2) User differences: users have different background knowledge and understand word meanings differently, so different users have different intentions for the same search term.
(3) Retrieval and time: a user who submits the same query at different times or stages still receives the same results; the engine does not adapt to the user.
(4) Expression of search terms: because users lack domain knowledge and the query interface of a search engine is limited, the user's search intent cannot be captured accurately.
Therefore, how to let users conveniently obtain the information they need from massive search results has become a problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a cloud-computing-based distributed search method with more accurate retrieval results.
To solve the above technical problem, the invention provides a distributed search method based on cloud computing, comprising the following steps:
Step (1): crawl network files of multiple formats, including HTML, PPT, EXCEL, and PDF documents, with a distributed web crawler;
Step (2): parse the files fetched by the crawler through distributed parallel extraction, and extract relevant information such as the body text, title, and author into a user-defined document table format;
Specifically: URL + title + parsing time + author + source + text + pr value + classification + link.
Where: url is the web page link; title is the web page title; parsing time is the date on which the page was parsed; author is the web page author, with initial value "unknown"; source is the source of the web document, with initial value "unknown"; text is the body content of the page after HTML tags are removed; pr value is the PageRank value, which defaults to 1; classification is the category of the web page, which defaults to 0; and link is the set of links that the page points to, matched and filtered by a regular expression and joined with spaces.
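The document table record described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `DocumentRecord`, `extract_record`, `LINK_RE`, and `TAG_RE` are hypothetical, and the link pattern is an assumed one, since the text only says that links are matched by a regular expression.

```python
import re
from dataclasses import dataclass, field
from datetime import date

# Hypothetical patterns: the source does not give the actual regular expressions.
LINK_RE = re.compile(r"""href=["'](https?://[^"']+)["']""", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


@dataclass
class DocumentRecord:
    """One row of the document table, with the defaults named in the text."""
    url: str
    title: str
    text: str                      # page content with HTML tags removed
    links: str = ""                # outgoing links joined by single spaces
    parse_time: str = field(default_factory=lambda: date.today().isoformat())
    author: str = "unknown"        # initial value per the description
    source: str = "unknown"        # initial value per the description
    pr: float = 1.0                # PageRank value, defaults to 1
    category: int = 0              # classification, defaults to 0


def extract_record(url: str, html: str) -> DocumentRecord:
    """Build a document record from raw HTML (simplified extraction)."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    links = " ".join(LINK_RE.findall(html))
    # Replace tags with spaces, then collapse runs of whitespace.
    text = " ".join(TAG_RE.sub(" ", html).split())
    return DocumentRecord(url=url, title=title, text=text, links=links)
```

In this sketch the tag stripping is deliberately naive; handling PPT, EXCEL, or PDF input, as the method requires, would need format-specific parsers that are outside the scope of the example.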
Step (3): store the extracted document content in a distributed database to build the document table database;
Step (4): build an index table from the document table database, again using parallel computing; the index table format is also user-defined;
Specifically: keyword + "\007" + url + "\t" + word frequency + "\t" + pr + "\t" + type.
Where: keyword is the term of the inverted index; url is the link of the document; word frequency is the number of times the keyword occurs in the document; pr value is the PageRank value of the document; time is the parsing time; type is the document classification.
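As a rough illustration of the index record layout, the sketch below builds and parses index lines in Python. It assumes that the separators printed as 007 and t in the format stand for the control character `\007` and the tab character, which the extracted text does not confirm. The function names are hypothetical, and splitting on whitespace is a simplification; Chinese text would need a word segmenter.

```python
from collections import Counter

KEY_SEP = "\x07"   # assumed reading of "007" (the octal escape \007)
FIELD_SEP = "\t"   # assumed reading of "t" (a tab)


def index_lines(url, text, pr=1.0, doc_type=0):
    """Return one inverted-index line per distinct term in the document text."""
    counts = Counter(text.lower().split())
    return [
        f"{term}{KEY_SEP}{url}{FIELD_SEP}{freq}{FIELD_SEP}{pr}{FIELD_SEP}{doc_type}"
        for term, freq in sorted(counts.items())
    ]


def parse_index_line(line):
    """Split an index line back into its named fields."""
    keyword, rest = line.split(KEY_SEP, 1)
    url, freq, pr, doc_type = rest.split(FIELD_SEP)
    return {
        "keyword": keyword,
        "url": url,
        "freq": int(freq),
        "pr": float(pr),
        "type": int(doc_type),
    }
```

Because each document produces its index lines independently, a step like `index_lines` could plausibly run as the map phase of a parallel job, with lines grouped by keyword afterwards; the source does not say which parallel framework is used.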
Step (5): import the index files into the index database to provide index data for the searcher;
Step (6): apply PageRank and an optimized online ranking algorithm to the retrieval results.
The crawling of network files in step (1) comprises the following steps:
1. Set the initial URLs to crawl. Because crawling web page files is a recursive process, the initial page URL is generally set to a navigation (directory) site in order to achieve better whole-web coverage;
2. Fetch the page of a navigation site from step 1; parsing this page yields a large number of website home pages;
3. Continue parsing these home pages to obtain more URLs, and repeat this process.
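The three crawling steps above can be sketched as a breadth-first traversal. The example below is a single-process Python illustration under stated assumptions: it uses a queue rather than literal recursion, it replaces network access with a hypothetical in-memory `TOY_WEB` mapping, and it omits the distribution of work across crawler nodes that the method describes. The `crawl` function and its `fetch_links` callback are illustrative names only.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed (navigation) URLs.

    fetch_links(url) must return the list of URLs found on that page.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:      # skip URLs that are already scheduled
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages.
TOY_WEB = {
    "nav.example": ["a.example", "b.example"],
    "a.example": ["a.example/news", "b.example"],
    "b.example": ["nav.example"],
    "a.example/news": [],
}

order = crawl(["nav.example"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Starting from the navigation page, the sketch reaches both home pages first and then the deeper page, which mirrors the seed-then-expand order of steps 1 to 3.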
The PageRank value used in step (4) is calculated as follows:
R'(u) denotes the similarity, c = 0.85 is the damping factor, B_v refers to the pages under study, N_v is the number of outgoing links of page v, N is the total number of pages, and E(u) is the probability that the user stops clicking and jumps to a new URL. The value is computed as follows:
The concept set is the set of words contained in the document table.
The property set is a set of parameters that describe the features of a word.
The property set comprises the word frequency, the PageRank value, and the positions at which the keyword occurs.
The instance set is the set of search records that contain certain keywords.
The relation set is the set of keywords and index records.
Compared with the prior art, the present invention has the following advantages:
1. It makes good use of distributed storage and computing;
2. It adopts an improved and optimized ranking algorithm, so retrieval results are more accurate;
3. It adopts semantic keyword expansion, so query results are richer.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is the display page of retrieval results of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention provides a distributed search method based on cloud computing, comprising the following steps:
Step (1): crawl network files of multiple formats, including HTML, PPT, EXCEL, and PDF documents, with a distributed web crawler;
Step (2): parse the files fetched by the crawler through distributed parallel extraction, and extract relevant information such as the body text, title, and author into a user-defined document table format;
Specifically: URL + title + parsing time + author + source + text + pr value + classification + link.
Where: url is the web page link; title is the web page title; parsing time is the date on which the page was parsed; author is the web page author, with initial value "unknown"; source is the source of the web document, with initial value "unknown"; text is the body content of the page after HTML tags are removed; pr value is the PageRank value, which defaults to 1; classification is the category of the web page, which defaults to 0; and link is the set of links that the page points to, matched and filtered by a regular expression and joined with spaces.
Step (3): store the extracted document content in a distributed database to build the document table database;
Step (4): build an index table from the document table database, again using parallel computing; the index table format is also user-defined;
Specifically: keyword + "\007" + url + "\t" + word frequency + "\t" + pr + "\t" + type.
Where: keyword is the term of the inverted index; url is the link of the document; word frequency is the number of times the keyword occurs in the document; pr value is the PageRank value of the document; time is the parsing time; type is the document classification.
Step (5): import the index files into the index database to provide index data for the searcher;
Step (6): apply PageRank and an optimized online ranking algorithm to the retrieval results.
The crawling of network files in step (1) comprises the following steps:
1. Set the initial URLs to crawl. Because crawling web page files is a recursive process, the initial page URL is generally set to a navigation (directory) site in order to achieve better whole-web coverage;
2. Fetch the page of a navigation site from step 1; parsing this page yields a large number of website home pages;
3. Continue parsing these home pages to obtain more URLs, and repeat this process.
The PageRank value used in step (4) is calculated as follows:
R'(u) denotes the similarity, c = 0.85 is the damping factor, B_v refers to the pages under study, N_v is the number of outgoing links of page v, N is the total number of pages, and E(u) is the probability that the user stops clicking and jumps to a new URL. The value is computed as follows:
[Equation image: PageRank formula, not reproduced in the extracted text]
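Since the formula itself is not reproduced in this text, the following Python sketch is offered only as an illustration of how a damped PageRank value can be computed iteratively. It assumes the commonly used form R(u) = (1 - c)/N + c * sum of R(v)/N_v over the pages v that link to u, with c = 0.85 as stated above; this may differ from the patent's exact formula, in particular in how E(u) enters. The function name `pagerank`, the even redistribution of rank from pages without outgoing links, and the normalization so that ranks sum to 1 are all assumptions of the sketch.

```python
def pagerank(out_links, c=0.85, iterations=50):
    """Iteratively compute damped PageRank for a small link graph.

    out_links maps each page to the list of pages it links to.
    """
    pages = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: (1.0 - c) / n + c * dangling / n for p in pages}
        for v in pages:
            targets = out_links.get(v, [])
            for u in targets:
                # Page v passes an equal share of its rank to each out-link.
                new_rank[u] += c * rank[v] / len(targets)
        rank = new_rank
    return rank
```

For a three-page graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page `c` receives links from both other pages and ends up ranked above `b`, which only receives half of the rank passed on by `a`.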
The concept set is the set of words contained in the document table.
The property set is a set of parameters that describe the features of a word.
The instance set is the set of search records that contain certain keywords.
The relation set is the set of keywords and index records.
The property set comprises the word frequency, the PageRank value, and the positions at which the keyword occurs.
The technical solution of the present invention is further illustrated below, taking two retrieval examples as examples:
The web data in this case are gathered automatically from the Internet by the distributed web crawler; its customizable, highly scalable scheduling algorithm lets the searcher collect the largest possible amount of Internet information in a very short time. The finance cloud search has servers deployed in multiple locations, and its search scope covers some websites in Chinese-language regions such as mainland China, Hong Kong, Taiwan, Macao, and Singapore, as well as North America and Europe. The search engine covers comprehensive financial web page information; the total currently exceeds 2,000,000 web pages and continues to grow every day.
(1) An example is provided for each output type. For example, with the input "Tongji University", the retrieval output is presented as shown in Fig. 2:
The retrieval results show that this method adopts distributed retrieval combined with semantic expansion of search keywords.
This completes the demonstration of the cloud-computing-based distributed search method in this case. The method can serve as a prototype system for subsequently customized vertical search engines, industry-specific search engines, and other related search engines.

Claims (3)

1. A distributed search method based on cloud computing, the method comprising the following steps:
Step (1): crawl network files of multiple formats with a distributed web crawler;
Step (2): parse the files fetched by the crawler through distributed parallel extraction into a user-defined document table format;
Step (3): store the extracted document content in a distributed database to build the document table database;
Step (4): build an index table from the document table database, again using parallel computing; the index table format is also user-defined;
Step (5): import the index files into the index database to provide index data for the searcher;
Step (6): apply PageRank and an optimized online ranking algorithm to the retrieval results.
2. The distributed search method based on cloud computing according to claim 1, characterized in that the crawling of network files in step (1) comprises the following steps:
1. Set the initial URLs to crawl. Because crawling web page files is a recursive process, the initial page URL is generally set to a navigation (directory) site in order to achieve better whole-web coverage;
2. Fetch the page of a navigation site from step 1; parsing this page yields a large number of website home pages;
3. Continue parsing these home pages to obtain more URLs, and repeat this process.
3. The distributed search method based on cloud computing according to claim 1, characterized in that the PageRank value used in step (4) is calculated as follows:
R'(u) denotes the similarity, c = 0.85 is the damping factor, B_v refers to the pages under study, N_v is the number of outgoing links of page v, N is the total number of pages, and E(u) is the probability that the user stops clicking and jumps to a new URL. The value is computed as follows:
CN201310536651.8A 2013-11-04 2013-11-04 Distributed searching method based on cloud computing Pending CN103617174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310536651.8A CN103617174A (en) 2013-11-04 2013-11-04 Distributed searching method based on cloud computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310536651.8A CN103617174A (en) 2013-11-04 2013-11-04 Distributed searching method based on cloud computing

Publications (1)

Publication Number Publication Date
CN103617174A true CN103617174A (en) 2014-03-05

Family

ID=50167877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310536651.8A Pending CN103617174A (en) 2013-11-04 2013-11-04 Distributed searching method based on cloud computing

Country Status (1)

Country Link
CN (1) CN103617174A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339292A (en) * 2010-07-27 2012-02-01 中国电信股份有限公司 Distributed searching method and system
CN102968465A (en) * 2012-11-09 2013-03-13 同济大学 Network information service platform and search service method based on network information service platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
穆锦佳: "基于Hadoop的分布式爬虫及其实现", 《中国优秀硕士论文全文数据库(信息科技辑)》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156104A (en) * 2015-04-02 2016-11-23 北京奇虎科技有限公司 Crawl the method and device of corporate intranet information
CN105488165A (en) * 2015-11-30 2016-04-13 北京金山安全软件有限公司 Data retrieval method and system based on index database
CN105488166A (en) * 2015-11-30 2016-04-13 北京金山安全软件有限公司 Index establishing method and device
CN108062329A (en) * 2016-11-08 2018-05-22 北京国双科技有限公司 A kind of data lead-in method and device
CN107341274A (en) * 2017-08-31 2017-11-10 郑州云海信息技术有限公司 A kind of full-text search engine and data retrieval method
CN109359173A (en) * 2018-10-24 2019-02-19 南京大学 A kind of search method of judgement document
CN109918558A (en) * 2019-03-14 2019-06-21 云南电网有限责任公司信息中心 A kind of big data acquisition interface and acquisition method based on the technology that crawls
CN110147362A (en) * 2019-04-04 2019-08-20 中电科大数据研究院有限公司 One kind is based on the acquisition of event driven DOC DATA and processing system and its method
CN110110024A (en) * 2019-04-29 2019-08-09 东南大学 A kind of large capacity VCT file importing spatial database method
CN110110024B (en) * 2019-04-29 2021-12-17 东南大学 Method for importing high-capacity VCT file into spatial database
US11681701B2 (en) 2020-05-12 2023-06-20 Coupang Corp. Systems and methods for reducing database query latency
CN113742549A (en) * 2020-05-28 2021-12-03 上海交通大学 Distributed crawler scheduling system and method based on computing resources
CN112380418A (en) * 2020-12-31 2021-02-19 广州智云尚大数据科技有限公司 Data processing method and system based on web crawler and cloud platform
CN113934911A (en) * 2021-10-20 2022-01-14 国网江苏省电力有限公司镇江供电分公司 File crawling and searching method and system
CN113987146A (en) * 2021-10-22 2022-01-28 国网江苏省电力有限公司镇江供电分公司 Dedicated novel intelligence of electric power intranet system of asking for answering
CN113987146B (en) * 2021-10-22 2023-01-31 国网江苏省电力有限公司镇江供电分公司 Dedicated intelligent question-answering system of electric power intranet

Similar Documents

Publication Publication Date Title
CN103617174A (en) Distributed searching method based on cloud computing
CN103049575B (en) A kind of academic conference search system of topic adaptation
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN100401300C (en) Searching engine with automating sorting function
TWI695277B (en) Automatic website data collection method
US20170212899A1 (en) Method for searching related entities through entity co-occurrence
CN101620608A (en) Information collection method and system
CN105022827A (en) Field subject-oriented Web news dynamic aggregation method
CN103559258A (en) Webpage ranking method based on cloud computation
CN103838785A (en) Vertical search engine in patent field
CN103324622A (en) Method and device for automatic generating of front page abstract
CN104199833A (en) Network search term clustering method and device
CN103678412A (en) Document retrieval method and device
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
Richards et al. The Archaeology Data Service and the Archaeotools project: faceted classification and natural language processing
Nikhil et al. A survey on text mining and sentiment analysis for unstructured web data
CN111859065A (en) Big data-based public opinion listening system
US8949254B1 (en) Enhancing the content and structure of a corpus of content
Jin et al. Tise: A temporal search engine for web contents
CN109948015B (en) Meta search list result extraction method and system
CN105808761A (en) Solr webpage sorting optimization method based on big data
CN102508920B (en) Information retrieval method based on Boosting sorting algorithm
Zhang et al. An improved ontology-based web information extraction
Singh et al. User specific context construction for personalized multimedia retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140305

WD01 Invention patent application deemed withdrawn after publication