CN107220362A

CN107220362A - A web crawler framework for network documents that extracts and indexes URLs and maps them to keywords

Info

Publication number: CN107220362A
Application number: CN201710422522.4A
Authority: CN
Inventors: 张军; 徐苛; 陈晓峰
Original assignee: Shanghai DC Science Co Ltd
Current assignee: Shanghai DC Science Co Ltd
Priority date: 2017-06-08
Filing date: 2017-06-08
Publication date: 2017-09-29

Abstract

The present invention discloses a web crawler framework for network documents that extracts and indexes URLs and maps them to keywords. At the cost of a modest increase in data volume, URLs are indexed by their METAFILE keywords and mapped to the associated keywords, so that network documents can be retrieved semantically by keyword.

Description

A web crawler framework for network documents that extracts and indexes URLs and maps them to keywords

Technical field

The present invention relates to a web crawler framework for network documents that extracts and indexes URLs and maps them to keywords.

Background technology

Current search engines search only text and cannot effectively search multimedia files such as music, pictures, and video. The main reasons are that the volume of multimedia data is too large, that it is unclear how to index multimedia files, and that it is unclear how to then retrieve the processed multimedia documents. A large number of multimedia files are now on the Internet, particularly with the rise of social networking sites and multimedia sharing, so multimedia files need to be retrieved precisely.

A web crawler, also called a web spider or web robot, is a program that automatically fetches web pages. It downloads web pages from the Internet and is an important component of a search engine. A web crawler uses the standard HTTP protocol and traverses the Internet information space by following hyperlinks and retrieving network documents. There are thousands of different data types on the Internet, and HTTP attaches a data format label, called a MIME type, to every object transmitted over the network. The Uniform Resource Locator (URL) is the most common form of resource identifier. A URL describes the specific location of a resource on a particular server. The meta elements of a page (METAFILE) provide meta-information about the page, such as a description, keywords, and update frequency for search engines, and the keywords of these elements can be indexed.
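As an illustration of reading page meta-information, the following is a minimal sketch that pulls the keywords and description out of a page's meta elements using Python's standard `html.parser` module. The names `MetaKeywordParser` and `extract_meta` are hypothetical, and the sketch assumes that "METAFILE" refers to HTML `<meta name="keywords">` and `<meta name="description">` elements with comma-separated keywords; the source does not specify the exact format.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (for example <meta charset>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        name = attr.get("name", "").lower()
        content = attr.get("content", "")
        if name == "keywords":
            # Assumption: keywords are comma-separated, as is conventional.
            self.keywords.extend(
                word.strip().lower() for word in content.split(",") if word.strip()
            )
        elif name == "description":
            self.description = content.strip()


def extract_meta(html):
    """Return (keywords, description) found in the meta elements of an HTML string."""
    parser = MetaKeywordParser()
    parser.feed(html)
    parser.close()
    return parser.keywords, parser.description
```

Keywords are lowercased here so that later lookups are case-insensitive; that normalization is a design choice of the sketch, not something the source requires.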

The data in web search are often high-dimensional, with dimensionality reaching the order of millions. Discovering and exploiting low-dimensional structure in high-dimensional data is therefore particularly important in web search. In addition, only a small number of elements can be observed in web search, and the aim is to infer the large number of unobserved elements from this limited information, thereby recovering an unknown low-rank or approximately low-rank matrix.

Suppose the known data have been arranged into a high-dimensional data matrix or sample matrix. The problem of estimating a low-dimensional subspace is called low-rank matrix approximation. When some elements of the low-rank matrix or sample matrix are severely corrupted, the corrupted elements can be identified automatically and the original low-rank matrix recovered accurately. In web search, a data matrix needs to be decomposed into the sum of a low-rank matrix and a sparse matrix, and both the low-rank matrix and the sparse matrix are to be recovered at the same time.
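In symbols, the decomposition described above can be sketched as follows. The names D (the m × n data matrix), A (the low-rank part), E (the sparse part), and the weight λ are assumptions introduced for illustration. The second line is one common convex formulation of the problem, usually called robust principal component analysis; the source does not state which algorithm, if any, the framework uses.

```latex
D = A + E, \qquad \operatorname{rank}(A) \ll \min(m, n), \quad E \ \text{sparse}

\min_{A,\,E} \; \|A\|_{*} + \lambda \|E\|_{1} \quad \text{subject to} \quad D = A + E
```

Here \|A\|_{*} is the nuclear norm (the sum of the singular values of A) and \|E\|_{1} is the sum of the absolute values of the entries of E.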

The invention provides a web crawler framework for network documents that extracts and indexes URLs and maps them to keywords. At the cost of a modest increase in data volume, URLs can be indexed by their METAFILE keywords and mapped to the associated keywords, so that network documents can be retrieved semantically by keyword.

Summary of the invention

The object of the present invention is to provide a web crawler framework for network documents that extracts and indexes URLs and maps them to keywords. The invention has the following features:

Technical solution

1. A web crawler framework for network documents that extracts and indexes URLs and maps them to keywords, with the following specific steps:

1) The web crawler starts from the traversal parameters and a starting URL;

2) The web page is downloaded from the network using the first URL in the URL store;

3) The page is passed to the duplicate page check, whose accuracy depends on the specific traversal parameters;

4) If the web page is not rejected, it is saved in the web page store;

5) The page is also passed to link extraction;

6) Link extraction extracts links from the METAFILE of the web page and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is rejected for download;

7) Keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;

8) URLs that are not rejected are indexed, passed to the URL store, and mapped to the associated keywords;

9) The URL store then passes a URL that has not yet been visited to web page extraction.
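The nine steps above can be sketched as a single crawl loop. This is a minimal illustration under stated assumptions, not the patented implementation: the names `crawl`, `PageParser`, `fetch`, `allowed_hosts`, and `max_pages` are hypothetical; the traversal parameters are reduced to a host allow-list and a page limit; the duplicate page check is an exact SHA-256 match; links are read from `<a href>` attributes and keywords from `<meta name="keywords">`; and each keyword is mapped to the URL of the page that declares it. The download function is injected so that the sketch runs without network access.

```python
from collections import defaultdict, deque
from hashlib import sha256
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects hyperlinks and META keywords from one page."""

    def __init__(self):
        super().__init__()
        self.links, self.keywords = [], []

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "meta" and attr.get("name", "").lower() == "keywords":
            content = attr.get("content", "")
            self.keywords += [w.strip().lower() for w in content.split(",") if w.strip()]


def crawl(start_url, fetch, allowed_hosts, max_pages=100):
    """fetch(url) returns an HTML string or None (hypothetical, injected by the caller)."""
    url_store = deque([start_url])      # step 1: URL store seeded with the starting URL
    seen_urls = {start_url}             # URLs already queued or visited
    page_hashes = set()                 # digests used by the duplicate page check
    page_store = {}                     # web page store: URL -> HTML
    keyword_map = defaultdict(set)      # keyword store: keyword -> set of URLs

    while url_store and len(page_store) < max_pages:
        url = url_store.popleft()       # steps 2 and 9: take the next unvisited URL
        html = fetch(url)
        if html is None:
            continue
        digest = sha256(html.encode("utf-8")).hexdigest()
        if digest in page_hashes:       # step 3: reject exact duplicates only
            continue
        page_hashes.add(digest)
        page_store[url] = html          # step 4: save the accepted page
        parser = PageParser()           # step 5: pass the page to link extraction
        parser.feed(html)
        parser.close()
        for word in parser.keywords:    # steps 7 and 8: keyword -> URL mapping
            keyword_map[word].add(url)
        for href in parser.links:       # step 6: URL check
            link = urljoin(url, href).split("#")[0]
            if link in seen_urls or urlparse(link).hostname not in allowed_hosts:
                continue                # visited before, or fails the traversal criteria
            seen_urls.add(link)
            url_store.append(link)      # step 8: accepted URL goes to the URL store
    return page_store, keyword_map
```

With the returned `keyword_map`, a keyword lookup such as `keyword_map["music"]` yields the set of stored URLs associated with that keyword, which is the simplest form of the keyword-based retrieval the steps describe. A real traversal parameter list would likely carry more criteria (depth, MIME type, update frequency) than the two used here.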

Brief description of the drawings

Fig. 1 is a diagram of the web crawler framework for network documents that extracts and indexes URLs and maps them to keywords.

Embodiment

This web crawler framework for network documents, which extracts and indexes URLs and maps them to keywords, comprises the following steps:

1) The web crawler starts from the traversal parameters and a starting URL;

4) If the web page is not rejected, it is saved in the web page store;

5) The page is also passed to link extraction;

Claims

1. A web crawler framework for network documents that extracts and indexes URLs and maps them to keywords, with the following specific steps:

1) The web crawler starts from the traversal parameters and a starting URL;

4) If the web page is not rejected, it is saved in the web page store;

5) The page is also passed to link extraction;

6) Link extraction extracts links from the METAFILE of the web page and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is rejected for download;