CN107220362A - A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords - Google Patents

A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords

Info

Publication number
CN107220362A
Authority
CN
China
Prior art keywords
url
keyword
web crawlers
network
indexing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710422522.4A
Other languages
Chinese (zh)
Inventor
张军
徐苛
陈晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai DC Science Co Ltd
Original Assignee
Shanghai DC Science Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai DC Science Co Ltd
Priority to CN201710422522.4A
Publication of CN107220362A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The present invention discloses a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. At the cost of a modest increase in data volume, the framework indexes URLs by the keywords found in a page's METAFILE, maps each URL to its associated keywords, and uses those keywords to perform semantic retrieval of network documents.

Description

A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords
Technical field
The present invention relates to a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords.
Background art
Current search engines search only text and cannot effectively search multimedia files such as music, pictures, and video. The main difficulties are that the volume of multimedia data is too large, that it is unclear how multimedia files should be indexed, and that the processed multimedia files must then be retrievable. A large number of multimedia files now exist on the Internet, and with the rise of social networking sites and multimedia sharing in particular, multimedia files need to be retrieved precisely.
A web crawler, also called a web spider or web robot, is a program that fetches web pages automatically. It downloads pages from the Internet and is an important component of a search engine. A web crawler uses the standard HTTP protocol and traverses the Internet information space by following hyperlinks and retrieving network documents. There are thousands of different data types on the Internet, and HTTP attaches a data-format label called a MIME type to every object transmitted over the network. The uniform resource locator (URL) is the most common form of resource identifier; a URL describes the specific location of a resource on a particular server. The metafile (METAFILE) provides meta-information about a page, such as the description, keywords, and update frequency intended for search engines, and the page can be indexed by those keywords.
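The meta-information described above can be illustrated with a short Python sketch. It assumes that the METAFILE corresponds to the HTML `<meta name="keywords">` and `<meta name="description">` elements of a page, which the source does not state explicitly; the names `MetaKeywordParser` and `extract_meta` are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr = {k: (v or "") for k, v in attrs}
        name = attr.get("name", "").lower()
        content = attr.get("content", "")
        if name == "keywords":
            # Keywords are conventionally comma-separated; normalise to lowercase.
            self.keywords.extend(
                kw.strip().lower() for kw in content.split(",") if kw.strip()
            )
        elif name == "description":
            self.description = content.strip()


def extract_meta(html):
    """Return (keywords, description) found in the meta elements of an HTML string."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords, parser.description
```

Calling `extract_meta` on a downloaded page yields the keyword list that a crawler of this kind could use as index terms for the page's URL.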
The data in web search are often high-dimensional, with a dimensionality that can reach the order of millions. Discovering and exploiting the low-dimensional structure in high-dimensional data is therefore particularly important in web search. In addition, only a small number of elements can be observed in web search, and it is desirable to infer the large number of unobserved elements from this limited information and thereby recover an unknown low-rank or approximately low-rank matrix.
Suppose the known data are arranged as a high-dimensional data matrix, or sample matrix. The problem of estimating a low-dimensional subspace is called low-rank matrix approximation. When some elements of the low-rank matrix or sample matrix are severely corrupted, the corrupted elements can be identified automatically and the original low-rank matrix recovered accurately. In web search, a data matrix needs to be decomposed into the sum of a low-rank matrix and a sparse matrix, and both the low-rank matrix and the sparse matrix are to be recovered at the same time.
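The source gives no formulas for these problems. One common way to write them down, offered here only as an illustrative assumption, is the convex formulation known as principal component pursuit for the low-rank plus sparse decomposition, and nuclear-norm minimization for recovering a matrix from a few observed entries. The symbols $D$ (data matrix), $L$ (low-rank part), $S$ (sparse part), $\lambda$ (weight), $M$ (partially observed matrix), and $\Omega$ (set of observed index pairs) are not taken from the source.

```latex
% Low-rank plus sparse decomposition (principal component pursuit)
\min_{L,\,S} \; \|L\|_{*} + \lambda \,\|S\|_{1}
\quad \text{subject to} \quad D = L + S

% Low-rank matrix completion from the observed entries in \Omega
\min_{X} \; \|X\|_{*}
\quad \text{subject to} \quad X_{ij} = M_{ij} \;\; \text{for all } (i,j) \in \Omega
```

Here $\|\cdot\|_{*}$ is the nuclear norm (the sum of singular values), used as a convex surrogate for rank, and $\|\cdot\|_{1}$ is the entrywise $\ell_1$ norm, used as a convex surrogate for sparsity.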
The invention provides a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. At the cost of a modest increase in data volume, the framework indexes URLs by the keywords found in a page's METAFILE, maps each URL to its associated keywords, and uses those keywords to perform semantic retrieval of network documents.
Summary of the invention
An object of the present invention is to provide a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. The invention includes the following features:
Technical solution of the invention
1. A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords, with the following specific steps:
1) the web crawler starts from the traversal parameters and a starting URL;
2) the first URL in the URL store is used to download the web page from the network;
3) the page is passed to the duplicate-page check, whose accuracy depends on the specific traversal parameters;
4) if the page is not rejected, it is saved to the page store;
5) and passed to link extraction;
6) link extraction extracts links from the page's METAFILE and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is refused for download;
7) keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;
8) URLs that are not rejected are indexed and passed to the URL store, and a mapping to their associated keywords is established;
9) the URL store then passes a URL that has not yet been visited to page extraction.
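The nine steps above can be sketched as a single crawl loop. This is a minimal illustration under stated assumptions, not the patented implementation: the traversal parameters are reduced to an allowed host and a page limit, the duplicate-page check is an exact SHA-256 comparison, links are taken from `<a href>` elements rather than from a METAFILE, and each page's meta keywords are mapped to that page's own URL. The names `PageParser`, `crawl`, `fetch`, `allowed_host`, and `max_pages` are all hypothetical.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects hyperlinks and meta keywords from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attr = {k: (v or "") for k, v in attrs}
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "meta" and attr.get("name", "").lower() == "keywords":
            self.keywords += [
                kw.strip().lower()
                for kw in attr.get("content", "").split(",")
                if kw.strip()
            ]


def crawl(start_url, fetch, allowed_host, max_pages=100):
    """fetch(url) must return the page HTML as a string, or None on failure."""
    url_store = deque([start_url])   # step 1: starting URL
    seen_urls = {start_url}
    page_hashes = set()
    page_store = {}                  # URL -> HTML
    keyword_map = {}                 # keyword -> set of URLs

    while url_store and len(page_store) < max_pages:
        url = url_store.popleft()    # steps 2 and 9: next unvisited URL
        html = fetch(url)
        if html is None:
            continue
        # Step 3: duplicate-page check (exact match only in this sketch).
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in page_hashes:
            continue
        page_hashes.add(digest)
        page_store[url] = html       # step 4: save to the page store

        parser = PageParser()        # steps 5 to 7: extract links and keywords
        parser.feed(html)
        for kw in parser.keywords:   # keyword store and keyword-to-URL mapping
            keyword_map.setdefault(kw, set()).add(url)

        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            # Step 6: reject URLs already seen or outside the traversal criteria.
            if link in seen_urls or urlparse(link).netloc != allowed_host:
                continue
            seen_urls.add(link)
            url_store.append(link)   # step 8: pass accepted URLs to the URL store

    return page_store, keyword_map
```

Passing the download function in as `fetch` keeps the sketch independent of any particular HTTP client; in practice it could wrap `urllib.request.urlopen`, and a near-duplicate detector could replace the exact hash if the traversal parameters call for a looser check.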
Brief description of the drawings
Fig. 1 is a diagram of the framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords.
Embodiment
This web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword, including as follows Step:
1) web crawlers is since traversal parameter and starting URL;
2) the contained network pages above and below network of first URL in URL storehouses are used;
3) repeated pages inspection is passed it to, the accuracy for repeating to verify depends on specific traversal parameter;
If 4) webpage is not rejected, it is saved in web page library;
5) and pass to link extract;
6) link is extracted extracts link from the METAFILE of webpage, passes to URL inspections;If accessed before, or The standard listed in traversal parameter list is not met, then refusal is downloaded;
7) while extracting keyword, keywords database is passed to, in case semantic retrieval;
8) URL not being rejected is indexed, passs URL storehouses;And set up mapping with associative key;
9) then a URL not being accessed is passed to webpage and extracted by URL storehouses.
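Steps 7 and 8 of the embodiment, indexing URLs and mapping them to keywords for later retrieval, can be illustrated with a small in-memory inverted index. This is a sketch under assumptions: the source does not specify the index layout, so the integer URL identifiers, the class name `KeywordIndex`, and the all-keywords-must-match query rule are hypothetical choices.

```python
class KeywordIndex:
    """Maps keywords to indexed URLs and answers keyword queries."""

    def __init__(self):
        self.url_ids = {}    # URL -> integer index
        self.urls = []       # integer index -> URL
        self.postings = {}   # keyword -> set of integer indexes

    def add(self, url, keywords):
        """Index the URL if it is new, then map each keyword to it."""
        if url not in self.url_ids:
            self.url_ids[url] = len(self.urls)
            self.urls.append(url)
        uid = self.url_ids[url]
        for kw in keywords:
            self.postings.setdefault(kw.strip().lower(), set()).add(uid)
        return uid

    def search(self, *keywords):
        """Return the URLs mapped to every given keyword, in index order."""
        sets = [self.postings.get(kw.strip().lower(), set()) for kw in keywords]
        if not sets:
            return []
        hits = set.intersection(*sets)
        return [self.urls[i] for i in sorted(hits)]
```

Because only the URL and its keywords are stored, a multimedia file such as a song or video clip can be found through the keywords of the page that references it without indexing the media data itself, which is consistent with the stated goal of only modestly increasing data volume.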

Claims (1)

1. A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords, with the following specific steps:
1) the web crawler starts from the traversal parameters and a starting URL;
2) the first URL in the URL store is used to download the web page from the network;
3) the page is passed to the duplicate-page check, whose accuracy depends on the specific traversal parameters;
4) if the page is not rejected, it is saved to the page store;
5) and passed to link extraction;
6) link extraction extracts links from the page's METAFILE and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is refused for download;
7) keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;
8) URLs that are not rejected are indexed and passed to the URL store, and a mapping to their associated keywords is established;
9) the URL store then passes a URL that has not yet been visited to page extraction.
CN201710422522.4A 2017-06-08 2017-06-08 A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords Pending CN107220362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710422522.4A CN107220362A (en) 2017-06-08 2017-06-08 A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710422522.4A CN107220362A (en) 2017-06-08 2017-06-08 A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords

Publications (1)

Publication Number Publication Date
CN107220362A true CN107220362A (en) 2017-09-29

Family

ID=59947694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710422522.4A Pending CN107220362A (en) 2017-06-08 2017-06-08 A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords

Country Status (1)

Country Link
CN (1) CN107220362A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109727048A (en) * 2017-10-31 2019-05-07 北京国双科技有限公司 Data processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503253A (en) * 2016-11-11 2017-03-15 张军 The framework that a kind of web crawlers for picture format extracts URL and indexes and map
CN106776694A (en) * 2016-11-11 2017-05-31 张军 A kind of network distribution type photographic search engine framework based on software definition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109727048A (en) * 2017-10-31 2019-05-07 北京国双科技有限公司 Data processing method and device
CN109727048B (en) * 2017-10-31 2021-04-23 北京国双科技有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN101452470B (en) Summary-style network search engine system and search method and uses
US20220197923A1 (en) Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information
CN102693271B (en) A kind of network information recommending method and system
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN104715064B (en) It is a kind of to realize the method and server that keyword is marked on webpage
CN103218431B (en) A kind ofly can identify the system that info web gathers automatically
CN102436564A (en) Method and device for identifying falsified webpage
CN108021598B (en) Page extraction template matching method and device and server
CN111008321A (en) Recommendation method and device based on logistic regression, computing equipment and readable storage medium
CN103617174A (en) Distributed searching method based on cloud computing
CN103324622A (en) Method and device for automatic generating of front page abstract
CN103617266A (en) Personalized extension search method, device and system
US11334592B2 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
CN104881428A (en) Information graph extracting and retrieving method and device for information graph webpages
Bhardwaj et al. Web scraping using summarization and named entity recognition (ner)
CN106202349B (en) Webpage classification dictionary generation method and device
CN114154043A (en) Website fingerprint calculation method, system, storage medium and terminal
JP7395377B2 (en) Content search methods, devices, equipment, and storage media
CN106777140B (en) Method and device for searching unstructured document
CN108287831B (en) URL classification method and system and data processing method and system
CN107220362A (en) A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword
CN113806647A (en) Method for identifying development framework and related equipment
EP2711838A1 (en) Documentation parser
CN108255891A (en) A kind of method and device for differentiating type of webpage
CN109948015B (en) Meta search list result extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170929