CN107220362A - A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords - Google Patents
- Publication number
- CN107220362A (application CN201710422522.4A)
- Authority
- CN
- China
- Prior art keywords
- url
- keyword
- web crawlers
- network
- indexing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The present invention discloses a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. At the cost of a modest increase in data volume, the framework indexes URLs by the keywords in a page's metafile (METAFILE), establishes a mapping between each URL and its associated keywords, and uses those keywords for semantic retrieval of network documents.
Description
Technical field
The present invention relates to a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords.
Background technology
Current search engines search text only and cannot yet search multimedia files such as music, pictures, and video effectively. The main obstacles are that multimedia data volumes are too large, that it is unclear how to index multimedia files, and that it is unclear how to then retrieve the processed multimedia files. A large number of multimedia files are now on the Internet, driven in particular by the rise of social networking sites and multimedia sharing, so multimedia files need to be retrieved precisely.
A web crawler, also called a web spider or web robot, is a program that automatically fetches web pages. It downloads pages from the Internet and is an important component of a search engine. A web crawler uses the standard HTTP protocol and traverses the Internet information space by following hyperlinks and retrieving network documents. There are thousands of different data types on the Internet, and HTTP attaches a data format label, called a MIME type, to every object transmitted over the network. The URL (Uniform Resource Locator) is the most common form of resource identifier; a URL describes the specific location of a resource on a particular server. A metafile (METAFILE) provides meta-information about a page, such as a description and keywords for search engines and the update frequency, so the keywords of these elements can be indexed.
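As an illustration of the meta-information described above, the following minimal sketch reads the keywords and description from a page. It assumes that the METAFILE corresponds to HTML `<meta name="keywords">` and `<meta name="description">` tags, which the source does not state; `MetaKeywordParser` and `extract_meta` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects keywords and description from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name == "keywords":
            # Assumption: keywords are comma-separated, as is conventional.
            self.keywords = [k.strip().lower() for k in content.split(",") if k.strip()]
        elif name == "description":
            self.description = content.strip()


def extract_meta(html):
    """Return (keywords, description) found in the page's meta tags."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords, parser.description


sample = """<html><head>
<meta name="keywords" content="Crawler, URL index, keyword mapping">
<meta name="description" content="Demo page">
</head><body></body></html>"""
keywords, description = extract_meta(sample)
```

Lower-casing the keywords is a design choice of this sketch so that later lookups are case-insensitive.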
Web search data is often high-dimensional, with dimensionality reaching the order of millions. Discovering and exploiting low-dimensional structure in high-dimensional data is therefore particularly important in web search. In addition, only a small number of elements can be observed in web search, and the aim is to infer the large number of unobserved elements from this limited information, thereby recovering an unknown low-rank matrix or approximately low-rank matrix.
Suppose the known data has been arranged into a high-dimensional data matrix, or sample matrix. The problem of estimating a low-dimensional subspace from it is called low-rank matrix approximation. When some elements of the low-rank matrix or sample matrix are severely corrupted, the corrupted elements can be identified automatically and the original low-rank matrix recovered accurately. In web search, a data matrix needs to be decomposed into the sum of a low-rank matrix and a sparse matrix, with both recovered simultaneously.
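The source gives no formulas for these problems. One common way to write them, offered here only as an assumed formalization, uses a data matrix $D$, a low-rank part $A$, a sparse part $E$, a target rank $r$, and a weight $\lambda > 0$ (all symbols are introduced for illustration):

```latex
% Low-rank approximation: best rank-r fit in the Frobenius norm
% (solved by the truncated SVD, per the Eckart-Young theorem)
\min_{A} \; \lVert D - A \rVert_F
\quad \text{subject to} \quad \operatorname{rank}(A) \le r

% Low-rank plus sparse decomposition, in its usual convex relaxation
% (nuclear norm for low rank, entrywise l1 norm for sparsity)
\min_{A,\,E} \; \lVert A \rVert_{*} + \lambda \lVert E \rVert_{1}
\quad \text{subject to} \quad D = A + E
```

The second problem is the formulation usually called principal component pursuit or robust PCA; whether the inventors had it in mind is not stated in the source.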
The invention provides a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. At the cost of a modest increase in data volume, it indexes URLs by METAFILE keywords, establishes a mapping between URLs and their associated keywords, and uses the keywords for semantic retrieval of network documents.
Summary of the invention
The object of the invention is to provide a framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords. The invention includes the following features:
Technical solution
1. A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords, with the following specific steps:
1) The web crawler starts from the traversal parameters and a starting URL;
2) It downloads the web page from the network using the first URL in the URL store;
3) The page is passed to the duplicate page check, whose accuracy depends on the specific traversal parameters;
4) If the page is not rejected, it is saved to the web page store;
5) The page is then passed to link extraction;
6) Link extraction extracts links from the METAFILE of the page and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is refused for download;
7) Keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;
8) URLs that are not rejected are indexed and passed to the URL store, and a mapping to their associated keywords is established;
9) The URL store then passes an unvisited URL to web page extraction.
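The nine steps can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the patented implementation: `FAKE_WEB` is a hypothetical in-memory stand-in for HTTP downloads with links and keywords already parsed, the duplicate page check is an exact match on the page body, and the only traversal parameters are a hypothetical host allow-list (`allowed_hosts`) and a page limit (`max_pages`).

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (body, links, keywords).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"], ["home"]),
    "http://example.com/a": ("page a", ["http://example.com/", "http://other.test/x"], ["music"]),
    "http://example.com/b": ("home", [], ["video"]),  # same body as "/": a duplicate page
}


def crawl(start_url, allowed_hosts, max_pages=100):
    url_store = deque([start_url])   # URL store, seeded with the starting URL (step 1)
    seen_urls = {start_url}          # URLs already queued or visited
    page_store = {}                  # web page store: URL -> body
    keyword_store = {}               # keyword store: keyword -> set of URLs
    seen_bodies = set()              # bodies seen so far, for the duplicate check

    while url_store and len(page_store) < max_pages:
        url = url_store.popleft()                  # steps 2 and 9: take the next URL
        page = FAKE_WEB.get(url)                   # stands in for an HTTP download
        if page is None:
            continue
        body, links, keywords = page
        if body in seen_bodies:                    # step 3: duplicate page check
            continue
        seen_bodies.add(body)
        page_store[url] = body                     # step 4: save the accepted page
        for kw in keywords:                        # step 7: keywords to keyword store
            keyword_store.setdefault(kw, set()).add(url)
        for link in links:                         # steps 5 and 6: extract and check links
            if link in seen_urls:
                continue                           # visited or queued before
            if urlparse(link).hostname not in allowed_hosts:
                continue                           # fails the traversal parameters
            seen_urls.add(link)
            url_store.append(link)                 # step 8: accepted URL to URL store
    return page_store, keyword_store


pages, keyword_index = crawl("http://example.com/", {"example.com"})
```

In this run the page at `http://example.com/b` is rejected as a duplicate of the home page, and `http://other.test/x` is never queued because its host is outside the allow-list. Mapping a page's keywords to that page's own URL is one reading of step 8; the source does not specify the exact association.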
Brief description of the drawings
Fig. 1 is a block diagram of the framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords.
Embodiment
The framework, in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords, comprises the following steps:
1) The web crawler starts from the traversal parameters and a starting URL;
2) It downloads the web page from the network using the first URL in the URL store;
3) The page is passed to the duplicate page check, whose accuracy depends on the specific traversal parameters;
4) If the page is not rejected, it is saved to the web page store;
5) The page is then passed to link extraction;
6) Link extraction extracts links from the METAFILE of the page and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is refused for download;
7) Keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;
8) URLs that are not rejected are indexed and passed to the URL store, and a mapping to their associated keywords is established;
9) The URL store then passes an unvisited URL to web page extraction.
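Steps 7) and 8) amount to keeping a keyword-to-URL mapping that can later be queried. One possible data structure is an inverted index, sketched below; the class `KeywordUrlIndex`, its integer URL ids, and the "match all keywords" query rule are assumptions made for illustration, since the source does not describe the index layout or the retrieval rule.

```python
from collections import defaultdict


class KeywordUrlIndex:
    """Hypothetical inverted index from keywords to indexed URLs."""

    def __init__(self):
        self.urls = []                     # URL id -> URL (the URL index)
        self.url_ids = {}                  # URL -> URL id
        self.postings = defaultdict(set)   # keyword -> set of URL ids

    def add(self, url, keywords):
        """Index a URL and map each of its keywords to it."""
        if url not in self.url_ids:
            self.url_ids[url] = len(self.urls)
            self.urls.append(url)
        uid = self.url_ids[url]
        for kw in keywords:
            self.postings[kw.strip().lower()].add(uid)
        return uid

    def search(self, *keywords):
        """Return the URLs mapped to every given keyword, in indexing order."""
        sets = [self.postings.get(k.strip().lower(), set()) for k in keywords]
        if not sets:
            return []
        hits = set.intersection(*sets)
        return [self.urls[i] for i in sorted(hits)]


index = KeywordUrlIndex()
index.add("http://example.com/song.mp3", ["Music", "jazz"])
index.add("http://example.com/clip.mp4", ["video", "jazz"])
jazz_urls = index.search("jazz")
jazz_video_urls = index.search("jazz", "video")
```

Because the lookup uses only keywords and URLs, the same structure works for multimedia files whose content is never parsed, which matches the motivation given in the background section.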
Claims (1)
1. A framework in which a web crawler for network documents extracts URLs, indexes them, and maps them to keywords, with the following specific steps:
1) The web crawler starts from the traversal parameters and a starting URL;
2) It downloads the web page from the network using the first URL in the URL store;
3) The page is passed to the duplicate page check, whose accuracy depends on the specific traversal parameters;
4) If the page is not rejected, it is saved to the web page store;
5) The page is then passed to link extraction;
6) Link extraction extracts links from the METAFILE of the page and passes them to the URL check; a URL that has been visited before, or that does not meet the criteria listed in the traversal parameter list, is refused for download;
7) Keywords are extracted at the same time and passed to the keyword store for later semantic retrieval;
8) URLs that are not rejected are indexed and passed to the URL store, and a mapping to their associated keywords is established;
9) The URL store then passes an unvisited URL to web page extraction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710422522.4A CN107220362A (en) | 2017-06-08 | 2017-06-08 | A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710422522.4A CN107220362A (en) | 2017-06-08 | 2017-06-08 | A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107220362A true CN107220362A (en) | 2017-09-29 |
Family
ID=59947694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710422522.4A Pending CN107220362A (en) | 2017-06-08 | 2017-06-08 | A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107220362A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727048A (en) * | 2017-10-31 | 2019-05-07 | 北京国双科技有限公司 | Data processing method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503253A (en) * | 2016-11-11 | 2017-03-15 | 张军 | The framework that a kind of web crawlers for picture format extracts URL and indexes and map |
CN106776694A (en) * | 2016-11-11 | 2017-05-31 | 张军 | A kind of network distribution type photographic search engine framework based on software definition |
-
2017
- 2017-06-08 CN CN201710422522.4A patent/CN107220362A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727048A (en) * | 2017-10-31 | 2019-05-07 | 北京国双科技有限公司 | Data processing method and device |
CN109727048B (en) * | 2017-10-31 | 2021-04-23 | 北京国双科技有限公司 | Data processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101452470B (en) | Summary-style network search engine system and search method and uses | |
US20220197923A1 (en) | Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information | |
CN102693271B (en) | A kind of network information recommending method and system | |
CN108737423B (en) | Phishing website discovery method and system based on webpage key content similarity analysis | |
CN104715064B (en) | It is a kind of to realize the method and server that keyword is marked on webpage | |
CN103218431B (en) | A kind ofly can identify the system that info web gathers automatically | |
CN102436564A (en) | Method and device for identifying falsified webpage | |
CN108021598B (en) | Page extraction template matching method and device and server | |
CN111008321A (en) | Recommendation method and device based on logistic regression, computing equipment and readable storage medium | |
CN103617174A (en) | Distributed searching method based on cloud computing | |
CN103324622A (en) | Method and device for automatic generating of front page abstract | |
CN103617266A (en) | Personalized extension search method, device and system | |
US11334592B2 (en) | Self-orchestrated system for extraction, analysis, and presentation of entity data | |
CN104881428A (en) | Information graph extracting and retrieving method and device for information graph webpages | |
Bhardwaj et al. | Web scraping using summarization and named entity recognition (ner) | |
CN106202349B (en) | Webpage classification dictionary generation method and device | |
CN114154043A (en) | Website fingerprint calculation method, system, storage medium and terminal | |
JP7395377B2 (en) | Content search methods, devices, equipment, and storage media | |
CN106777140B (en) | Method and device for searching unstructured document | |
CN108287831B (en) | URL classification method and system and data processing method and system | |
CN107220362A (en) | A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword | |
CN113806647A (en) | Method for identifying development framework and related equipment | |
EP2711838A1 (en) | Documentation parser | |
CN108255891A (en) | A kind of method and device for differentiating type of webpage | |
CN109948015B (en) | Meta search list result extraction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170929 |