CN103744944A - Method for re-filtering in webpage or data crawling by web crawler - Google Patents
- Publication number
- CN103744944A
- Authority
- CN
- China
- Prior art keywords
- information
- webpage
- search
- url
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for re-filtering webpages or data while a web crawler is crawling them. The method comprises the following steps: inputting keywords for the information to be searched; searching for URL (uniform resource locator) addresses by a server; crawling the information of the target webpages from the URL addresses found; inputting secondary search keywords; crawling the webpage information again; and outputting the target information. The method re-filters webpages on top of the web crawler's automatic webpage search. Because the amount of information on the Internet is very large, finding target information manually takes considerable effort, and the searcher cannot tell whether the information found is the best available. The method refines the search results so that people can obtain the target information conveniently and effectively.
Description
Technical field
The present invention relates to a method for crawling webpages during a search process, and belongs to the field of network technology.
Background
A web crawler is a program that automatically fetches webpages. It downloads pages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or more initial pages, obtains the URLs on those pages, and, while crawling, keeps extracting new URLs from the current page and putting them into a queue until a stop condition of the system is met. A web crawler is thus a program or script that automatically fetches World Wide Web information according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm. The accuracy with which a crawler finds target webpages is still not very high, which makes it harder to obtain the information we need. For this reason, we propose a method for re-filtering while a web crawler crawls webpages or data.
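The traditional crawling loop described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the in-memory `LINK_GRAPH`, the function name `crawl`, and the `max_pages` stop condition are assumptions made for illustration, and no real network requests are issued.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs linked from that page.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed_urls, link_graph, max_pages=10):
    """Visit pages breadth-first from the seeds until the queue is empty
    or max_pages pages have been visited (the assumed stop condition)."""
    queue = deque(seed_urls)
    seen = set(seed_urls)  # URLs already queued, so no page is visited twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # Extract new URLs from the current page and put them into the queue.
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["http://example.com/"], LINK_GRAPH, max_pages=3))
```

A real crawler would replace the dictionary lookup with an HTTP fetch and link extraction, but the queue and the stop condition would play the same roles.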
Summary of the invention
To solve the problem that current web crawlers crawl target webpages inaccurately during a search, the present invention provides a method for re-filtering while a web crawler crawls webpages or data. The invention comprises the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: crawl the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: crawl the webpage information again;
Step 6: output the target information.
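The six steps can be sketched in Python as a two-stage search. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for pages reachable by the server, substring matching stands in for the unspecified search method, Step 5 is modelled as filtering the pages already crawled rather than fetching them again, and all function names are hypothetical.

```python
# Hypothetical index: URL -> page text, standing in for pages a server could search.
PAGES = {
    "http://example.com/python-crawler": "Writing a web crawler in Python with a URL queue",
    "http://example.com/java-crawler": "Writing a web crawler in Java",
    "http://example.com/cooking": "Recipes for noodle soup",
}


def search_urls(keyword, pages):
    """Steps 1-2: return the URLs whose page text contains the keyword."""
    needle = keyword.lower()
    return [url for url, text in pages.items() if needle in text.lower()]


def fetch(urls, pages):
    """Step 3: 'crawl' the selected URLs (here, a plain dictionary lookup)."""
    return {url: pages[url] for url in urls}


def refilter(crawled, secondary_keyword):
    """Steps 4-5: keep only crawled pages that also match the secondary keyword."""
    needle = secondary_keyword.lower()
    return {url: text for url, text in crawled.items() if needle in text.lower()}


def two_stage_search(keyword, secondary_keyword, pages):
    """Run both passes and return the target information (Step 6)."""
    first_pass = fetch(search_urls(keyword, pages), pages)
    return refilter(first_pass, secondary_keyword)


result = two_stage_search("web crawler", "python", PAGES)
for url, text in result.items():
    print(url, "->", text)
```

The first keyword narrows the pages to those about web crawlers, and the secondary keyword then narrows that set further, which is the refinement the method relies on.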
Effect of the invention: the present invention re-filters webpages on top of the web crawler's automatic webpage search. The amount of information on the Internet is now very large; finding target information manually takes considerable effort, and one cannot tell whether the information found is the best available. The method refines the search results and thereby provides a convenient and effective way to obtain the target information.
Brief description of the drawings
Fig. 1 is a flow chart of the method for re-filtering while a web crawler crawls webpages or data.
Detailed description
Embodiment: referring to Fig. 1, the flow chart of the method for re-filtering while a web crawler crawls webpages or data, this embodiment consists of the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: crawl the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: crawl the webpage information again;
Step 6: output the target information.
The length of the keyword for the information to be searched is not limited. Before searching for URL addresses, the server first analyzes the keyword and then selects the URL addresses to search. The information crawled from the target webpages at those URL addresses is displayed in the form of a list. The secondary search keyword is a more specific descriptive term for the target information.
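The patent does not say how the keyword is analyzed or how URL addresses are selected. As one possible reading, the following Python sketch splits a keyword phrase of any length into terms, ranks candidate URLs by how many terms appear in the page title, and formats the result as a numbered list. The `TITLES` data, the scoring rule, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical index: URL -> page title, used only for illustration.
TITLES = {
    "http://example.com/1": "Used mountain bike for sale",
    "http://example.com/2": "Mountain hiking routes",
    "http://example.com/3": "Road bike repair guide",
}


def analyze(keyword):
    """Split a keyword phrase of any length into lowercase terms."""
    return re.findall(r"\w+", keyword.lower())


def select_urls(keyword, titles):
    """Rank URLs by how many query terms appear in the title; drop non-matches."""
    terms = analyze(keyword)
    scored = []
    for url, title in titles.items():
        words = set(analyze(title))
        score = sum(1 for term in terms if term in words)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties are broken by URL so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


def as_list(urls, titles):
    """Format the selected pages as a numbered list for display."""
    return [f"{i}. {titles[url]} ({url})" for i, url in enumerate(urls, start=1)]


for line in as_list(select_urls("mountain bike", TITLES), TITLES):
    print(line)
```

A more specific secondary keyword, such as "used mountain bike", would raise the score of only the first page under this rule, which is one way the second pass could narrow the list.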
To those skilled in the art, it is evident that the invention is not limited to the details of the above exemplary embodiment and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiment should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced in the invention. Any reference numeral in a claim should not be construed as limiting the claim concerned.
Claims (4)
1. A method for re-filtering while a web crawler crawls webpages or data, characterized in that it is realized by the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: crawl the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: crawl the webpage information again;
Step 6: output the target information.
2. The method for re-filtering while a web crawler crawls webpages or data according to claim 1, characterized in that: in step 2, before searching for URL addresses, the server first analyzes the keyword and then selects the URL addresses to search.
3. The method for re-filtering while a web crawler crawls webpages or data according to claim 1, characterized in that: in step 3, the information of the target webpages crawled from the URL addresses found is displayed in the form of a list.
4. The method for re-filtering while a web crawler crawls webpages or data according to claim 1, characterized in that: in step 4, the secondary search keyword is a more specific descriptive term for the target information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310754635.6A CN103744944A (en) | 2013-12-31 | 2013-12-31 | Method for re-filtering in webpage or data crawling by web crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310754635.6A CN103744944A (en) | 2013-12-31 | 2013-12-31 | Method for re-filtering in webpage or data crawling by web crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103744944A (en) | 2014-04-23 |
Family
ID=50501962
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201310754635.6A | Pending | CN103744944A (en) | 2013-12-31 | 2013-12-31 | Method for re-filtering in webpage or data crawling by web crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103744944A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537040A (en) * | 2014-12-23 | 2015-04-22 | 小米科技有限责任公司 | Method and device for capturing webpage content and electronic device |
CN105302876A (en) * | 2015-09-28 | 2016-02-03 | 孙燕群 | Regular expression based URL filtering method |
WO2017113324A1 (en) * | 2015-12-31 | 2017-07-06 | 孙燕群 | Regular expression-based url filtering method |
CN107704515A (en) * | 2017-09-01 | 2018-02-16 | 安徽简道科技有限公司 | Data grab method based on internet data grasping system |
CN110334280A (en) * | 2019-07-10 | 2019-10-15 | 中国民航信息网络股份有限公司 | A kind of method and device of discovery confidential information leakage |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060053092A1 (en) * | 2004-09-01 | 2006-03-09 | Chris Foo | Method and system to perform dynamic search over a network |
EP1975816A1 (en) * | 2007-03-28 | 2008-10-01 | British Telecommunications Public Limited Company | Electronic document retrieval system |
CN101847161A (en) * | 2010-06-02 | 2010-09-29 | 苏州搜图网络技术有限公司 | Method for searching web pages and establishing database |
CN102253939A (en) * | 2010-05-17 | 2011-11-23 | 无锡艾斯科软件有限公司 | Searching method and system based on cloud computing technology |
CN102270331A (en) * | 2011-08-14 | 2011-12-07 | 黄斌 | Network shopping navigating method based on visual search |
CN102375813A (en) * | 2010-08-09 | 2012-03-14 | 腾讯科技(深圳)有限公司 | Duplicate detection system and method for search engines |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhao | Web scraping | |
CN103744944A (en) | Method for re-filtering in webpage or data crawling by web crawler | |
CN104933056A (en) | Uniform resource locator (URL) de-duplication method and device | |
CN104615627B (en) | A kind of event public feelings information extracting method and system based on microblog | |
CN102254027A (en) | Method for obtaining webpage contents in batch | |
US20160253295A1 (en) | Method, device, terminal and computer storage medium for realizing intelligent reading of a browser | |
WO2014000537A1 (en) | System and method for finding phishing website | |
CN105528422A (en) | Focused crawler processing method and apparatus | |
CN105302876A (en) | Regular expression based URL filtering method | |
CN103823907A (en) | Method, device and engine for integrating on-line video resource addresses | |
CN103077250A (en) | Method and device for capturing webpage content | |
CN103984749A (en) | Focused crawler method based on link analysis | |
CN103823792A (en) | Method and equipment for detecting hotspot events from text document | |
CN106021418A (en) | News event clustering method and device | |
CN104991904A (en) | Page data acquisition method of dynamic webpage | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
US10491606B2 (en) | Method and apparatus for providing website authentication data for search engine | |
CN105488402A (en) | Dark link detection method and system | |
CN103605773A (en) | Multimedia file searching method and device | |
CN104008213A (en) | Method and device for finding and counting webpage information updating | |
JP2014532220A (en) | Net comment collection method and system | |
CN103838865A (en) | Method and device for mining timeliness seed page | |
CN103761669A (en) | Method for applying web spider technology on online shopping | |
CN102819613A (en) | RSS (really simple syndication) information paging fetching system and method | |
CN104268284A (en) | Web browse filtering softdog device special for juveniles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140423 |