CN103744944A - Method for re-filtering in webpage or data crawling by web crawler - Google Patents

Method for re-filtering in webpage or data crawling by web crawler

Info

Publication number
CN103744944A
Authority
CN
China
Prior art keywords
information
webpage
search
url
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310754635.6A
Other languages
Chinese (zh)
Inventor
朱龙腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI BOSHI INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI BOSHI INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI BOSHI INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201310754635.6A
Publication of CN103744944A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for re-filtering results when a web crawler captures webpages or data. The method comprises the following steps: inputting keywords for the information to be searched; searching for URL (uniform resource locator) addresses by a server; crawling the information of the target webpages from the URL addresses found; inputting secondary search keywords; crawling the webpage information again; and outputting the target information. The method re-filters webpages on top of the crawler's automatic webpage search. Because the amount of information on the internet is very large, finding target information manually takes considerable effort, and the searcher cannot tell whether a given result is the best one. The method refines the search results so that people can acquire the target information conveniently and effectively.

Description

Method for re-filtering when a web crawler captures webpages or data
Field of the invention
The present invention relates to a method for capturing webpages during a search, and belongs to the field of network technology.
Background
A web crawler is a program that automatically retrieves webpages. It downloads webpages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or several initial pages, obtains the URLs on those pages, and, while crawling, continually extracts new URLs from the current page and puts them into a queue until some stop condition of the system is met. More generally, a web crawler is a program or script that automatically captures World Wide Web information according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm. The accuracy with which crawlers find target webpages is still not very high, which makes it difficult to obtain the information we need. For this reason, we propose a method in which a web crawler re-filters when capturing webpages or data.
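The traditional crawl loop described in the background (seed URLs, a queue of newly extracted URLs, and a stop condition) can be sketched roughly as follows. This is an illustrative sketch and not code from the patent: the names crawl, LinkExtractor, fetch and max_pages are assumptions, the stop condition is assumed to be a simple page limit, and fetch is a caller-supplied function so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch(url) is a hypothetical callable returning the page HTML,
    or None when the page cannot be retrieved.
    The loop stops when the queue is empty or max_pages pages are stored.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    # In-memory stand-in for the web (assumed data, for illustration only).
    site = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": '<a href="/">home</a>',
        "http://example.test/b": "no links here",
    }
    for url in crawl(["http://example.test/"], site.get):
        print(url)
```

The seen set keeps a URL from being queued twice, which is what lets the loop terminate on sites whose pages link back to each other.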
Summary of the invention
To solve the problem that current web crawlers capture target webpages inaccurately during a search, the present invention provides a method for re-filtering when a web crawler captures webpages or data. The invention comprises the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: capture the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: capture the webpage information again;
Step 6: output the target information.
Effect of the invention: the invention filters webpages a second time on top of the web crawler's automatic webpage search. The amount of information on the internet is now very large; finding target information manually takes a great deal of effort, and the searcher cannot know whether a given result is the best one. The method refines the search results and provides a convenient and effective way to obtain the target information.
Brief description of the drawings
Fig. 1 is a flow chart of the method for re-filtering when a web crawler captures webpages or data.
Detailed description of the embodiment
Embodiment: referring to Fig. 1, the flow chart of the method for re-filtering when a web crawler captures webpages or data, the present embodiment consists of the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: capture the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: capture the webpage information again;
Step 6: output the target information.
The length of the keyword entered for the information to be searched is not limited. Before searching for URL addresses, the server first analyzes the keyword and then selects the URL addresses to search. The information captured from the target webpages at those URL addresses is displayed as a list. The secondary search keyword entered afterwards is a more specific descriptive term for the target information.
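The six steps can be illustrated with a minimal sketch under stated assumptions. The patent does not specify how the keyword is analyzed, how the server selects URLs, or whether step 5 fetches pages again, so everything below is hypothetical: analyze_keyword simply lowercases and splits the keyword, the server's index is assumed to be a dictionary mapping URL to a short summary, and step 5 is interpreted as filtering the already captured pages with the secondary keyword.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def analyze_keyword(keyword):
    """Hypothetical keyword analysis: lowercase and split into terms."""
    return keyword.lower().split()


def contains_all(text, terms):
    """True when every term occurs in text (case-insensitive)."""
    lowered = text.lower()
    return all(term in lowered for term in terms)


def search_urls(index, keyword):
    """Steps 1-2: analyze the primary keyword, then select matching URLs.

    index is an assumed {url: summary} mapping held by the server.
    """
    terms = analyze_keyword(keyword)
    return [url for url, summary in index.items() if contains_all(summary, terms)]


def crawl_pages(urls, fetch):
    """Step 3: capture each selected URL and keep the results as a list."""
    pages = []
    for url in urls:
        text = fetch(url)  # fetch is a caller-supplied, hypothetical callable
        if text is not None:
            pages.append(Page(url, text))
    return pages


def refilter(pages, secondary_keyword):
    """Steps 4-5: keep only pages that also match the more specific keyword."""
    terms = analyze_keyword(secondary_keyword)
    return [page for page in pages if contains_all(page.text, terms)]


def format_results(pages):
    """Step 6: render the remaining pages as a numbered list."""
    return [f"{i}. {page.url}" for i, page in enumerate(pages, start=1)]


if __name__ == "__main__":
    # Assumed in-memory data standing in for the server index and the web.
    index = {
        "http://example.test/py": "python web crawler tutorial",
        "http://example.test/java": "java web crawler guide",
        "http://example.test/cook": "cooking recipes",
    }
    content = {
        "http://example.test/py": "A web crawler written in Python using a queue.",
        "http://example.test/java": "A web crawler written in Java.",
    }
    first_pass = crawl_pages(search_urls(index, "web crawler"), content.get)
    for line in format_results(refilter(first_pass, "python")):
        print(line)
```

Keeping the first-pass results as a list of Page objects mirrors the list display described in the embodiment and lets the secondary keyword narrow the results without repeating the primary search.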
To those skilled in the art, it is evident that the invention is not limited to the details of the exemplary embodiment above, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiment should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the claims rather than by the description above, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be included in the invention. Reference numerals in the claims should not be construed as limiting the claims concerned.

Claims (4)

1. A method for re-filtering when a web crawler captures webpages or data, characterized in that it is carried out by the following steps:
Step 1: input the keyword for the information to be searched;
Step 2: the server searches for URL addresses;
Step 3: capture the information of the target webpages from the URL addresses found;
Step 4: input a secondary search keyword;
Step 5: capture the webpage information again;
Step 6: output the target information.
2. The method for re-filtering when a web crawler captures webpages or data according to claim 1, characterized in that: in step 2, before searching for URL addresses, the server first analyzes the keyword and then selects the URL addresses to search.
3. The method for re-filtering when a web crawler captures webpages or data according to claim 1, characterized in that: in step 3, the information captured from the target webpages at the URL addresses found is displayed as a list.
4. The method for re-filtering when a web crawler captures webpages or data according to claim 1, characterized in that: in step 4, the secondary search keyword is a more specific descriptive term for the target information.
CN201310754635.6A 2013-12-31 2013-12-31 Method for re-filtering in webpage or data crawling by web crawler Pending CN103744944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310754635.6A CN103744944A (en) 2013-12-31 2013-12-31 Method for re-filtering in webpage or data crawling by web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310754635.6A CN103744944A (en) 2013-12-31 2013-12-31 Method for re-filtering in webpage or data crawling by web crawler

Publications (1)

Publication Number Publication Date
CN103744944A true CN103744944A (en) 2014-04-23

Family

ID=50501962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310754635.6A Pending CN103744944A (en) 2013-12-31 2013-12-31 Method for re-filtering in webpage or data crawling by web crawler

Country Status (1)

Country Link
CN (1) CN103744944A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537040A (en) * 2014-12-23 2015-04-22 小米科技有限责任公司 Method and device for capturing webpage content and electronic device
CN105302876A (en) * 2015-09-28 2016-02-03 孙燕群 Regular expression based URL filtering method
WO2017113324A1 (en) * 2015-12-31 2017-07-06 孙燕群 Regular expression-based url filtering method
CN107704515A (en) * 2017-09-01 2018-02-16 安徽简道科技有限公司 Data grab method based on internet data grasping system
CN110334280A (en) * 2019-07-10 2019-10-15 中国民航信息网络股份有限公司 A kind of method and device of discovery confidential information leakage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053092A1 (en) * 2004-09-01 2006-03-09 Chris Foo Method and system to perform dynamic search over a network
EP1975816A1 (en) * 2007-03-28 2008-10-01 British Telecommunications Public Limited Company Electronic document retrieval system
CN101847161A (en) * 2010-06-02 2010-09-29 苏州搜图网络技术有限公司 Method for searching web pages and establishing database
CN102253939A (en) * 2010-05-17 2011-11-23 无锡艾斯科软件有限公司 Searching method and system based on cloud computing technology
CN102270331A (en) * 2011-08-14 2011-12-07 黄斌 Network shopping navigating method based on visual search
CN102375813A (en) * 2010-08-09 2012-03-14 腾讯科技(深圳)有限公司 Duplicate detection system and method for search engines

Similar Documents

Publication Publication Date Title
Zhao Web scraping
CN103744944A (en) Method for re-filtering in webpage or data crawling by web crawler
CN104933056A (en) Uniform resource locator (URL) de-duplication method and device
CN104615627B (en) A kind of event public feelings information extracting method and system based on microblog
CN102254027A (en) Method for obtaining webpage contents in batch
US20160253295A1 (en) Method, device, terminal and computer storage medium for realizing intelligent reading of a browser
WO2014000537A1 (en) System and method for finding phishing website
CN105528422A (en) Focused crawler processing method and apparatus
CN105302876A (en) Regular expression based URL filtering method
CN103823907A (en) Method, device and engine for integrating on-line video resource addresses
CN103077250A (en) Method and device for capturing webpage content
CN103984749A (en) Focused crawler method based on link analysis
CN103823792A (en) Method and equipment for detecting hotspot events from text document
CN106021418A (en) News event clustering method and device
CN104991904A (en) Page data acquisition method of dynamic webpage
CN104598536B (en) A kind of distributed network information structuring processing method
US10491606B2 (en) Method and apparatus for providing website authentication data for search engine
CN105488402A (en) Dark link detection method and system
CN103605773A (en) Multimedia file searching method and device
CN104008213A (en) Method and device for finding and counting webpage information updating
JP2014532220A (en) Net comment collection method and system
CN103838865A (en) Method and device for mining timeliness seed page
CN103761669A (en) Method for applying web spider technology on online shopping
CN102819613A (en) RSS (really simple syndication) information paging fetching system and method
CN104268284A (en) Web browse filtering softdog device special for juveniles

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140423
