CN105512193A - Data acquisition system and method based on browser expansion - Google Patents

Publication number
CN105512193A
Authority
CN
China
Prior art keywords
browser
add
assemble
target
target web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510837235.0A
Other languages
Chinese (zh)
Inventor
吴凌峰
吴鹏越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ctrip Business Co Ltd
Original Assignee
Shanghai Ctrip Business Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-26
Filing date
2015-11-26
Publication date
2016-04-20
Application filed by Shanghai Ctrip Business Co Ltd
Priority to CN201510837235.0A
Publication of CN105512193A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a data acquisition system based on a browser extension. The system comprises a browser and an add-on component built on the browser's API. The add-on component actively polls a server and obtains a crawl target from it, controls the browser to open a target web page and access it, and controls the browser to obtain the crawl target from that page. The add-on component also controls the browser to access other content on the target page besides the crawl target, and finally controls the browser to close the page. The system has the following beneficial effects: data can be crawled more covertly; because a real browser is used directly, every access request is genuine and valid; all JavaScript on the target site's pages executes normally; and even if the target site changes its front-end anti-crawling measures, the browser adapts automatically, which reduces labor cost and maximizes the crawl success rate.

Description

Data acquisition system and method based on a browser extension
Technical field
The present invention relates to a data acquisition system, and in particular to a data acquisition system based on a browser extension and a data acquisition method implemented with that system.
Background
Over the past ten-plus years, Internet protocols have not changed in any essential way. Web pages, including ordinary online pages and H5 pages (HTML5, the fifth revision of HTML, the HyperText Markup Language), still follow W3C (World Wide Web Consortium) standards, and communication still relies on HTTP (the Hypertext Transfer Protocol). As a result, the crawler frameworks responsible for collecting data from the network have seen little substantial improvement over the same period.
However, with the development of JS (JavaScript, an interpreted scripting language) and browser technology, anti-crawling techniques keep emerging. Front-end anti-crawling is the most prominent of these, because it is the first barrier and can stop a crawler from getting in at all.
Existing vertical crawlers still mostly work by configuring the crawl steps and parameters in advance and then crawling continuously according to that configuration. The main weakness of this approach is that, when the target site changes its front-end anti-crawling strategy, the crawler cannot adapt on its own right away, so crawling inevitably fails for a period until a developer steps in.
In addition, many third-party (open-source) browser-control engines currently on the market were tried out and found to have various functional defects. They cannot faithfully simulate the behavioral characteristics of a browser and therefore risk being detected by the target site.
A more intelligent crawler system is therefore needed, one that adapts as soon as the target site changes its front-end anti-crawling strategy and keeps crawling correctly.
Summary of the invention
The technical problem to be solved by the present invention is to overcome two defects of the prior art: vertical crawlers that collect web page information cannot actively adapt when the target site changes its front-end anti-crawling strategy, and existing browser-control engines cannot fully simulate a browser's behavioral characteristics. To that end, the invention provides a data acquisition system and method based on a browser extension.
The present invention solves the above technical problem through the following technical solutions:
The invention provides a data acquisition system based on a browser extension, characterized in that it comprises a browser and an add-on component built on the API (application programming interface) of the browser;
the add-on component is configured to actively poll a server and obtain a crawl target from the server;
the add-on component is further configured to control the browser to open a target web page, to control the browser to access the target web page, and to control the browser to obtain the crawl target from the target web page;
the add-on component is further configured to control the browser to access other page content in the target web page besides the crawl target;
the add-on component is further configured to control the browser to close the target web page.
Preferably, the add-on component uses the API of the browser to perform control operations that realize the access to the target web page.
Preferably, the control operations include click operations, scroll operations and wave-control operations.
The invention also provides a data acquisition method based on a browser extension, characterized in that it is implemented with the above data acquisition system and comprises the following steps:
S1, the add-on component actively polls a server and obtains a crawl target from the server;
S2, the add-on component controls the browser to open a target web page;
S3, the add-on component controls the browser to access the target web page;
S4, the add-on component controls the browser to obtain the crawl target from the target web page;
S5, the add-on component controls the browser to access other page content in the target web page besides the crawl target;
S6, the add-on component controls the browser to close the target web page.
Preferably, the add-on component uses the API of the browser to perform control operations that realize the access to the target web page.
Preferably, the control operations include click operations, scroll operations and wave-control operations.
On the basis of common knowledge in the art, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the invention.
The positive effects of the present invention are as follows:
The invention allows data to be crawled more covertly. Because a real browser is used directly, every access request is genuine and valid, all JavaScript on the target site's pages executes normally, and all request parameters are constructed automatically by the browser without manual intervention. Even if the target site changes its front-end anti-crawling measures, the browser adapts automatically, which greatly reduces labor cost and maximizes the crawl success rate. Moreover, from a single request or a small number of requests the target site cannot determine whether the visitor is a crawler or a real user, so its administrators cannot easily block the crawler, which keeps the crawl running continuously.
Brief description of the drawings
Fig. 1 is a flowchart of the data acquisition method based on a browser extension according to a preferred embodiment of the invention.
Detailed description of the embodiments
The invention is further illustrated below by way of an embodiment, but is not thereby limited to the scope of that embodiment.
This embodiment provides a data acquisition system and method based on a browser extension. The system comprises a browser and an add-on component built on the API of the browser.
The add-on component provides the following functions: it drives the browser to access particular web pages; it controls the browser to open or close particular web pages; it controls specific behaviors on a browser page, such as click, scroll and wave-control operations; and it drives the browser to obtain information from the web pages it has loaded.
The add-on component is configured to actively poll a server and obtain a crawl target from the server;
the add-on component is further configured to control the browser to open a target web page, to control the browser to access the target web page, and to control the browser to obtain the crawl target from the target web page;
the add-on component is further configured to control the browser to access other page content in the target web page besides the crawl target;
the add-on component is further configured to control the browser to close the target web page.
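The polling behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `CrawlTarget` shape, the `PollDeps` interface, the function name `pollForTarget` and the retry limit are all hypothetical, and the server call and the delay are injected so the sketch does not depend on any particular browser or HTTP API.

```typescript
// Hypothetical shape of a crawl task handed out by the server.
interface CrawlTarget {
  url: string;      // page the browser should open
  selector: string; // CSS selector of the data to extract (assumed format)
}

// Everything the poller needs from the outside world is injected.
interface PollDeps {
  fetchTarget: () => Promise<CrawlTarget | null>; // null means no task is queued
  sleep: (ms: number) => Promise<void>;           // delay between polls
}

// Ask the server repeatedly until it returns a target or the attempt
// budget is used up. Returns null if no target was obtained.
async function pollForTarget(
  deps: PollDeps,
  intervalMs: number,
  maxAttempts: number,
): Promise<CrawlTarget | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const target = await deps.fetchTarget();
    if (target !== null) {
      return target;
    }
    // Nothing to do yet: wait before asking again.
    await deps.sleep(intervalMs);
  }
  return null;
}
```

Because the add-on initiates every exchange, the server never needs to reach into the browser; it only has to answer the next poll.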
As shown in Fig. 1, the data acquisition method is implemented with the above data acquisition system and comprises the following steps:
Step 101, the add-on component actively polls a server and obtains a crawl target from the server;
Step 102, the add-on component controls the browser to open a target web page;
Step 103, the add-on component controls the browser to access the target web page; the add-on component uses the API of the browser to perform control operations that realize this access, the control operations including click operations, scroll operations and wave-control operations;
Step 104, the add-on component controls the browser to obtain the crawl target from the target web page;
Step 105, the add-on component controls the browser to access other page content in the target web page besides the crawl target;
Step 106, the add-on component controls the browser to close the target web page.
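Steps 101 to 106 can be sketched as one crawl cycle. The sketch below is illustrative only: the `BrowserControl` interface and all of its method names are hypothetical stand-ins for whatever the browser's extension API offers, and the `try`/`finally` that guarantees the page is closed is a design choice of this sketch rather than something the patent specifies.

```typescript
// Hypothetical crawl task returned by the server in step 101.
interface CrawlTarget {
  url: string;
  selector: string;
}

// Operations the add-on performs through the browser's API. They are
// declared as an interface so the flow can be shown without a real browser.
interface BrowserControl {
  pollServer(): Promise<CrawlTarget | null>;                   // step 101
  openPage(url: string): Promise<number>;                      // step 102, returns a page id
  interact(pageId: number): Promise<void>;                     // step 103, click/scroll operations
  extract(pageId: number, selector: string): Promise<string>;  // step 104
  browseOther(pageId: number, exclude: string): Promise<void>; // step 105
  closePage(pageId: number): Promise<void>;                    // step 106
}

// Run one crawl cycle. Returns the extracted data, or null if the
// server had no task to hand out.
async function crawlOnce(browser: BrowserControl): Promise<string | null> {
  const target = await browser.pollServer();
  if (target === null) {
    return null;
  }
  const pageId = await browser.openPage(target.url);
  try {
    await browser.interact(pageId);
    const data = await browser.extract(pageId, target.selector);
    // Visit other content only after the data has been extracted.
    await browser.browseOther(pageId, target.selector);
    return data;
  } finally {
    // Close the page even if one of the steps above fails.
    await browser.closePage(pageId);
  }
}
```

In a Firefox WebExtension, `openPage` and `closePage` could plausibly be backed by `browser.tabs.create` and `browser.tabs.remove`; the patent itself does not name specific API calls.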
In this method, the add-on component uses the API provided by the browser (for example, Firefox) to scroll the page directly or to simulate mouse and keyboard input on the page. Many pages embed tracking points in order to monitor user behavior: when the user's mouse passes over an element or the page is scrolled to a certain height, a tracking network request is triggered. By simulating scroll and click operations, the add-on component can therefore realistically reproduce the browsing behavior of an ordinary user, so that the target site cannot tell from the behavioral data of a single user whether the current visitor is a real person.
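One way to picture the simulated scrolling is as a plan of uneven scroll steps with pauses, so that tracking code tied to scroll depth has a chance to fire. The sketch below is an assumption-laden illustration: the `ScrollStep` type, the `planScroll` function, and the step and pause ranges are all hypothetical, and the random source is injected so the plan can be reproduced.

```typescript
// One simulated scroll action (illustrative type, not a browser API).
interface ScrollStep {
  toY: number;     // vertical offset to scroll to, in pixels
  pauseMs: number; // how long to linger after scrolling
}

// Build a sequence of downward scroll steps that ends at the bottom of
// the page. `rand` must return a number in [0, 1).
function planScroll(
  pageHeight: number,
  viewportHeight: number,
  rand: () => number,
): ScrollStep[] {
  const steps: ScrollStep[] = [];
  const maxY = Math.max(0, pageHeight - viewportHeight);
  let y = 0;
  while (y < maxY) {
    // Advance by 40% to 100% of a viewport so the steps are uneven.
    const delta = Math.round(viewportHeight * (0.4 + 0.6 * rand()));
    y = Math.min(maxY, y + Math.max(1, delta));
    // Pause between 300 ms and 1500 ms (assumed range).
    steps.push({ toY: y, pauseMs: 300 + Math.round(1200 * rand()) });
  }
  return steps;
}
```

In a page context each step could be applied with `window.scrollTo(0, step.toY)` followed by a wait of `step.pauseMs`; passing `Math.random` as `rand` gives a different plan on every visit.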
The add-on component thus simulates a real user browsing the page. After the target page to be crawled has been opened and data extraction is complete, the add-on component does not close the page immediately. Instead, it finds certain specific elements on the page and clicks them, which simulates user behavior more realistically. An ordinary user does not keep visiting a single kind of page for a long time; such a user will inevitably browse other information of interest, such as content recommended by the site or reviews. Only then does the add-on component close this group of pages and start the next crawl. As a result, the target site cannot effectively identify crawler access, even with a user-behavior analysis model applied after the fact (out of band).
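The choice of which other elements to click after extraction can be sketched as a small selection function. Everything here is hypothetical: the `Candidate` type, its `category` labels, and the policy of preferring recommendations and reviews are assumptions drawn from the examples in the text, not a rule stated by the patent.

```typescript
// Minimal description of a clickable element found on the page (hypothetical).
interface Candidate {
  selector: string;
  category: "recommendation" | "review" | "other";
}

// Pick up to `count` elements to visit after the data has been extracted.
// The crawl target itself and uncategorized elements are skipped.
// `rand` must return a number in [0, 1).
function pickSecondaryClicks(
  candidates: Candidate[],
  targetSelector: string,
  count: number,
  rand: () => number,
): Candidate[] {
  const pool = candidates.filter(
    (c) => c.selector !== targetSelector && c.category !== "other",
  );
  // Fisher-Yates shuffle so the visiting order varies between runs.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, Math.max(0, count));
}
```

The add-on would click the returned elements in order and only then close the page, which matches the sequence of steps 105 and 106.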
Thus, the invention allows data to be crawled more covertly. Because a real browser is used directly, every access request is genuine and valid, all JavaScript on the target site's pages executes normally, and all request parameters are constructed automatically by the browser without manual intervention. Even if the target site changes its front-end anti-crawling measures, the browser adapts automatically, which greatly reduces labor cost and maximizes the crawl success rate. Moreover, from a single request or a small number of requests the target site cannot determine whether the visitor is a crawler or a real user, so its administrators cannot easily block the crawler, which keeps the crawl running continuously.
Although specific embodiments of the invention have been described above, those skilled in the art will understand that they are illustrative only; the scope of protection of the invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the invention, and all such changes and modifications fall within the scope of protection of the invention.

Claims (6)

1. A data acquisition system based on a browser extension, characterized in that it comprises a browser and an add-on component built on the API of the browser;
the add-on component is configured to actively poll a server and obtain a crawl target from the server;
the add-on component is further configured to control the browser to open a target web page, to control the browser to access the target web page, and to control the browser to obtain the crawl target from the target web page;
the add-on component is further configured to control the browser to access other page content in the target web page besides the crawl target;
the add-on component is further configured to control the browser to close the target web page.
2. The data acquisition system based on a browser extension as claimed in claim 1, characterized in that the add-on component uses the API of the browser to perform control operations that realize the access to the target web page.
3. The data acquisition system based on a browser extension as claimed in claim 2, characterized in that the control operations include click operations, scroll operations and wave-control operations.
4. A data acquisition method based on a browser extension, characterized in that it is implemented with the data acquisition system based on a browser extension as claimed in claim 1 and comprises the following steps:
S1, the add-on component actively polls a server and obtains a crawl target from the server;
S2, the add-on component controls the browser to open a target web page;
S3, the add-on component controls the browser to access the target web page;
S4, the add-on component controls the browser to obtain the crawl target from the target web page;
S5, the add-on component controls the browser to access other page content in the target web page besides the crawl target;
S6, the add-on component controls the browser to close the target web page.
5. The data acquisition method based on a browser extension as claimed in claim 4, characterized in that the add-on component uses the API of the browser to perform control operations that realize the access to the target web page.
6. The data acquisition method based on a browser extension as claimed in claim 5, characterized in that the control operations include click operations, scroll operations and wave-control operations.
CN201510837235.0A 2015-11-26 2015-11-26 Data acquisition system and method based on browser expansion Pending CN105512193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510837235.0A CN105512193A (en) 2015-11-26 2015-11-26 Data acquisition system and method based on browser expansion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510837235.0A CN105512193A (en) 2015-11-26 2015-11-26 Data acquisition system and method based on browser expansion

Publications (1)

Publication Number Publication Date
CN105512193A true CN105512193A (en) 2016-04-20

Family

ID=55720175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510837235.0A Pending CN105512193A (en) 2015-11-26 2015-11-26 Data acquisition system and method based on browser expansion

Country Status (1)

Country Link
CN (1) CN105512193A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078136A1 (en) * 2000-12-14 2002-06-20 International Business Machines Corporation Method, apparatus and computer program product to crawl a web site
CN101996196A (en) * 2009-08-28 2011-03-30 中国移动通信集团公司 Dynamic webpage acquisition method and device
CN102214098A (en) * 2011-06-15 2011-10-12 中山大学 Dynamic webpage data acquisition method based on WebKit browser engine
CN102375951A (en) * 2011-10-18 2012-03-14 北龙中网(北京)科技有限责任公司 Webpage security detection method and system
CN103092936A (en) * 2013-01-08 2013-05-08 华北电力大学(保定) Real-time information acquisition method of dynamic page of Internet of Things
CN103186670A (en) * 2013-03-27 2013-07-03 中金数据系统有限公司 Method and system for integrally acquiring webpage information
CN104933138A (en) * 2015-06-16 2015-09-23 携程计算机技术(上海)有限公司 Webpage crawler system and webpage crawling method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜皓文: ""基于状态转换的动态爬虫系统设计与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644028A (en) * 2016-07-20 2018-01-30 平安科技(深圳)有限公司 The collection method and system of web data
CN107644028B (en) * 2016-07-20 2020-09-04 平安科技(深圳)有限公司 Method and system for collecting webpage data
CN108874810A (en) * 2017-05-10 2018-11-23 北京京东尚科信息技术有限公司 The method and apparatus of information collection
CN106933472A (en) * 2017-05-20 2017-07-07 南京西桥科技有限公司 A kind of user behavior data acquisition system and its control method based on mobile phone APP
CN109189660A (en) * 2018-09-30 2019-01-11 北京诸葛找房信息技术有限公司 A kind of crawler recognition methods based on user's mouse interbehavior
CN109800123A (en) * 2018-12-14 2019-05-24 深圳壹账通智能科技有限公司 Automate electric quantity test method, apparatus, computer equipment and storage medium
CN111125489A (en) * 2019-12-25 2020-05-08 北京锐安科技有限公司 Data capturing method, device, equipment and storage medium
CN112800311A (en) * 2021-02-05 2021-05-14 厦门市美亚柏科信息股份有限公司 Browser page data acquisition method, terminal device and storage medium

Similar Documents

Publication Publication Date Title
CN105512193A (en) Data acquisition system and method based on browser expansion
CN106844522B (en) A kind of network data crawling method and device
CN104766014B (en) For detecting the method and system of malice network address
CN104601573B (en) A kind of Android platform URL accesses result verification method and device
CN102469113B (en) Security gateway and method for forwarding webpage by using security gateway
CN102222187B (en) Domain name structural feature-based hang horse web page detection method
US10819772B2 (en) Transformation of a content file into a content-centric social network
US8972412B1 (en) Predicting improvement in website search engine rankings based upon website linking relationships
CN106897215A (en) A kind of method gathered based on WebView webpages loading performance and user behavior flow data
CN103605738B (en) Web page access data statistical method and device
CN102486799B (en) World wide web (WWW) page processing method and device
US20110191664A1 (en) Systems for and methods for detecting url web tracking and consumer opt-out cookies
CN108664559A (en) A kind of automatic crawling method of website and webpage source code
CN109033115A (en) A kind of dynamic web page crawler system
WO2013126084A2 (en) Graphical overlay related to data mining and analytics
CN103218431A (en) System and method for identifying and automatically acquiring webpage information
CN107437026B (en) Malicious webpage advertisement detection method based on advertisement network topology
CN101382947A (en) Method and device for determining pointing distribution information in page
US20100169177A1 (en) Method and system for assessing behavior of a webpage visitor
CN104182412A (en) Webpage crawling method and webpage crawling system
CN104268282A (en) Web banner advertisement displaying method and system
CN106411868A (en) Method for automatically identifying web crawler
CN110555146A (en) method and system for generating network crawler camouflage data
CN103312692B (en) Chained address safety detecting method and device
CN105376311A (en) Method and device for determining page stay duration based on terminal access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160420