CN106991188A - Efficient method and system for automatically screening and crawling dynamic internet data - Google Patents

Efficient method and system for automatically screening and crawling dynamic internet data

Info

Publication number
CN106991188A
Authority
CN
China
Prior art keywords
data
dynamic
internet
host
crawl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710232731.2A
Other languages
Chinese (zh)
Inventor
史飞悦
房鹏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd
Priority to CN201710232731.2A
Publication of CN106991188A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An efficient method and system for automatically screening and crawling dynamic internet data. A browser is started first, and input, click, and navigation operations on the page are simulated in order to automatically screen and crawl dynamic internet data and save it by category. The method comprises: 1) locating the data of interest, loading the list of main elements on which switching between data sets depends, and starting to traverse each main element; 2) searching for and locating the list of all child elements under the current main element, starting to traverse each child element, and obtaining the name of the main element to which the selected child element belongs; 3) repeating step 2) until no main element has any child elements left; 4) starting to automatically crawl the dynamically loaded internet data according to the main and child elements selected by the simulated screening above. The method and system can crawl dynamically loaded internet data efficiently and accurately, greatly improving the efficiency and accuracy of dynamic internet data crawling.

Description

Efficient method and system for automatically screening and crawling dynamic internet data
Technical field
The present invention relates to the technical field of network data crawling, and in particular to a method and system for crawling dynamic network data.
Background
With the arrival of the information age, the internet holds a wealth of public data resources, and academic, educational, commercial product, and other information is spread across all kinds of network platforms. For reasons of security, timeliness, and speed, most internet data is presented to users through dynamic web loading techniques; in addition, some important resources can be accessed only after the user logs in. Both factors make crawling internet data more difficult.
Traditional internet data crawling is essentially based on the static HTML content of a specified URL: a crawler tool downloads the content and then parses and extracts the data. Data can be obtained only by analyzing the web page at a known URL; the crawler cannot interact with the page as a user would in order to screen the data, and traditional crawling systems are helpless when faced with HTML content loaded dynamically through JS and Ajax. For such data, the approach considered here is therefore to operate a browser and simulate human actions such as logging in and clicking, so that the dynamically loaded internet data is rendered and the integrity of the data is ensured.
The present invention provides a method design and a system implementation for screening and crawling dynamically loaded internet data. A browser is started first, and operations such as page input, clicking, and navigation are simulated; the designed method then automatically screens and crawls dynamic internet data and saves it by category. In practice, the method and system have been able to crawl dynamically loaded internet data efficiently and accurately, greatly improving the efficiency and accuracy of dynamic internet data crawling.
Summary of the invention
In view of the above background, the object of the present invention is to propose an efficient method and system for automatically screening and crawling dynamic internet data. It mainly provides a method design and a system implementation for the automatic screening and crawling of dynamically loaded internet data. A browser is started first, and operations such as page input, clicking, and navigation are simulated; the designed method then automatically screens and crawls dynamic internet data and saves it by category. The method and system can crawl dynamically loaded internet data efficiently and accurately, greatly improving the efficiency and accuracy of dynamic internet data crawling.
The technical solution of the present invention is as follows. In an efficient method for automatically screening and crawling dynamic internet data, a browser is started first, and operations such as page input, clicking, and navigation are simulated; the designed method then automatically screens and crawls dynamic internet data and saves it by category. The method for automatically screening dynamic internet data comprises the following steps:
Step 1: Locate the data of interest, load the list of main elements on which switching between data sets depends, and start traversing each main element;
Step 2: Search for and locate the list of all child elements under the current main element, start traversing each child element, and obtain the name of the main element to which the selected child element belongs;
Step 3: Repeat step 2 until no main element has any child elements left;
Step 4: According to the main and child elements selected by the simulated screening above, start automatically crawling the dynamically loaded internet data;
Step 5: Repeat steps 1 to 4 until all main elements and child elements have been traversed and screened one by one.
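The screening steps above amount to a nested traversal. As a minimal sketch in Python (the patent names no language or library), the page's filter controls are modeled here as a plain dictionary rather than live browser elements; `iter_filter_selections`, `screen_and_crawl`, and the `crawl` callback are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for the page's filter controls: each main (host)
# element maps to the child elements it reveals once it has been selected.
FilterTree = Dict[str, List[str]]


def iter_filter_selections(tree: FilterTree) -> Iterator[Tuple[str, str]]:
    """Yield every (main element, child element) pair in listed order."""
    for main_name, children in tree.items():   # steps 1 and 5: walk main elements
        for child_name in children:            # steps 2 and 3: walk their children
            # The owning main element's name travels with the child so that
            # crawled data can later be saved under the right category.
            yield main_name, child_name


def screen_and_crawl(
    tree: FilterTree, crawl: Callable[[str, str], object]
) -> Dict[Tuple[str, str], object]:
    """Run the crawl callback once per screened selection (step 4)."""
    return {
        (main_name, child_name): crawl(main_name, child_name)
        for main_name, child_name in iter_filter_selections(tree)
    }


if __name__ == "__main__":
    demo = {"Category A": ["a1", "a2"], "Category B": ["b1"]}
    for selection in iter_filter_selections(demo):
        print(selection)
```

In a real crawler the two loops would click elements in the browser and wait for the page to load before moving on; the dictionary only makes the traversal order visible.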
The method for automatically crawling dynamic internet data comprises the following steps:
Step 1: Search for the list of all data elements loaded in the current data area, and start traversing and locating each data element;
Step 2: Obtain the number of the data element and, using the recorded main element and child element to which it belongs, create a local folder for storing the content of the data element;
Step 3: Search for picture elements in the current data element and save the picture data to the corresponding local folder;
Step 4: Replace the picture elements in the source code of the current data element with text labels, and save the resulting text data to the corresponding local folder;
Step 5: Repeat steps 2 to 4 until all data in the current data area has been crawled;
Step 6: Determine whether the page navigation element of the current data area indicates a next page of content; if so, repeat steps 1 to 5; otherwise stop.
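Two of the crawling steps above lend themselves to a short illustration: replacing picture elements with text labels (step 4) and following page navigation until no next page remains (step 6). The sketch below is an assumption-laden Python stand-in: it uses a regular expression on an HTML string, whereas the embodiment runs a JS script inside the browser, and the label format `[img_N]` as well as the names `replace_images_with_labels` and `crawl_all_pages` are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Matches an <img ...> tag and captures the value of its src attribute.
IMG_TAG = re.compile(
    r"<img\b[^>]*?\bsrc\s*=\s*[\"']([^\"']+)[\"'][^>]*>", re.IGNORECASE
)


def replace_images_with_labels(html: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Swap each <img> for a numbered text label.

    Returns the rewritten markup and a list of (label name, src) pairs that
    a separate download step could use to save each picture under that name.
    """
    downloads: List[Tuple[str, str]] = []

    def _label(match):
        name = f"img_{len(downloads) + 1}"   # assumed numbering rule
        downloads.append((name, match.group(1)))
        return f"[{name}]"

    return IMG_TAG.sub(_label, html), downloads


def crawl_all_pages(
    first_page: object,
    get_items: Callable[[object], List[object]],
    get_next_page: Callable[[object], Optional[object]],
) -> List[object]:
    """Collect items page by page until no next page is reported."""
    items: List[object] = []
    page: Optional[object] = first_page
    while page is not None:
        items.extend(get_items(page))
        page = get_next_page(page)
    return items
```

Keeping the label name and the saved picture's file name identical is what lets the stored text be matched back to its pictures later.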
The present invention also discloses an efficient dynamic internet data crawling system, comprising: a system initialization service module, a website simulated-login module, a dynamic data automatic screening module, and a dynamic data automatic crawling module.
The system initialization service module initializes the global variables used while the system runs, including the data storage root directory, the driver object for simulated browser operations, and the browser page load timeout, and loads the list of data that has already been crawled;
The website simulated-login module starts the browser, opens the site home page, and simulates logging in;
The dynamic data automatic screening module uses the automatic screening method for dynamic data to switch quickly between data sets;
The dynamic data automatic crawling module uses the automatic crawling method for dynamic data to classify the data automatically and download it to local storage.
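The state held by the system initialization service module can be pictured as one small object shared by the other modules. The following Python sketch is only an assumption about how such state might be grouped: the class name `CrawlerContext`, its field names, and the 30-second default timeout are all hypothetical, and the browser driver is left as an opaque handle because the patent does not name an automation library.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional, Set, Tuple


@dataclass
class CrawlerContext:
    """Global state set up once before login, screening, and crawling."""

    storage_root: Path                      # data storage root directory
    page_load_timeout_s: float = 30.0       # assumed default, not from the patent
    driver: Optional[Any] = None            # browser automation handle, started once
    captured: Set[Tuple[str, ...]] = field(default_factory=set)  # already-crawled keys

    def already_captured(self, *key: str) -> bool:
        """Return True if this combination of categories was crawled before."""
        return tuple(key) in self.captured

    def mark_captured(self, *key: str) -> None:
        """Record a combination of categories as crawled."""
        self.captured.add(tuple(key))
```

Loading the set of already-crawled keys at startup is what would let an interrupted run skip finished categories instead of starting over.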
Beneficial effects: the present invention provides a method design and a system implementation for crawling dynamically loaded internet data. The method and system can crawl dynamically loaded internet data efficiently and accurately, greatly improving the efficiency and accuracy of dynamic internet data crawling.
Brief description of the drawings
Fig. 1 is a flow chart of the automatic data screening method of the present invention, using dynamically loaded exercise data as the embodiment.
Fig. 2 is a flow chart of the automatic data crawling method of the present invention, using dynamically loaded exercise data as the embodiment.
Fig. 3 is a structural diagram of the data crawling system of the present invention, using dynamically loaded exercise data as the embodiment.
Detailed description of the embodiment
The present invention is described in further detail below with reference to the accompanying drawings, taking dynamically loaded exercise data on the internet as the embodiment.
As shown in Fig. 1, the flow of the automatic screening method for dynamic exercise data in the embodiment of the present invention comprises the following steps:
Step 11: Search for and locate the grade information element to which the exercises belong, obtain the list of all grade elements, and start traversing each grade element; simulate selecting the grade element button and wait for the page data to finish loading.
Step 12: Search for the top-level chapter elements of the course for the grade, simulate clicking every top-level chapter element that is not yet expanded, and wait for the page to load. Obtain all sub-chapter elements and start traversing them: simulate selecting each sub-chapter, wait for the page to finish loading, and obtain and record the name of the top-level chapter to which the selected sub-chapter belongs.
Step 13: Search for all question types available for the chapter and start traversing each question type element: simulate clicking each question type, wait for the page to finish loading, and record the name of the clicked question type.
Step 14: Based on the selections made by the simulated screening above, locate the exercise data area and search for exercise data in the current page. If there is currently no exercise data, move directly to the next question type in step 13; otherwise start crawling the exercise data.
Step 15: Repeat steps 11 to 14 until all grades, all chapters, and all question types have been traversed and screened one by one.
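Steps 11 to 15 walk a four-level hierarchy (grade, top-level chapter, sub-chapter, question type) and skip any question type that currently has no exercises. A minimal Python sketch of that control flow follows, with the site's hierarchy replaced by a nested dictionary so that it runs without a browser; `iter_non_empty_leaves` and the sample catalog are hypothetical.

```python
from typing import Iterator, List, Mapping, Tuple


def iter_non_empty_leaves(
    node: Mapping, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], List[str]]]:
    """Depth-first walk that yields (path, exercises) for non-empty leaves.

    Inner levels are mappings (grade -> chapter -> sub-chapter -> question
    type); a leaf is the list of exercises shown for that selection.
    """
    for name, child in node.items():
        here = path + (name,)               # record the names selected so far
        if isinstance(child, Mapping):
            yield from iter_non_empty_leaves(child, here)
        elif child:                         # step 14: skip selections with no data
            yield here, list(child)


if __name__ == "__main__":
    catalog = {
        "Grade 7": {"Chapter 1": {"1.1": {"Multiple choice": ["q1"], "Essay": []}}},
    }
    for path, exercises in iter_non_empty_leaves(catalog):
        print("/".join(path), exercises)
```

The recorded path is the same information the crawling method later uses to build its local folder names.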
All operations of the present invention (locating, searching for, and switching data) depend on the browser, so the browser must be started first (once only). Automatic screening and crawling also both require page operations such as clicking and navigation.
As shown in Fig. 2, the flow of the automatic crawling method for dynamic exercise data in the embodiment of the present invention comprises the following steps:
Step 21: Search for the list of all exercise data elements in the current data area, and traverse and locate each exercise data element.
Step 22: Locate and obtain the number of the exercise, which serves as its unique identifier; if it cannot be obtained, use the number of milliseconds elapsed between January 1, 1970 and the current time as the exercise number. Using the recorded grade, chapter, and question type, create the local folders for storing the exercise's question data and answer data, in the forms /grade/chapter/question type/number/question and /grade/chapter/question type/number/answer respectively.
Step 23: Search for and locate the answer button of the exercise, attempt a simulated click to open the answer (up to 5 attempts), and wait for the exercise's answer content list to pop up. If the click fails to open it, move on to the next exercise element in step 21. Otherwise, locate the answer content list and check whether it contains picture elements. If it does, number the picture elements according to a fixed rule, call the picture download method, and save each picture under its numbered name in the corresponding answer directory; if there are no picture elements, proceed directly to the next step.
Step 24: Locate the answer content area and run a JS script that replaces the picture elements in the area's source code with text labels; obtain the answer text after replacement and save it to the corresponding answer directory. Search for and locate the close button of the answer window, simulate clicking it, and return to the exercise's question area.
Step 25: Crawl the exercise's question data in a manner similar to steps 23 and 24, saving the question's picture data and the question text after replacement to the question folder.
Step 26: Repeat steps 22 to 25 until all exercise data and the corresponding answer data in the current exercise data area have been crawled.
Step 27: Determine whether the page navigation element of the current exercise data area indicates a next page of content. If there is a next page, simulate clicking the next-page button, wait for the page to finish loading its exercise data, and repeat steps 21 to 26; otherwise end the automatic crawling for the current question type.
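Three details of steps 22 and 23 can be sketched compactly: the fallback exercise number (milliseconds since January 1, 1970), the folder layout, and the bounded retry of the answer-button click. The Python below is a minimal sketch under stated assumptions: the function names are hypothetical, the click is abstracted as a callback that returns True on success, and category names are assumed to contain no path separators.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Tuple


def exercise_id(page_number: Optional[str]) -> str:
    """Use the number shown on the page, else milliseconds since 1970-01-01."""
    if page_number and page_number.strip():
        return page_number.strip()
    return str(int(time.time() * 1000))


def make_exercise_dirs(
    root: Path, grade: str, chapter: str, question_type: str, number: str
) -> Tuple[Path, Path]:
    """Create <root>/grade/chapter/question type/number/{question,answer}."""
    base = root / grade / chapter / question_type / number
    question_dir = base / "question"
    answer_dir = base / "answer"
    question_dir.mkdir(parents=True, exist_ok=True)
    answer_dir.mkdir(parents=True, exist_ok=True)
    return question_dir, answer_dir


def click_with_retries(click: Callable[[], bool], attempts: int = 5) -> bool:
    """Call click() up to `attempts` times; stop at the first success."""
    for _ in range(attempts):
        if click():
            return True
    return False   # caller moves on to the next exercise element
```

A timestamp fallback keeps folder names unique only if two exercises are not processed within the same millisecond, which seems safe when each one involves browser round trips.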
As shown in Fig. 3, the data crawling system of the embodiment of the present invention comprises:
a system initialization service module 31, an exercise website simulated-login module 32, a dynamic exercise data automatic screening module 33, and a dynamic exercise data automatic crawling module 34.
The system initialization service module 31 initializes the global variables used while the system runs, including the data storage root directory, the driver object for simulated browser operations, and the browser page load timeout, and loads the grade, chapter, and question type information of exercise data that has already been crawled.
The exercise website simulated-login module 32 starts the browser, opens the exercise data home page, searches for and clicks the login button, automatically enters the user name and password to log in, searches for the exercise category button, and navigates to the exercise page.
The dynamic exercise data automatic screening module 33 uses the automatic screening method for dynamic exercise data to switch quickly to each exercise.
The dynamic exercise data automatic crawling module 34 uses the automatic crawling method for dynamic exercise data to classify the question data and answer data of each exercise automatically and download them to local storage.
The foregoing is merely one embodiment of the present invention and is not intended to limit the invention; any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (2)

1. A method for automatically screening and crawling dynamic internet data, characterized in that a browser is started first, and input, click, and navigation operations on the page are simulated in order to automatically screen and crawl dynamic internet data and save it by category; the screening comprises the following steps:
Step 1: Locate the data of interest, load the list of main elements on which switching between data sets depends, and start traversing each main element;
Step 2: Search for and locate the list of all child elements under the current main element, start traversing each child element, and obtain the name of the main element to which the selected child element belongs;
Step 3: Repeat step 2 until no main element has any child elements left;
Step 4: According to the main and child elements selected by the simulated screening above, start automatically crawling the dynamically loaded internet data;
Step 5: Repeat steps 1 to 4 until all main elements and child elements have been traversed and screened one by one;
The method for automatically crawling dynamic internet data comprises the following steps:
Step 1: Search for the list of all data elements loaded in the current data area, and start traversing and locating each data element;
Step 2: Obtain the number of the data element and, using the recorded main element and child element to which it belongs, create a local folder for storing the content of the data element;
Step 3: Search for picture elements in the current data element and save the picture data to the corresponding local folder;
Step 4: Replace the picture elements in the source code of the current data element with text labels, and save the resulting text data to the corresponding local folder;
Step 5: Repeat steps 2 to 4 until all data in the current data area has been crawled;
Step 6: Determine whether the page navigation element of the current data area indicates a next page of content; if so, repeat steps 1 to 5; otherwise stop.
2. The method for automatically screening and crawling internet data according to claim 1, characterized in that it is provided with a system initialization service module, a website simulated-login module, a dynamic data automatic screening module, and a dynamic data automatic crawling module;
The system initialization service module initializes the global variables used while the system runs, including the data storage root directory, the driver object for simulated browser operations, and the browser page load timeout, and loads the list of data that has already been crawled;
The website simulated-login module starts the browser, opens the site home page, and simulates logging in;
The dynamic data automatic screening module uses the automatic screening method for dynamic data to switch quickly between data sets;
The dynamic data automatic crawling module uses the automatic crawling method for dynamic data to classify the data automatically and download it to local storage.
CN201710232731.2A 2017-04-11 2017-04-11 Efficient method and system for automatically screening and crawling dynamic internet data Pending CN106991188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710232731.2A 2017-04-11 2017-04-11 Efficient method and system for automatically screening and crawling dynamic internet data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710232731.2A 2017-04-11 2017-04-11 Efficient method and system for automatically screening and crawling dynamic internet data

Publications (1)

Publication Number Publication Date
CN106991188A (en) 2017-07-28

Family

ID=59414978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710232731.2A Pending CN106991188A (en) 2017-04-11 2017-04-11 Efficient method and system for automatically screening and crawling dynamic internet data

Country Status (1)

Country Link
CN (1) CN106991188A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171986A1 (en) * 2007-12-27 2009-07-02 Yahoo! Inc. Techniques for constructing sitemap or hierarchical organization of webpages of a website using decision trees
CN101515300A (en) * 2009-04-02 2009-08-26 阿里巴巴集团控股有限公司 Method and system for grabbing Ajax webpage content
CN104765746A (en) * 2014-01-06 2015-07-08 腾讯科技(深圳)有限公司 Data processing method and device for mobile communication terminal browser
CN106055714A (en) * 2016-07-06 2016-10-26 浙江工商大学 Method for capturing cloud calculating data from RIA (Rich Internet Application) page
CN106126697A (en) * 2016-06-30 2016-11-16 广州市皓轩软件科技有限公司 A kind of sing on web multidate information captures the details page automatic generation method of technology

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256106A (en) * 2018-02-06 2018-07-06 深圳鼎智通讯股份有限公司 A kind of analog access website adapter system
CN108256106B (en) * 2018-02-06 2021-11-02 深圳鼎智通讯股份有限公司 Simulation access website adapter system
CN108920525A (en) * 2018-06-05 2018-11-30 北京纳人网络科技有限公司 Web-based target user's screening technique device
CN109408695A (en) * 2018-09-27 2019-03-01 苏州创旅天下信息技术有限公司 Competing product data grab method and system
CN110177139A (en) * 2019-05-23 2019-08-27 中国搜索信息科技股份有限公司 A kind of ostensible mobile APP data grab method
CN112380519A (en) * 2020-11-23 2021-02-19 杭州冒险元素网络技术有限公司 Internet data capturing method
CN115277396A (en) * 2022-08-04 2022-11-01 北京智慧星光信息技术有限公司 Message driving method and system for simulating browser operation
CN115277396B (en) * 2022-08-04 2024-03-26 北京智慧星光信息技术有限公司 Message driving method and system for simulating browser operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170728