CN110472126A - Method, apparatus, and device for acquiring page data - Google Patents

Method, apparatus, and device for acquiring page data

Info

Publication number
CN110472126A
Authority
CN
China
Prior art keywords
page data
page
webpage
data
domain name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810442578.0A
Other languages
Chinese (zh)
Inventor
齐希
朱骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Zhejiang Co Ltd
Priority to CN201810442578.0A
Publication of CN110472126A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention provide a method, apparatus, and device for acquiring page data. The user's selection of a page element is received, and the XML Path Language (XPath) expression of the selected page element is extracted to obtain the domain name of the webpage. Page-structure analysis is performed on the webpage according to the domain name to obtain the uniform resource locators (URLs) in the webpage. The URLs are sent to a crawler engine, which extracts and returns page data according to them. The page data returned by the crawler engine is received and returned to the user. By selecting a page element of a webpage, the user can thus obtain all page data of the website corresponding to the domain name extracted from that page element without writing an acquisition script, which lowers the technical threshold of webpage data acquisition and improves its efficiency.

Description

Method, apparatus, and device for acquiring page data
Technical field
Embodiments of the present invention relate to the technical field of data processing, and more particularly to a method, apparatus, and device for acquiring page data.
Background
With the rise of big data, enterprises pay increasing attention to data assets. After building their own data platforms, they have rushed to build dedicated crawler platforms or rent public-cloud crawler platforms to collect Internet data, and they aggregate internal and external data to raise the overall value of their data.
At present, Internet data mining widely uses crawler engines to acquire page data. As shown in Fig. 1, the main acquisition flow is as follows: 1. the user analyzes the Internet data mining business scenario and confirms the target webpage and page elements; 2. the page structure is analyzed and an acquisition script is written; 3. the crawler engine schedules the acquisition script and acquires page data through it; 4. the collected page data is persisted to a database; 5. the user extracts structured data from the database.
Because an acquisition script must be written during Internet data mining, an experienced engineer currently has to understand the data requirement and then write the script, which poses a high technical threshold for users. Moreover, a different webpage, or even a small change in the data requirement for the same webpage, requires the acquisition script to be rewritten, so acquisition scripts are poorly reusable and data acquisition is inefficient.
Summary of the invention
To overcome the above problem, or at least partially solve it, embodiments of the present invention provide a method, apparatus, and device for acquiring page data.
An embodiment of the present invention provides a method for acquiring page data, comprising: obtaining a page element selected by the user in a webpage, extracting the XML Path Language (XPath) expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression; performing page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of uniform resource locators (URLs) in the webpage; sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set; and receiving the page data returned by the crawler engine and returning the page data to the user.
An embodiment of the present invention provides an apparatus for acquiring page data, comprising a parsing module, an analysis module, a data sending module, and a data return module. The parsing module is configured to obtain a page element selected by the user in a webpage, extract the XPath expression corresponding to the page element, and obtain the domain name of the webpage according to the XPath expression. The analysis module is configured to perform page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of URLs in the webpage. The data sending module is configured to send the first resource locator set to a crawler engine, so that the crawler engine extracts page data according to the first resource locator set. The data return module is configured to receive the page data returned by the crawler engine and return the page data to the user.
An embodiment of the present invention provides a device for acquiring page data, comprising at least one processor, at least one memory, and a communication bus, wherein the processor and the memory communicate with each other through the communication bus, the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the above method.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the above method.
In the method, apparatus, and device for acquiring page data provided by the embodiments of the present invention, the user's selection of a page element is received, and the XPath expression of the selected page element is extracted to obtain the domain name of the webpage; page-structure analysis is performed on the webpage according to the domain name to obtain the URLs in the webpage; the URLs are sent to a crawler engine, which extracts and returns page data according to them; and the page data returned by the crawler engine is received and returned to the user. By selecting a page element of a webpage, the user can thus obtain all page data of the website corresponding to the domain name extracted from that page element without writing an acquisition script, which lowers the technical threshold of webpage data acquisition and improves its efficiency.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below evidently show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of page data acquisition in the prior art;
Fig. 2 is a flow chart of the method for acquiring page data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the page structure according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the apparatus for acquiring page data according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the device for acquiring page data according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for acquiring page data. With reference to Fig. 2, the method comprises: S21, obtaining a page element selected by the user in a webpage, extracting the XML Path Language (XPath) expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression; S22, performing page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of uniform resource locators (URLs) in the webpage; S23, sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set; S24, receiving the page data returned by the crawler engine and returning the page data to the user.
Specifically, a domain name is the name of a computer or group of computers on the Internet, consisting of a string of names separated by dots, and identifies the electronic location of the computer during data transmission. Domain names are used in the Domain Name System and make the Internet easier to access, because users need not remember the numeric IP addresses that machines read directly. A webpage is a plain-text file containing HTML tags; it is one "page" of the World Wide Web, and webpages are connected to one another through hyperlinks (that is, the URLs in the page). A uniform resource locator (URL) is a concise representation of the location of, and access method for, a resource available on the Internet; it is the address of a standard resource on the Internet. When a webpage is designed, URLs must be written in compliance with Internet standards and are usually written based on the domain name.
XML Path Language (XPath) is a language for querying information in Extensible Markup Language (XML) documents. XPath uses path expressions to select nodes or node sets in an XML document, and the character string corresponding to a node or node set in the XML document contains the domain-name information of the website. Page elements are the elements displayed on a webpage, including text, pictures, audio, animation, video, and so on; by clicking a page element, the user can follow a link from one webpage to another. The XML document corresponding to a page element stores the domain name of the website and the corresponding URLs, which are used to obtain and transmit page data.
In this embodiment, the user selects a page element in the webpage, for example by clicking it with the mouse or by box selection. In response to the selection of the page element, the page data acquisition device extracts the XPath expression of the selected page element; the XPath expression contains the domain name of the webpage, and the device extracts that domain name.
Web developers design a webpage on the basis of a certain page structure, so the page data acquisition device can obtain the URLs in the webpage by analyzing the page structure. The device assembles these URLs into the first resource locator set and sends the set to the crawler engine. The crawler engine uses the URLs in the first resource locator set as data extraction addresses, crawls the page data, and returns the crawled page data to the page data acquisition device. The device returns this page data to the user, so that the user can obtain all page data of the website corresponding to the domain name extracted from the page element.
In this embodiment, the user's selection of a page element is received, and the XPath expression of the selected page element is extracted to obtain the domain name of the webpage; page-structure analysis is performed on the webpage according to the domain name to obtain the URLs in the webpage; the URLs are sent to a crawler engine, which extracts and returns page data according to them; and the page data returned by the crawler engine is received and returned to the user. By selecting a page element of a webpage, the user can thus obtain all page data of the website corresponding to the domain name extracted from that page element without writing an acquisition script, which lowers the technical threshold of webpage data acquisition and improves its efficiency.
Based on the above embodiment, after receiving the page data returned by the crawler engine, the method further comprises: persisting the page data to a database, and retrieving page data from the database according to the domain name to obtain a search result. Correspondingly, returning the page data to the user comprises: returning the search result, which includes the page data, to the user.
Specifically, after receiving the page data returned by the crawler engine, the page data acquisition device also persists the page data, that is, stores it in a database. The device can then search the database according to the domain name of the webpage and return the search result to the user. Once the page data has been persisted to the database, the user can obtain the required data from the database, retrieve data more conveniently, and manage and update it more easily.
Based on the above embodiment, performing page-structure analysis on the webpage according to the domain name to obtain the first resource locator set comprises: determining the root node of the page structure of the webpage according to the domain name, and traversing the nodes of the page structure with a depth-first algorithm, starting from the root node, to obtain the text string of each node; determining a regular expression according to the domain name, the regular expression containing the character string corresponding to the domain name; and matching the text string of each node against the regular expression to obtain the URLs in the webpage, which form the first resource locator set.
Specifically, the page structure of a webpage is usually a multiway tree as shown in Fig. 3, with fixed link relationships between nodes. In this embodiment, the root node of the page structure is determined according to the domain name of the website, and the nodes of the page structure are traversed from the root node according to the link relationships between nodes. The traversal can use a depth-first algorithm; setting a suitable traversal depth for the depth-first algorithm makes the traversal of the page structure more efficient.
A regular expression is a logical formula for operating on character strings: predefined specific characters, and combinations of them, form a "rule string" that expresses a filtering logic over character strings. Because URLs are written based on the domain name, this embodiment writes the regular expression according to the domain name so that it contains the character string corresponding to the domain name. The regular expression then matches the URLs in the webpage that are associated with the domain name and filters out unrelated URLs, such as those of advertisements in the webpage. The text string of each node is matched against the regular expression to obtain the URLs in the webpage, which form the first resource locator set.
This embodiment traverses the nodes of the page structure with a depth-first algorithm to obtain the character string of each node, writes a regular expression according to the domain name, and uses it to match the URLs in each node's character string. A single crawl can therefore obtain all related page data in the webpage, which avoids the prior-art situation in which different user data requirements call for repeated crawls of the same webpage, and also avoids crawling unrelated page data.
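The traversal and matching steps described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the `Node` class, the `collect_urls` function, the `max_depth` parameter, and the exact regular expression are assumptions introduced here, and the page structure is modeled as a plain in-memory tree rather than a parsed webpage.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of the page-structure tree."""
    text: str
    children: list[Node] = field(default_factory=list)


def collect_urls(root: Node, domain: str, max_depth: int = 10) -> list[str]:
    """Traverse the tree depth-first and keep URLs that contain the domain."""
    # The pattern embeds the escaped domain string, so URLs of other hosts
    # (for example advertisement links) do not match.
    pattern = re.compile(r"https?://" + re.escape(domain) + r"[^\s\"'<>]*")
    found: list[str] = []
    seen: set[str] = set()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for url in pattern.findall(node.text):
            if url not in seen:
                seen.add(url)
                found.append(url)
        if depth < max_depth:
            # Push children in reverse so the leftmost child is visited first.
            for child in reversed(node.children):
                stack.append((child, depth + 1))
    return found
```

The `max_depth` argument plays the role of the traversal depth mentioned in the text: a smaller value visits fewer nodes at the cost of possibly missing URLs deeper in the tree.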
Based on the above embodiment, persisting the page data to the database comprises: storing each piece of page data and its corresponding URL in the database, and preserving the mapping between each piece of page data and its corresponding URL.
Specifically, each piece of page data and its corresponding URL are stored in the database in a way that preserves the mapping between them. For example, the data can be stored with a Key-Value model in which the Key field is the URL and the Value field is the page data. The Key field may also include the domain name, and the Value field may also include the storage time of the page data, the URL, and so on. HBase can be used as the database.
By storing each piece of page data together with its corresponding URL and preserving the mapping between them, this embodiment ensures that the page data can be found accurately by its URL, and that the database can be updated and managed conveniently according to the URL.
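The Key-Value layout described above can be sketched with an in-memory dictionary standing in for the database. The `PageRecord` and `PageStore` names and their fields are assumptions made for illustration; a real deployment on HBase would use its own client API, which is not shown here.

```python
import time
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Value part of one row: page data plus the optional extra fields."""
    domain: str
    url: str
    page_data: str
    stored_at: float  # storage time, in seconds since the epoch


class PageStore:
    """In-memory stand-in for the database; the row key is the URL."""

    def __init__(self):
        self._rows = {}

    def put(self, domain, url, page_data, now=None):
        # Writing the same URL again overwrites the row, so the mapping
        # between a URL and its page data stays one-to-one.
        stored_at = time.time() if now is None else now
        self._rows[url] = PageRecord(domain, url, page_data, stored_at)

    def get(self, url):
        # Returns None when the URL has never been stored.
        return self._rows.get(url)

    def urls(self):
        return list(self._rows)
```

Keying the rows by URL is what later makes it cheap to check whether a URL has already been crawled and to replace a single stale page.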
Based on the above embodiment, before the first resource locator set is sent to the crawler engine, the method further comprises: matching the URLs in the first resource locator set one by one against the URLs in the database, and removing the successfully matched URLs from the first resource locator set.
Specifically, crawling page data that is already stored in the database again would waste network resources. Before sending the first resource locator set to the crawler engine, this embodiment therefore filters out the URLs in the set that already exist in the database. More specifically, the URLs in the first resource locator set are matched one by one against the URLs in the database, and the URLs found to be identical are removed from the first resource locator set.
By filtering out the URLs that already exist in the database before the first resource locator set is sent to the crawler engine, this embodiment prevents the crawler engine from crawling the same page data repeatedly and saves network resources.
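The filtering step above amounts to a set difference that preserves the order of the candidate URLs. A minimal sketch, assuming exact string equality is the matching rule (the text does not say whether URLs are normalized first) and using a hypothetical function name:

```python
def filter_new_urls(candidates, stored_urls):
    """Return the candidate URLs that are not yet present in the database."""
    stored = set(stored_urls)  # set lookup keeps each membership test O(1)
    return [url for url in candidates if url not in stored]
```

For example, if the first resource locator set holds three URLs and one of them is already a key in the database, only the other two are passed on to the crawler engine.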
Based on the above embodiment, retrieving page data from the database according to the domain name to obtain a search result comprises: matching a second resource locator set in the database according to the domain name, the second resource locator set being the set of URLs in the database that contain the character string corresponding to the domain name; extracting, according to the mapping, the page data corresponding to each URL in the second resource locator set; and generating a document object model from each URL in the second resource locator set and its corresponding page data, and using the document object model as the search result.
Specifically, the URLs that contain the character string corresponding to the domain name are matched in the database; in the Key-Value model, for example, the URLs in the Key field that contain the character string corresponding to the domain name are matched. This matching can also be done with regular expressions: filtering with a regular expression that contains the character string corresponding to the domain name effectively matches the required URLs. The page data corresponding to each successfully matched URL is extracted according to the mapping, a document object model is generated from each URL and its corresponding page data, and the document object model is returned to the user as the search result. Taking a film website as an example, the document object model can take the following JSON form:
{
  "m_domain": "movie.douban.com",
  "m_lab": "film information",
  "m_url_reg": "https://movie.douban.com/subject/*",
  "c1_lab": "title", "c1_dom": "//*[@id='content']/h1/span[1]",
  "c2_lab": "show time", "c2_dom": "//*[@id='content']/h1/span[2]"
}
By presenting the retrieved page data as a model, this embodiment lets ordinary business personnel use the page data acquisition method with ease, which further lowers the threshold for using the method.
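The retrieval by domain name described in this embodiment can be sketched as a regular-expression match over the URL keys of the store. The function name, the dictionary standing in for the database, and the shape of the returned entries are assumptions; building the full document object model with per-field XPath entries is not shown.

```python
import re


def search_by_domain(rows, domain):
    """Return the stored pages whose URL key belongs to the given domain.

    `rows` maps URL -> page data and stands in for the database.
    """
    # Anchor the pattern so the host must equal the domain exactly; a URL
    # whose host merely starts with the domain string does not match.
    pattern = re.compile(r"^https?://" + re.escape(domain) + r"(/|$)")
    return [
        {"url": url, "page": page}
        for url, page in sorted(rows.items())
        if pattern.match(url)
    ]
```

The result is sorted by URL only to make the output deterministic; the text does not prescribe an ordering.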
Based on the above embodiment, after the page data is persisted to the database, the method further comprises: monitoring the storage time of each piece of page data in the database; determining expired page data according to the storage time; sending the URLs corresponding to the expired page data to the crawler engine, so that the crawler engine extracts and returns the page data again; and updating the expired page data with the page data that the crawler engine extracted again and returned.
Specifically, this embodiment updates the page data in the database at a certain expiry frequency. When the page data acquisition device detects that the storage time of a piece of page data is earlier than a predetermined time, it determines that the page data has expired and classifies it as expired page data. The device sends the URL corresponding to the expired page data to the crawler engine, which extracts and returns the page data again. The device then replaces the expired page data with the page data that the crawler engine extracted again and returned, which completes the update and ensures that the user is given the latest, unexpired page data.
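The expiry check and refresh described above can be sketched as follows. The `fetch` callable is a stand-in for the crawler engine, and the function names and the `max_age_seconds` parameter are assumptions introduced for illustration; the text only says that data stored earlier than a predetermined time counts as expired.

```python
def expired_urls(stored_at, now, max_age_seconds):
    """Return the URLs whose storage time is earlier than the cutoff."""
    cutoff = now - max_age_seconds
    return sorted(url for url, ts in stored_at.items() if ts < cutoff)


def refresh_expired(pages, stored_at, fetch, now, max_age_seconds):
    """Re-fetch every expired page and replace the stale copy in place.

    `pages` maps URL -> page data, `stored_at` maps URL -> storage time,
    and `fetch` is a hypothetical callable that plays the crawler engine.
    """
    refreshed = expired_urls(stored_at, now, max_age_seconds)
    for url in refreshed:
        pages[url] = fetch(url)
        stored_at[url] = now  # the new copy is stored at the current time
    return refreshed
```

Passing `now` explicitly instead of reading the clock inside the functions keeps the expiry rule easy to test and to run on a schedule.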
An embodiment of the present invention also provides an apparatus for acquiring page data. With reference to Fig. 4, the apparatus comprises a parsing module 41, an analysis module 42, a data sending module 43, and a data return module 44, wherein:
the parsing module 41 is configured to obtain a page element selected by the user in a webpage, extract the XPath expression corresponding to the page element, and obtain the domain name of the webpage according to the XPath expression;
the analysis module 42 is configured to perform page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of URLs in the webpage;
the data sending module 43 is configured to send the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set;
the data return module 44 is configured to receive the page data returned by the crawler engine and return the page data to the user.
The apparatus of this embodiment can be used to carry out the technical solution of the page data acquisition method embodiment shown in Fig. 2; its implementation principle and technical effect are similar and are not repeated here.
An embodiment of the present invention also provides a device for acquiring page data. With reference to Fig. 5, the device comprises at least one processor 51, at least one memory 52, and a communication bus 53, wherein the processor 51 and the memory 52 communicate with each other through the communication bus 53, the memory 52 stores program instructions executable by the processor 51, and the processor 51 calls the program instructions to perform the methods provided by the above method embodiments, for example: obtaining a page element selected by the user in a webpage, extracting the XPath expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression; performing page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of URLs in the webpage; sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set; and receiving the page data returned by the crawler engine and returning the page data to the user.
An embodiment of the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: obtaining a page element selected by the user in a webpage, extracting the XPath expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression; performing page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of URLs in the webpage; sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set; and receiving the page data returned by the crawler engine and returning the page data to the user.
An embodiment of the present invention also provides a non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the methods provided by the above method embodiments, for example: obtaining a page element selected by the user in a webpage, extracting the XPath expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression; performing page-structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set of URLs in the webpage; sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set; and receiving the page data returned by the crawler engine and returning the page data to the user.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of computer program instructions. The computer program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for acquiring page data, characterized by comprising:
obtaining a page element selected by a user in a webpage, extracting the Extensible Markup Language path language (XPath) expression corresponding to the page element, and obtaining the domain name of the webpage according to the XPath expression;
performing web page structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set formed by the uniform resource locators in the webpage;
sending the first resource locator set to a crawler engine, so that the crawler engine extracts and returns page data according to the first resource locator set;
receiving the page data returned by the crawler engine, and returning the page data to the user.
2. The method according to claim 1, characterized in that, after receiving the page data returned by the crawler engine, the method further comprises:
persisting the page data to a database, and performing page data retrieval on the database according to the domain name to obtain a retrieval result for the page data;
correspondingly, returning the page data to the user comprises:
returning the retrieval result to the user, the retrieval result including the page data.
3. The method according to claim 1, characterized in that performing web page structure analysis on the webpage according to the domain name to obtain the first resource locator set comprises:
determining the root node in the web page structure of the webpage according to the domain name, traversing the nodes in the web page structure by a depth-first search method with the root node as the start node, and obtaining the text string of each node;
determining a regular expression according to the domain name, the regular expression including the character string corresponding to the domain name;
matching the text string of each node against the regular expression to obtain the uniform resource locators in the webpage, the uniform resource locators in the webpage forming the first resource locator set.
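For illustration only, the traversal and matching steps of claim 3 might be sketched in Python as follows. The claim does not fix how the node tree is built, what counts as a node's text string, or the exact form of the regular expression, so `Node`, `TreeBuilder`, `node_strings`, `collect_urls`, and the pattern itself are hypothetical choices made for this sketch, which treats a node's attribute values plus its own text as its text string.

```python
import re
from html.parser import HTMLParser


class Node:
    """One node of the page structure: tag, attributes, own text, children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple node tree from HTML (hypothetical helper)."""

    VOID = {"br", "img", "input", "link", "meta", "hr"}

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:  # void elements never get a closing tag
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def node_strings(root):
    """Depth-first traversal from the root, yielding each node's text string."""
    stack = [root]
    while stack:
        node = stack.pop()
        values = [v for v in node.attrs.values() if v]
        yield " ".join(values + [node.text])
        # Push children reversed so they are visited in document order.
        stack.extend(reversed(node.children))


def collect_urls(html, domain):
    """Return the URLs under `domain` found in the page, without duplicates."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # The regular expression embeds the domain string (assumed pattern shape).
    pattern = re.compile(
        r"https?://(?:[\w-]+\.)*" + re.escape(domain) + r"(?:/[^\s\"'<>]*)?"
    )
    urls = []
    for text in node_strings(builder.root):
        for url in pattern.findall(text):
            if url not in urls:
                urls.append(url)
    return urls
```

The explicit stack keeps the traversal depth-first without recursion limits on deeply nested pages. The pattern is deliberately simple and would need tightening before use on untrusted pages.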
4. The method according to claim 2, characterized in that persisting the page data to the database comprises:
storing each item of page data and its corresponding uniform resource locator in the database, and maintaining the mapping relationship between each item of page data and its corresponding uniform resource locator.
5. The method according to claim 4, characterized in that, before sending the first resource locator set to the crawler engine, the method further comprises:
matching the uniform resource locators in the first resource locator set one by one against the uniform resource locators in the database;
filtering the successfully matched uniform resource locators out of the first resource locator set.
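The storage of claim 4 and the pre-crawl filtering of claim 5 could look roughly like the sketch below. The claims do not name a database product or schema, so SQLite, the `pages` table, its columns, and the function names are all assumptions made for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the page store; the url column keeps the URL-to-page-data mapping."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, data TEXT NOT NULL, stored_at REAL NOT NULL)"
    )
    return conn


def save_page(conn, url, data, stored_at):
    """Persist one item of page data together with its URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, data, stored_at) VALUES (?, ?, ?)",
        (url, data, stored_at),
    )
    conn.commit()


def load_page(conn, url):
    """Look up page data through the URL mapping; None if not stored."""
    row = conn.execute("SELECT data FROM pages WHERE url = ?", (url,)).fetchone()
    return None if row is None else row[0]


def filter_new_urls(conn, urls):
    """Drop URLs that already match a stored URL, preserving input order."""
    new_urls = []
    for url in urls:
        hit = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
        if hit is None:
            new_urls.append(url)
    return new_urls
```

Only the URLs returned by `filter_new_urls` would then be sent to the crawler engine, so pages that are already stored are not fetched again.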
6. The method according to claim 4, characterized in that performing page data retrieval on the database according to the domain name to obtain the retrieval result for the page data comprises:
matching a second resource locator set in the database according to the domain name, the second resource locator set being the set formed by the uniform resource locators in the database that include the character string corresponding to the domain name;
extracting, according to the mapping relationship, the page data corresponding to each uniform resource locator in the second resource locator set;
generating a Document Object Model from each uniform resource locator in the second resource locator set and the page data corresponding to each uniform resource locator, and using the Document Object Model as the retrieval result.
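A minimal sketch of the retrieval in claim 6, assuming the database is represented by a plain dictionary from URL to page data. The element names `result` and `page` and the function name `build_result` are invented for the example, since the claim does not describe the layout of the Document Object Model.

```python
from xml.dom.minidom import Document


def build_result(store, domain):
    """Build a DOM holding every stored page whose URL contains `domain`.

    `store` maps URL -> page data and stands in for the database.
    """
    # Second resource locator set: stored URLs containing the domain string.
    matched = [url for url in store if domain in url]

    doc = Document()
    root = doc.createElement("result")
    root.setAttribute("domain", domain)
    doc.appendChild(root)

    for url in matched:
        page = doc.createElement("page")
        page.setAttribute("url", url)
        # Page data is pulled through the URL-to-data mapping.
        page.appendChild(doc.createTextNode(store[url]))
        root.appendChild(page)
    return doc
```

Calling `build_result(store, "example.com").toxml()` would serialize the retrieval result for return to the user.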
7. The method according to claim 2, characterized in that, after persisting the page data to the database, the method further comprises:
monitoring the storage time of each item of page data in the database;
determining expired page data according to the storage time;
sending the uniform resource locator corresponding to the expired page data to the crawler engine, so that the crawler engine extracts and returns the page data again;
updating the expired page data according to the page data extracted and returned again by the crawler engine.
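The expiry check of claim 7 might be sketched as below. The claim does not say how expiry is decided, so a fixed maximum age (`max_age`, in seconds) is assumed here; `fetch` is a hypothetical callable standing in for the crawler engine, and the two dictionaries stand in for the database.

```python
import time


def find_expired(stored_at, max_age, now=None):
    """Return the URLs whose page data was stored more than `max_age` seconds ago."""
    now = time.time() if now is None else now
    return [url for url, ts in stored_at.items() if now - ts > max_age]


def refresh_expired(pages, stored_at, max_age, fetch, now=None):
    """Re-fetch expired pages and overwrite the stored data and storage time.

    `pages` maps URL -> page data, `stored_at` maps URL -> storage timestamp,
    and `fetch(url)` plays the role of the crawler engine.
    """
    now = time.time() if now is None else now
    expired = find_expired(stored_at, max_age, now)
    for url in expired:
        pages[url] = fetch(url)   # page data extracted again
        stored_at[url] = now      # record the new storage time
    return expired
```

Passing `now` explicitly keeps the check deterministic in tests; a production monitor would run this periodically with the current time.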
8. An apparatus for acquiring page data, characterized by comprising: a parsing module, an analysis module, a data sending module, and a data return module;
the parsing module is configured to obtain a page element selected by a user in a webpage, extract the Extensible Markup Language path language (XPath) expression corresponding to the page element, and obtain the domain name of the webpage according to the XPath expression;
the analysis module is configured to perform web page structure analysis on the webpage according to the domain name to obtain a first resource locator set, the first resource locator set being the set formed by the uniform resource locators in the webpage;
the data sending module is configured to send the first resource locator set to a crawler engine, so that the crawler engine extracts page data according to the first resource locator set;
the data return module is configured to receive the page data returned by the crawler engine and return the page data to the user.
9. A device for acquiring page data, characterized by comprising:
at least one processor, at least one memory, and a communication bus; wherein:
the processor and the memory communicate with each other through the communication bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores a computer program, and the computer program causes a computer to execute the method according to any one of claims 1 to 7.
CN201810442578.0A 2018-05-10 2018-05-10 A kind of acquisition methods of page data, device and equipment Pending CN110472126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810442578.0A CN110472126A (en) 2018-05-10 2018-05-10 A kind of acquisition methods of page data, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810442578.0A CN110472126A (en) 2018-05-10 2018-05-10 A kind of acquisition methods of page data, device and equipment

Publications (1)

Publication Number Publication Date
CN110472126A true CN110472126A (en) 2019-11-19

Family

ID=68504094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810442578.0A Pending CN110472126A (en) 2018-05-10 2018-05-10 A kind of acquisition methods of page data, device and equipment

Country Status (1)

Country Link
CN (1) CN110472126A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297449A (en) * 2021-05-21 2021-08-24 南京大学 Method and system for realizing streaming crawler
CN113515682A (en) * 2021-05-19 2021-10-19 平安国际智慧城市科技股份有限公司 Data crawling method and device, computer equipment and storage medium
WO2022179128A1 (en) * 2021-02-25 2022-09-01 深圳壹账通智能科技有限公司 Crawler-based data crawling method and apparatus, computer device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060294052A1 (en) * 2005-06-28 2006-12-28 Parashuram Kulkami Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages
CN101089856A (en) * 2007-07-20 2007-12-19 李沫南 Method for abstracting network data and web reptile system
CN101908071A (en) * 2010-08-10 2010-12-08 厦门市美亚柏科信息股份有限公司 Method and device thereof for improving search efficiency of search engine
CN106951451A (en) * 2017-02-22 2017-07-14 北京麒麟合盛网络技术有限公司 A kind of webpage content extracting method, device and computing device
CN107609150A (en) * 2017-08-28 2018-01-19 湖北省楚天云有限公司 A kind of interactive network reptile creation method chosen based on page elements and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
金婵鸣等: "搜索引擎系统中网页抓取模块研究", 《现代计算机(专业版)》 *

Similar Documents

Publication Publication Date Title
US10152488B2 (en) Static-analysis-assisted dynamic application crawling architecture
CN109033358B (en) Method for associating news aggregation with intelligent entity
CN101211364B (en) Method and system for social bookmarking of resources exposed in web pages
US20080282186A1 (en) Keyword generation system and method for online activity
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
JP2013508873A (en) Method and system for processing information in an information stream
US20110246462A1 (en) Method and System for Prompting Changes of Electronic Document Content
CN102164186A (en) Method and system for realizing cloud search service
CN109657121A (en) A kind of Web page information acquisition method and device based on web crawlers
CN104133878A (en) User label generation method and device
CN102222098A (en) Method and system for pre-fetching webpage
CN106462406A (en) Interactive viewer of intermediate representations of client side code
CN104239298A (en) Text message recommendation method, server, browser and system
Saad et al. Archiving the web using page changes patterns: a case study
CN102158365A (en) User clustering method and system in weblog mining
CN110472126A (en) A kind of acquisition methods of page data, device and equipment
KR101801257B1 (en) Text-Mining Application Technique for Productive Construction Document Management
CN102760150A (en) Webpage extraction method based on attribute reproduction and labeled path
CN104598536B (en) A kind of distributed network information structuring processing method
Bernaschina et al. A big data analysis framework for model-based web user behavior analytics
Niu et al. Web scraping tool for newspapers and images data using jsonify
CN107798051A (en) Document dbject model affairs crawl device
Sohail Search Engine Optimization Methods & Search Engine Indexing for CMS Applications
CN113901169A (en) Information processing method, information processing device, electronic equipment and storage medium
Omitola et al. Capturing interactive data transformation operations using provenance workflows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191119)