WO2019237547A1 - Data crawling method and apparatus, and computer device and storage medium - Google Patents
- Publication number
- WO2019237547A1 PCT/CN2018/106397 CN2018106397W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- identification information
- website
- network identification
- url
- access
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the field of finance, and in particular, to a data crawling method, device, computer device, and storage medium.
- the traditional information crawling method uses a single IP address to crawl the target website frequently. Because the target website has an anti-crawling mechanism, the number of visits from one IP address within a preset time period is limited. Once the number of visits within that period reaches the preset limit, crawling can resume only in the next preset time period, and the target website may even block the IP address as a malicious IP, so the stability of information crawling is low.
- a data crawling method includes: using network identification information in an identification channel to access a first webpage corresponding to a preset first URL, wherein the network identification information in the identification channel is pre-assigned by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources; if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel, and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL; using the network identification information to access the homepage of the first website corresponding to the domain name, wherein the first website includes more than one second webpage, and each second webpage includes second webpage content; and if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information, traversing each second webpage of the first website.
- a data crawling device includes: a first access module, configured to access a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is dispatched in advance by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources; a first parsing module, configured to parse the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL, if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is not a domain name; a second access module, configured to use the network identification information to access the homepage of a first website corresponding to the domain name, wherein the first website includes more than one second webpage, and each second webpage includes second webpage content; a traversal module, configured to traverse each second webpage of the first website if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information;
- a dispatching module, configured to: if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, use the Tornado asynchronous mechanism to dispatch new network identification information in the identification information database to the identification channel, and trigger the first access module.
- the new network identification information refers to network identification information that has not been assigned to the identification channel.
- a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the steps of the data crawling method when executing the computer-readable instructions.
- one or more non-volatile readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the data crawling method described above.
- FIG. 1 is a schematic diagram of an application environment of a data crawling method according to an embodiment of the present application
- FIG. 2 is a flowchart of a data crawling method according to an embodiment of the present application.
- FIG. 3 is a flowchart of an implementation of obtaining network identification information from a proxy website in a data crawling method provided by an embodiment of the present application
- FIG. 4 is a flowchart of implementing step S10 in a data crawling method provided by an embodiment of the present application
- FIG. 5 is a flowchart of implementation of traversing each web page in a data crawling method provided by an embodiment of the present application
- FIG. 6 is a flowchart of implementation of parsing webpage content in a data crawling method provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a data crawling device according to an embodiment of the present application.
- FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
- the data crawling method provided in this application can be applied in the application environment shown in FIG. 1, where a computer device communicates with a server through a network.
- the server accesses the first webpage corresponding to the preset first URL by using the network identification information that the identification information database assigned in advance to the identification channel. If this access succeeds and the first URL is not a domain name, the server parses the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL, and then uses the network identification information to access the homepage of the first website corresponding to the domain name.
- if the server successfully accesses the first webpage corresponding to the first URL and the first URL is a domain name, or successfully accesses the homepage of the first website corresponding to the domain name, the server traverses each second webpage of the first website. If the traversal succeeds, the server parses the content of the second webpages according to a preset second parsing method to obtain the data that needs to be crawled.
- if the server fails to access the first webpage corresponding to the first URL, or fails to access the homepage of the first website corresponding to the domain name, or fails to traverse each second webpage of the first website, the server uses the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel, and returns to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel.
- the computer device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
- the server can be implemented by an independent server or a server cluster composed of multiple servers.
- a data crawling method is provided.
- the data crawling method is applied in the financial industry.
- taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
- the identification channel refers to a channel that temporarily stores network identification information that can successfully access network resources.
- the network identification information refers to the identification information of a machine in the network, that is, its IP address (Internet Protocol address).
- the identification information database refers to a database specifically used to store network identification information that can successfully access network resources.
- the network identification information in the identification channel is assigned in advance by an identification information base, and the identification information base includes a plurality of network identification information that can successfully access network resources.
- one identification channel has one piece of network identification information that can be used to successfully access network resources.
- the network identification information that can be used to successfully access network resources can be invalidated by external restrictions, that is, the network identification information that can be used to successfully access network resources can be blocked by a website and can no longer access the website.
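- the relationship between the identification information database and an identification channel can be sketched as follows. This is only an illustrative in-memory model: the class names `IdentificationDatabase` and `IdentificationChannel`, the `dispatch_new` and `refill` methods, and the `ip:port` string format are assumptions, and the sketch simplifies "new" to mean an address that has not been dispatched to any channel.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class IdentificationDatabase:
    """Hypothetical in-memory stand-in for the identification information database."""
    addresses: List[str]
    assigned: Set[str] = field(default_factory=set)

    def dispatch_new(self) -> Optional[str]:
        """Return an address that has not been dispatched before, or None."""
        for address in self.addresses:
            if address not in self.assigned:
                self.assigned.add(address)
                return address
        return None


@dataclass
class IdentificationChannel:
    """Temporarily holds the single address currently used for crawling."""
    name: str
    address: Optional[str] = None

    def refill(self, database: IdentificationDatabase) -> bool:
        """Replace the current address with a new one; False if none is left."""
        self.address = database.dispatch_new()
        return self.address is not None


database = IdentificationDatabase(["42.55.173.190:80", "53.34.219.40:8118"])
channel_a = IdentificationChannel("A")
channel_a.refill(database)  # channel A now holds the first address
```

When a website blocks the address held by a channel, calling `refill` again swaps in the next address that has not been used yet.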
- the IP address of an Internet access device is set to the IP address held in the channel that temporarily stores an IP address that can successfully access network resources, and then a browser in the Internet access device is used to access the first webpage corresponding to a preset first URL.
- the preset first URL can be http://www.xinhuanet.com/fortune/2018-02/08/c_129808453.html. The specific content of the preset first URL can be set according to actual application requirements, and is not limited here.
- the domain name refers to the name of a computer or computer group on the Internet, consisting of a series of names separated by dots, used to identify the electronic position of the computer during data transmission.
- the Internet refers to a global network of computers connected to each other using a common language.
- if the first webpage corresponding to the first URL is successfully accessed, and the first URL is not a domain name, that is, not just the dot-separated name of a computer or computer group on the Internet, the first URL is parsed according to the preset first parsing method to obtain the domain name corresponding to the first URL.
- the preset first parsing method may be to directly extract, reading the URL from left to right, the content between the double slash "//" and the first single slash "/" that follows it.
- the specific content of the first analysis method can be set according to actual application requirements, and is not limited here.
- to better understand step S20, an example is described below:
- the Internet access device is a personal computer
- the browser is Internet Explorer
- the identification channel is channel A
- the IP address is "42.55.173.190, port 80"
- the first URL is http://news.163.com/18/0130/12/D9DA7M9S000181BT.html
- the preset first parsing method is to directly extract the content between the double slash "//" and the first single slash "/" in a URL.
- if the IE browser in the personal computer whose IP address is "42.55.173.190, port 80" successfully accesses the first webpage corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, and this URL is not a domain name, the content between the double slash "//" and the first single slash "/" is extracted directly, reading from left to right, so the domain name corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html is news.163.com.
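- the first parsing method described above can be sketched in a few lines. The function names `extract_domain` and `is_domain_only` are hypothetical, and treating a URL as "a domain name" when nothing but an optional trailing slash follows the host is an assumption made for illustration.

```python
def extract_domain(url: str) -> str:
    """Return the text between '//' and the first '/' that follows it."""
    start = url.find("//")
    if start == -1:
        raise ValueError(f"no '//' found in URL: {url!r}")
    start += 2
    end = url.find("/", start)
    # If no further slash exists, everything after '//' is the domain.
    return url[start:] if end == -1 else url[start:end]


def is_domain_only(url: str) -> bool:
    """Assumed test for 'the URL is a domain name': no path after the host."""
    rest = url.split("//", 1)[-1].rstrip("/")
    return "/" not in rest


domain = extract_domain("http://news.163.com/18/0130/12/D9DA7M9S000181BT.html")
```

For production use, `urllib.parse.urlsplit` handles ports, credentials, and missing schemes more robustly than plain string slicing.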
- S30 Use the network identification information to access the homepage of the first website corresponding to the domain name, where the first website includes more than one second web page, and the second web page includes the second web page content;
- the IP address of the Internet access device is set to the IP address in the identification channel, and then the browser on the Internet access device is used to access the homepage of the first website corresponding to the domain name, where the first website includes more than one second webpage, and each second webpage includes second webpage content.
- the first website may be the official website of NetEase News
- the second website includes the first website.
- the specific content of the first website may be set according to actual application requirements, and is not limited here.
- if the browser in the Internet access device whose IP address is the IP address in the identification channel successfully accesses the first webpage corresponding to the first URL and the first URL is a domain name, or successfully accesses the homepage of the first website corresponding to the domain name, each second webpage of the first website is traversed.
- if the browser in the Internet access device whose IP address is the IP address in the identification channel successfully traverses each second webpage of the first website, the content of the second webpage is parsed according to a preset second parsing method to obtain the data that needs to be crawled.
- the preset second parsing method may be parsing a webpage using a JAXP tool.
- JAXP tools are tools for processing XML documents.
- JAXP, short for Java API for XML Processing, refers to a Java application program interface for parsing and validating XML documents.
- An XML document is a markup language document that is used to mark electronic files to make them structured.
- Java refers to an object-oriented programming language.
- the specific content of the preset second analysis method may be set according to actual application requirements, and is not limited here.
- the Tornado asynchronous mechanism is used to assign new network identification information in the identification information database to the identification channel, and the process returns to step S10.
- the new network identification information refers to network identification information that has not been assigned to the identification channel.
- Tornado is open-source Web server software, written in Python, that also provides an asynchronous networking library.
- the Web, short for World Wide Web, refers to a global, dynamically interactive, cross-platform, distributed graphical information system based on hypertext and HTTP.
- HTTP, short for HyperText Transfer Protocol, is the most widely used network protocol on the Internet. It specifies in detail the rules for communication between the browser and the World Wide Web server.
- the Tornado asynchronous mechanism refers to a mechanism in which, when an asynchronous procedure call is issued, the caller does not get the result immediately. After the component that actually handles the call completes, it informs the caller through status, notification, or callback.
- the Tornado asynchronous mechanism is implemented based on AsyncHTTPClient.
- AsyncHTTPClient refers to Tornado's non-blocking HTTP client, which sends requests and processes responses asynchronously.
- if the browser on the Internet access device whose IP address is the IP address in the identification channel fails to access the first webpage corresponding to the first URL, or fails to access the homepage of the first website corresponding to the domain name, or fails to traverse each second webpage of the first website, the Tornado asynchronous mechanism based on AsyncHTTPClient is used to assign a new IP address in the identification information database to the identification channel, and the process returns to step S10.
- the new IP address refers to an IP address that has not been assigned to the identification channel.
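- the retry behaviour of step S60 can be sketched as below. To stay self-contained, the sketch uses Python's standard `asyncio` in place of Tornado and AsyncHTTPClient, and the `fetch` coroutine is a stub rather than a real proxied request; the names `crawl_with_rotation` and `fake_fetch` and the `ip:port` strings are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def crawl_with_rotation(
    url: str,
    pool: List[str],
    fetch: Callable[[str, str], Awaitable[str]],
) -> Optional[str]:
    """Try each address once, moving to a new one whenever access fails."""
    for address in pool:  # every address is assigned to the channel at most once
        try:
            return await fetch(url, address)  # step S10 with the current address
        except OSError:
            continue  # access failed: dispatch the next, not yet used address
    return None  # the pool is exhausted


async def fake_fetch(url: str, address: str) -> str:
    """Stand-in for a real request; only one address is treated as unblocked."""
    await asyncio.sleep(0)  # yield to the event loop as real I/O would
    if address != "121.31.100.15:8123":
        raise ConnectionError(f"{address} is blocked")
    return f"<html>fetched {url} via {address}</html>"


pool = ["42.55.173.190:80", "53.34.219.40:8118", "121.31.100.15:8123"]
page = asyncio.run(crawl_with_rotation("http://news.163.com/", pool, fake_fetch))
```

Because the call is awaited rather than blocking, other crawl tasks on the same event loop can keep running while one address is being replaced.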
- to better understand step S60, an example is described below:
- the IP address includes "42.55.173.190, port 80" and "53.34.219.40, port 8118".
- the first URL is http://news.163.com /18/0130/12/D9DA7M9S000181BT.html
- the domain name is news.163.com
- the new IP address is "121.31.100.15, port 8123"
- the identification information database is the first MySQL database
- the identification channel includes channel A and channel B.
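- if the IE browser on the Internet access device with the IP address "42.55.173.190, port 80" fails to access the first webpage, or fails to access the homepage of the first website corresponding to news.163.com, the new IP address "121.31.100.15, port 8123" in the first MySQL database is assigned to the identification channel, and the process returns to step S10.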
- the first webpage corresponding to the preset first URL is accessed by using the network identification information pre-assigned to the identification channel. If the first webpage is accessed successfully and the first URL is not a domain name, the first URL is parsed to obtain the domain name corresponding to the first URL, and then the network identification information is used to access the homepage of the first website corresponding to the domain name. If the first webpage is accessed successfully and the first URL is a domain name, or the homepage is accessed successfully, each second webpage of the first website is traversed. Next, after it is determined that each second webpage has been traversed successfully, the content of the second webpage is parsed to obtain the data to be crawled.
- each piece of network identification information in the identification information database is network identification information that can successfully access network resources, which ensures the stability of the network identification information and normal, orderly access to network resources, and thereby improves the stability and efficiency of data crawling.
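- the overall branching of steps S10 to S50 can be sketched as one function whose network-facing steps are passed in as callables. The function name `crawl_once`, its parameters, and the use of `urllib.parse.urlsplit` to decide whether the first URL is a bare domain name are assumptions for illustration; a `None` result stands for the failure cases that lead to step S60.

```python
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def crawl_once(
    first_url: str,
    address: str,
    access: Callable[[str, str], bool],
    traverse: Callable[[str, str], Optional[List[str]]],
    parse: Callable[[str], str],
) -> Optional[List[str]]:
    """Run S10 to S50 with one address; None means 'assign a new address' (S60)."""
    if not access(first_url, address):  # S10: access the first webpage
        return None
    parts = urlsplit(first_url)
    if parts.path not in ("", "/"):  # S20: the first URL is not a bare domain
        homepage = f"{parts.scheme}://{parts.netloc}/"
        if not access(homepage, address):  # S30: access the homepage
            return None
    else:
        homepage = first_url
    pages = traverse(homepage, address)  # S40: traverse each second webpage
    if pages is None:
        return None
    return [parse(page) for page in pages]  # S50: parse each second webpage
```

Keeping the access, traversal, and parsing steps as parameters makes the control flow testable without any network access.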
- the data crawling method is applied in the financial industry. As shown in FIG. 3, before step S10, that is, before using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL, the data crawling method further includes:
- S70 Obtain network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;
- an IP address on the second website is extracted from a webpage corresponding to the second website according to a preset extraction manner, where the second website has more than one IP address.
- the preset extraction method can be copying or screenshot
- the second website can be the Xici proxy website
- the Xici proxy website refers to a website that provides domestic and international IP addresses.
- the preset extraction method and the specific content of the second website can be set according to actual application requirements, and there is no limitation here.
- the IP address of the Internet access device is set to the IP address extracted on the second website, and then the browser in the Internet access device is used to access the third webpage corresponding to the preset second URL.
- the second URL may be http://www.xinhuanet.com/, and the specific content of the second URL may be set according to actual application requirements, and is not limited here.
- if the browser on the Internet access device whose IP address is the IP address extracted from the second website successfully accesses the third webpage corresponding to the preset second URL, the IP address extracted from the second website is stored in the identification information database; if it fails to access the third webpage corresponding to the preset second URL, the IP address extracted from the second website is saved to an invalid database.
- the preset second URL is a URL that can be connected normally.
- the network identification information on the second website is obtained and used to access the webpage corresponding to the preset second URL. If the access is successful, the network identification information is saved in the identification information database. In this way, network identification information from all over the world can be obtained over the Internet from the proxy website, which improves the convenience of obtaining network identification information.
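- the validation described above can be sketched as a simple partition of candidate addresses. The function name `sort_candidates` and the injected `can_access` check are hypothetical; in practice `can_access` would try to load the preset second URL through the candidate address, and the two lists would be written to the identification information database and the invalid database respectively.

```python
from typing import Callable, Iterable, List, Tuple


def sort_candidates(
    candidates: Iterable[str],
    can_access: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split candidate addresses into (valid, invalid) using an access check."""
    valid: List[str] = []
    invalid: List[str] = []
    for address in candidates:
        # An address that loads the test page is kept; any other is set aside.
        (valid if can_access(address) else invalid).append(address)
    return valid, invalid


# Stubbed check: pretend only one candidate can reach the preset second URL.
reachable = {"121.31.100.15:8123"}
valid, invalid = sort_candidates(
    ["42.55.173.190:80", "121.31.100.15:8123"],
    lambda address: address in reachable,
)
```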
- step S10 that is, using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL specifically includes the following steps:
- S101 Use network identification information to send an HTTP request to a server corresponding to a preset first URL
- the HTTP request refers to a request message from a client to a server.
- the IP address of the Internet access device is set to the IP address in the identification channel, and then a browser in the Internet access device is used to send an HTTP request to the server corresponding to the preset first URL, where the HTTP request includes the identification information of a target resource, and the identification information uniquely identifies the target resource.
- when the server receives the HTTP request, it verifies the identification information of the target resource in the HTTP request. After the verification is passed, the HTML file of the target resource is fed back to the sender.
- An HTML file is a file that can be read by a variety of web browsers to generate various types of information on a web page.
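- sending the HTTP request of step S101 through a given address can be sketched with Python's standard `urllib.request`. The helper names `build_proxied_request` and `fetch_html` are hypothetical, the sketch assumes the network identification information is an HTTP proxy given as `ip:port`, and it only covers plain `http://` URLs.

```python
import urllib.request


def build_proxied_request(url: str, proxy_address: str) -> urllib.request.Request:
    """Build a GET request that will be routed through the given HTTP proxy."""
    request = urllib.request.Request(url, method="GET")
    request.set_proxy(proxy_address, "http")  # connect to the proxy, not the site
    return request


def fetch_html(url: str, proxy_address: str, timeout: float = 10.0) -> str:
    """Send the request and return the HTML body; raises OSError on failure."""
    request = build_proxied_request(url, proxy_address)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


request = build_proxied_request(
    "http://news.163.com/18/0130/12/D9DA7M9S000181BT.html", "42.55.173.190:80"
)
```

A failed or timed-out call to `fetch_html` raises an `OSError` subclass, which a caller can treat as the "access unsuccessful" case.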
- step S40 traversing each second webpage of the first website specifically includes the following steps:
- each a tag is extracted, where one a tag includes more than one href attribute.
- each href attribute is extracted from each a tag in the HTML page in the first website.
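- extracting the href attribute of each a tag can be sketched with Python's standard `html.parser`. The class name `LinkCollector` and function name `extract_links` are hypothetical, and resolving relative links against a base URL with `urljoin` is an added assumption so that each extracted link can be visited directly.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return the link targets found in the page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/18/0130/12/D9DA7M9S000181BT.html">news</a>'
    '<a href="http://www.xinhuanet.com/">other</a><p>no link</p>'
)
links = extract_links(sample, "http://news.163.com/")
```

Each returned link can then be visited with the network identification information in the identification channel.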
- to better understand steps S401, S402, and S403, an example is described below:
- step S50, that is, parsing the content of the second webpage according to a preset second parsing method to obtain the data to be crawled, specifically includes the following steps:
- the <html> tag information and </html> tag information of the second webpage are removed to obtain an XML document.
- the <html> tag information and </html> tag information refer to HTML tags, and HTML tags refer to the markup tags of the Hypertext Markup Language.
- S502 Parse the XML document to obtain a document object tree in the XML document.
- the document object tree refers to a tree constructed by a Document object.
- the Document object refers to the document of the web page in the browser window.
- the document object tree contains more than one text node information.
- the XML document is parsed to obtain a document object tree in the XML document, where the document object tree includes more than one DOM node information.
- DOM node information refers to a DOM object in an XML document
- a DOM object refers to a collection of nodes or information pieces organized in a hierarchical structure.
- the text node information is DOM node information.
- each DOM node information in the document object tree is extracted.
- S504 The information of each text node is spliced according to a preset splicing method to obtain data to be scraped.
- the DOM node information is spliced according to a preset splicing method to obtain data to be crawled.
- the data information can be spliced in the order from top to bottom or the data information can be spliced in the order from left to right.
- to better understand steps S501, S502, S503, and S504, an example is described below:
- the second webpage is a predefined weather forecast webpage
- the <html> tag information and </html> tag information of the weather webpage are removed to obtain <head><title>Shenzhen</title></head><body><h1>will have rain</h1><p>in the coming week</p></body>
- the document is parsed to obtain a document object tree that includes <title>Shenzhen</title>, <h1>will have rain</h1> and <p>in the coming week</p>
- the text node information <title>Shenzhen</title>, <h1>will have rain</h1> and <p>in the coming week</p> is extracted
- an XML document is obtained by removing tag information of a web page, and then the XML document is parsed to obtain a document object tree of the XML document.
- each text node information in the document object tree is extracted.
- the information of each text node is stitched to obtain the data that needs to be crawled.
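- steps S501 to S504 can be sketched as below. The sketch uses Python's standard `xml.etree.ElementTree` in place of the JAXP tool, wraps the fragment in a hypothetical `<root>` element because the text left after removing the html tags has two top-level elements, and assumes top-to-bottom splicing with a space separator; the function names are illustrative only.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def strip_html_tags(page: str) -> str:
    """S501: remove the opening and closing html tags of the page."""
    return re.sub(r"</?html[^>]*>", "", page, flags=re.IGNORECASE)


def text_nodes(fragment: str) -> List[str]:
    """S502 and S503: parse the document and collect its non-empty text nodes."""
    root = ET.fromstring(f"<root>{fragment}</root>")  # wrapper is an assumption
    return [text.strip() for text in root.itertext() if text.strip()]


def splice(nodes: List[str], separator: str = " ") -> str:
    """S504: join the text nodes in document order, top to bottom."""
    return separator.join(nodes)


page = (
    "<html><head><title>Shenzhen</title></head>"
    "<body><h1>will have rain</h1><p>in the coming week</p></body></html>"
)
data = splice(text_nodes(strip_html_tags(page)))
```

Real webpages are often not well-formed XML, so a strict XML parser like this one will reject them; the sketch only works for pages as clean as the weather example.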
- a data crawling device is provided, and the data crawling device corresponds to the data crawling method in the above embodiment one-to-one.
- the data crawling device includes a first access module 71, a first parsing module 72, a second access module 73, a traversal module 74, a second parsing module 75, a dispatching module 76, a first obtaining module 77, a third access module 78, and a saving module 79.
- the detailed description of each function module is as follows:
- the first access module 71 is configured to access the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, where the network identification information in the identification channel is allocated in advance by the identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;
- the first parsing module 72 is configured to: if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel, and the first URL is not a domain name, parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;
- the second access module 73 is configured to access the homepage of the first website corresponding to the domain name by using the network identification information, wherein the first website includes more than one second webpage, and each second webpage includes second webpage content;
- the traversal module 74 is configured to: if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information, traverse each second webpage of the first website;
- the second parsing module 75 is configured to parse the content of the second webpage according to a preset second parsing method if the traversal of each second webpage of the first website is successful, to obtain the data that needs to be crawled.
- the dispatching module 76 is configured to: if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, use the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel and trigger the first access module 71.
- the new network identification information refers to network identification information that has not been assigned to the identification channel.
- the data crawling device further includes:
- a first obtaining module 77 configured to obtain network identification information on a second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;
- a third access module 78 configured to access the third webpage corresponding to the preset second web address by using the network identification information on the second website;
- the saving module 79 is configured to save the network identification information on the second website to the identification information database if the network identification information on the second website successfully accesses the third webpage corresponding to the preset second URL.
- the first access module 71 includes:
- a sending sub-module 711 configured to send an HTTP request to a server corresponding to a preset first URL by using network identification information
- the traversal module 74 includes:
- An acquisition tag submodule 741 configured to acquire each hyperlink tag of HTML in the first website, where the hyperlink tag includes more than one link target attribute;
- a first extraction submodule 742 configured to extract all link target attributes in each hyperlink tag
- the web page traversing submodule 743 is configured to traverse the second web page corresponding to each link target attribute by using the network identification information.
- the second parsing module 75 includes:
- the parsing document sub-module 752 is used to parse the XML document to obtain a document object tree in the XML document, where the document object tree contains more than one text node information;
- a second extraction submodule 753, configured to extract information of each text node in the document object tree
- the splicing sub-module 754 is configured to splice each text node information according to a preset splicing method to obtain data to be scraped.
- Each module in the above data crawling device may be implemented in whole or in part by software, hardware, or a combination thereof.
- the above-mentioned modules may be embedded, in hardware form, in the processor of the computer device or be independent of it, or may be stored in the memory of the computer device in the form of software, so that the processor can call them and execute the operations corresponding to the above modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
- the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
- the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
- the database of the computer equipment is used to store data related to the data crawling method.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions are executed by a processor to implement a data crawling method.
- a computer device is provided, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
- when the processor executes the computer-readable instructions, the steps of the data crawling method of the foregoing embodiment are implemented, for example, steps S10 to S60 shown in FIG. 2.
- alternatively, when the processor executes the computer-readable instructions, the functions of the modules / units of the data crawling device in the foregoing embodiment are implemented, for example, the functions of modules 71 to 79 shown in FIG. 7. To avoid repetition, details are not repeated here.
- one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the data crawling method in the foregoing method embodiment is implemented, or the functions of the modules / units of the data crawling device in the foregoing device embodiment are implemented. To avoid repetition, details are not repeated here.
- Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM) or external cache memory.
- RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
A data crawling method and apparatus, and a computer device and a storage medium. The method comprises: accessing a first webpage corresponding to a first web address by using network identification information; if the access is successful and the first web address is not a domain name, parsing the first web address to obtain the domain name corresponding to it; accessing the homepage of a first website that corresponds to the domain name; if the access to the first webpage is successful and the first web address is a domain name, or the access to the homepage of the first website that corresponds to the domain name is successful, traversing all second webpages of the first website; if the traversal is successful, parsing the content of the second webpages to obtain the data to be crawled; if the access to the first webpage is unsuccessful, or the access to the homepage of the first website that corresponds to the domain name is unsuccessful, or the traversal of the second webpages is unsuccessful, assigning new network identification information to an identification channel by using Tornado, and returning to the step of accessing the first webpage by using the network identification information, such that the stability of data crawling is improved.
Description
This application is based on, and claims priority to, Chinese invention patent application No. 201810594254.9, filed on June 11, 2018 and entitled "Data Crawling Method, Device, Computer Equipment, and Storage Medium".
The present application relates to the field of finance, and in particular to a data crawling method, apparatus, computer device, and storage medium.
At present, data is becoming increasingly important to companies in the financial industry, which often need to crawl large amounts of useful information from target websites over the network.
The traditional crawling approach uses a single IP address to crawl the target website frequently. Because the first website has an anti-crawling mode that limits the number of visits one IP address may make to the target website within a preset time period, once the number of visits to the first website within that period reaches the preset limit, crawling can resume only in the next preset time period. The target website may even treat the IP address as malicious and block it. As a result, the stability of information crawling is low.
Summary of the Invention
Based on this, it is necessary to address the above technical problem by providing a data crawling method, apparatus, computer device, and storage medium that can improve the stability of data crawling.
A data crawling method includes: accessing a first webpage corresponding to a preset first URL by using network identification information in an identification channel, where the network identification information in the identification channel is assigned in advance from an identification information database, and the identification information database includes a plurality of pieces of network identification information that can successfully access network resources; if accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL; accessing the homepage of a first website corresponding to the domain name by using the network identification information, where the first website includes one or more second webpages and each second webpage includes second webpage content; if accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is a domain name, or accessing the homepage of the first website corresponding to the domain name by using the network identification information succeeds, traversing each second webpage of the first website; if traversing each second webpage of the first website succeeds, parsing the second webpage content according to a preset second parsing method to obtain the data to be crawled; and if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the homepage of the first website corresponding to the domain name by using the network identification information fails, or traversing each second webpage of the first website fails, assigning new network identification information from the identification information database to the identification channel by using the Tornado asynchronous mechanism, and returning to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, where the new network identification information is network identification information that has not previously been assigned to the identification channel.
A data crawling apparatus includes: a first access module, configured to access a first webpage corresponding to a preset first URL by using network identification information in an identification channel, where the network identification information in the identification channel is assigned in advance from an identification information database, and the identification information database includes a plurality of pieces of network identification information that can successfully access network resources; a first parsing module, configured to, if accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is not a domain name, parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL; a second access module, configured to access the homepage of a first website corresponding to the domain name by using the network identification information, where the first website includes one or more second webpages and each second webpage includes second webpage content; a traversal module, configured to, if accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is a domain name, or accessing the homepage of the first website corresponding to the domain name by using the network identification information succeeds, traverse each second webpage of the first website; a second parsing module, configured to, if traversing each second webpage of the first website succeeds, parse the second webpage content according to a preset second parsing method to obtain the data to be crawled; and an assigning module, configured to, if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the homepage of the first website corresponding to the domain name by using the network identification information fails, or traversing each second webpage of the first website fails, assign new network identification information from the identification information database to the identification channel by using the Tornado asynchronous mechanism and trigger the first access module, where the new network identification information is network identification information that has not previously been assigned to the identification channel.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above data crawling method when executing the computer-readable instructions.
One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above data crawling method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the data crawling method in an embodiment of the present application;
FIG. 2 is a flowchart of the data crawling method in an embodiment of the present application;
FIG. 3 is a flowchart of obtaining network identification information from a proxy website in the data crawling method provided by an embodiment of the present application;
FIG. 4 is a flowchart of implementing step S10 in the data crawling method provided by an embodiment of the present application;
FIG. 5 is a flowchart of traversing each webpage in the data crawling method provided by an embodiment of the present application;
FIG. 6 is a flowchart of parsing webpage content in the data crawling method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the data crawling apparatus in an embodiment of the present application;
FIG. 8 is a schematic diagram of the computer device in an embodiment of the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
The data crawling method provided in this application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a server through a network. First, the server side accesses the first webpage, on the client, corresponding to the preset first URL by using network identification information in the identification channel that was assigned in advance from the identification information database. If that access succeeds and the first URL is not a domain name, the server side parses the first URL according to the preset first parsing method to obtain the domain name corresponding to the first URL, and then uses the network identification information to access the homepage, on the client, of the first website corresponding to the domain name. If the access to the first webpage succeeds and the first URL is a domain name, or the access to the homepage of the first website succeeds, the server side traverses each second webpage of the first website on the client. After determining that the traversal has succeeded, the server side parses the second webpage content according to the preset second parsing method to obtain the data to be crawled. Finally, if accessing the first webpage fails, or accessing the homepage of the first website fails, or traversing the second webpages fails, the server side uses the Tornado asynchronous mechanism to assign new network identification information from the identification information database to the identification channel and returns to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel. The computer device may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server may be implemented as a standalone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a data crawling method is provided. The method is applied in the financial industry and is described here, by way of example, as applied to the server side in FIG. 1. It includes the following steps:
S10: Access the first webpage corresponding to the preset first URL by using the network identification information in the identification channel.
In the embodiments of the present application, the identification channel is a channel that temporarily stores network identification information that can successfully access network resources. Network identification information is the information that identifies a machine on the network, that is, its IP address (Internet Protocol address). The identification information database is a database dedicated to storing network identification information that can successfully access network resources. The network identification information in the identification channel is assigned in advance from the identification information database, which includes a plurality of pieces of network identification information that can successfully access network resources.
It should be noted that there may be multiple identification channels, and each identification channel holds one piece of in-use network identification information that can successfully access network resources. In-use network identification information may become invalid because of external restrictions; for example, a website may block it, after which it can no longer be used to access that website.
Specifically, the IP address of the Internet access device is first set to the IP address held in the channel that temporarily stores IP addresses that can successfully access network resources; a browser on that device is then used to access the first webpage corresponding to the preset first URL. The IP address in the channel is assigned in advance from the identification information database, which includes a plurality of IP addresses that can successfully access network resources.
It should be noted that the preset first URL may be, for example, http://www.xinhuanet.com/fortune/2018-02/08/c_129808453.html. The specific preset first URL can be set according to actual application requirements and is not limited here.
S20: If accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is not a domain name, parse the first URL according to the preset first parsing method to obtain the domain name corresponding to the first URL.
In the embodiments of the present application, a domain name is the name of a computer or group of computers on the Internet, consisting of a series of names separated by dots, and is used to identify the electronic location of the computer during data transmission. The Internet is the global network formed by connecting computers that communicate with one another using a common language.
Specifically, if the browser on the Internet access device whose IP address is the IP address in the identification channel successfully accesses the first webpage corresponding to the first URL, and the first URL is not a domain name (that is, not the name of a computer or group of computers on the Internet consisting of a series of dot-separated names), the first URL is parsed according to the preset first parsing method to obtain the domain name corresponding to the first URL.
It should be noted that the preset first parsing method may be to read the URL from left to right and directly extract the content between the double slash "//" and the first single slash "/" that follows it. The specific preset first parsing method can be set according to actual application requirements and is not limited here.
To better understand step S20, an example is given below:
For example, assume that the Internet access device is a personal computer, the browser is Internet Explorer, the identification channel is channel A, the IP address is "42.55.173.190, port 80", the first URL is http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, and the preset first parsing method is to directly extract the content between the double slash "//" and the first single slash "/" in a URL. Then, if Internet Explorer on the personal computer whose IP address is "42.55.173.190, port 80" successfully accesses the first webpage corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, and that URL is not itself a domain name, the content between the double slash "//" and the first single slash "/" is extracted from the URL, read from left to right, which yields the corresponding domain name news.163.com.
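The first parsing method in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: the function name `extract_domain` is hypothetical, and the behaviour when no further slash follows the host (returning the rest of the string) is an assumption the text does not cover.

```python
def extract_domain(url: str) -> str:
    """Return the text between the first "//" and the next "/" in url."""
    start = url.find("//")
    if start == -1:
        raise ValueError(f"no '//' found in {url!r}")
    start += 2  # skip past the double slash itself
    end = url.find("/", start)
    # Assumption: if no further slash exists, the remainder is the domain.
    return url[start:] if end == -1 else url[start:end]


if __name__ == "__main__":
    print(extract_domain("http://news.163.com/18/0130/12/D9DA7M9S000181BT.html"))
```

A production crawler would more likely rely on a URL parsing library, which also handles ports and user information; the plain string search is used here only because it mirrors the rule as the text states it.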
S30: Access the homepage of the first website corresponding to the domain name by using the network identification information, where the first website includes one or more second webpages and each second webpage includes second webpage content.
Specifically, the IP address of the Internet access device is first set to the IP address in the identification channel, and a browser on that device is then used to access the homepage of the first website corresponding to the domain name, where the first website includes one or more second webpages and each second webpage includes second webpage content.
It should be noted that the first website may be, for example, the official NetEase News website, and that the second webpages include the first webpage. The specific first website can be set according to actual application requirements and is not limited here.
S40: If accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel succeeds and the first URL is a domain name, or accessing the homepage of the first website corresponding to the domain name by using the network identification information succeeds, traverse each second webpage of the first website.
Specifically, if the browser on the Internet access device whose IP address is the IP address in the identification channel successfully accesses the first webpage corresponding to the first URL and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using that IP address, each second webpage of the first website is traversed.
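The apparatus description states that traversal starts by collecting the link target attribute of each hyperlink tag in the site's HTML. A minimal sketch of that collection step, using only Python's standard library, is shown below. The class and function names are hypothetical, and resolving relative links against the page URL is an assumption the text does not spell out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: relative targets are resolved against the page URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html: str, base_url: str) -> list:
    """Return the link targets found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Each returned link would then be fetched with the same network identification information to visit the corresponding second webpage.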
S50: If traversing each second webpage of the first website succeeds, parse the second webpage content according to the preset second parsing method to obtain the data to be crawled.
Specifically, if the browser on the Internet access device whose IP address is the IP address in the identification channel successfully traverses each second webpage of the first website, the second webpage content is parsed according to the preset second parsing method to obtain the data to be crawled.
It should be noted that the preset second parsing method may be to parse a webpage using the JAXP tool, a tool for processing XML documents. JAXP (Java API for XML Processing) is a Java application programming interface for parsing and validating XML documents. An XML document is a markup language document used to mark up electronic files so that they are structured. Java is an object-oriented programming language. The specific preset second parsing method can be set according to actual application requirements and is not limited here.
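The apparatus description breaks the second parsing method into three parts: parse the XML document into a document object tree, extract the text nodes, and splice them according to a preset rule. The text names JAXP (Java) for this. The sketch below is a Python stand-in using `xml.dom.minidom`, not JAXP itself; the function names and the newline separator used for splicing are assumptions for illustration.

```python
from xml.dom import minidom


def extract_text_nodes(xml_text: str) -> list:
    """Parse xml_text into a DOM tree and return its non-empty text nodes."""
    document = minidom.parseString(xml_text)
    texts = []

    def walk(node):
        if node.nodeType == node.TEXT_NODE:
            stripped = node.data.strip()
            if stripped:  # skip whitespace-only nodes between elements
                texts.append(stripped)
        for child in node.childNodes:
            walk(child)

    walk(document)
    return texts


def splice(texts: list, separator: str = "\n") -> str:
    """Join the text nodes; the separator is an assumed 'preset splicing method'."""
    return separator.join(texts)
```

For the input `<page><title>Fund news</title><p>Line one</p></page>`, `extract_text_nodes` returns the two text values in document order, and `splice` joins them into the crawled data.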
S60: If accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the homepage of the first website corresponding to the domain name by using the network identification information fails, or traversing each second webpage of the first website fails, assign new network identification information from the identification information database to the identification channel by using the Tornado asynchronous mechanism, and return to step S10. The new network identification information is network identification information that has not previously been assigned to the identification channel.
In the embodiments of the present application, Tornado is an open-source version of a Web server software. The Web (World Wide Web) is a global, dynamically interactive, cross-platform distributed graphical information system based on hypertext and HTTP. HTTP (HyperText Transfer Protocol) is the most widely used network protocol on the Internet; it specifies in detail the rules by which browsers and World Wide Web servers communicate. The Tornado asynchronous mechanism is a mechanism in which, after an asynchronous call is issued, the caller does not get the result immediately; once the component that actually handles the call has finished, it notifies the caller through status, notification, and callback.
Further, the Tornado asynchronous mechanism is implemented based on AsyncHTTPClient, which the application describes as an asynchronous framework that uses a thread pool to process and send requests.
具体地,若采用IP地址为标识频道中的IP地址的上网设备中的浏览器访问第一网址对应的第一网页不成功,或者采用IP地址为标识频道中的IP地址的上网设备中的浏览器访问域名对应的第一网站的首页不成功,或者采用IP地址为标识频道中的IP地址的上网设备中的浏览器遍历第一网站的各个第二网页不成功,则采用基于AsyncHTTPClient实现的Tornado异步机制分派标识信息库中的新的IP地址至标识频道,返回执行步骤S10,新的IP地址是指未分派过至标识频道的IP地址。Specifically, if the browser on the Internet device using the IP address as the IP address in the identification channel fails to access the first webpage corresponding to the first URL, or the browsing on the Internet device using the IP address as the IP address in the identification channel is not successful If the browser fails to access the homepage of the first website corresponding to the domain name, or the browser on the Internet device that uses the IP address as the IP address in the identification channel fails to traverse each second page of the first website, Tornado based on AsyncHTTPClient is used. The asynchronous mechanism assigns a new IP address in the identification information database to the identification channel, and returns to step S10. The new IP address refers to an IP address that has not been assigned to the identification channel.
为了更好地理解步骤S60,下面通过一个例子进行说明,具体表述如下:In order to better understand step S60, an example is described below, and the specific expression is as follows:
For example, suppose the Internet access device is a personal computer, the browser is Internet Explorer (IE), the IP addresses include "42.55.173.190, port 80" and "53.34.219.40, port 8118", the first URL is http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, the domain name is news.163.com, the new IP address is "121.31.100.15, port 8123", the identification information database is a first MySQL database, and the identification channel includes channel A and channel B. Then, if the IE browser on the device whose IP address is set to "42.55.173.190, port 80" from channel A fails to access the first web page corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, or fails to access the home page of the first website corresponding to news.163.com, or fails to traverse the second web pages of the first website, then, without waiting for "53.34.219.40, port 8118" in channel B to return, the Tornado asynchronous mechanism implemented based on AsyncHTTPClient assigns "121.31.100.15, port 8123" from the first MySQL database to channel A, and the process returns to step S10. Here "121.31.100.15, port 8123" is an IP address that has not previously been assigned to channel A.
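The reassignment rule of step S60 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `ProxyPool` class, the `replace_on_failure` helper, and the "host:port" string format are hypothetical names introduced here, an in-memory list stands in for the identification information database, and Tornado's asynchronous dispatch is left out so that only the "assign an entry that was never assigned before" logic is shown.

```python
from typing import Dict, Iterable, List, Optional, Set


class ProxyPool:
    """Hypothetical stand-in for the identification information database."""

    def __init__(self, proxies: Iterable[str]) -> None:
        self._proxies: List[str] = list(proxies)
        self._assigned: Set[str] = set()  # entries already handed to a channel

    def mark_assigned(self, proxy: str) -> None:
        """Record that `proxy` has already been assigned to the channel."""
        self._assigned.add(proxy)

    def assign_new(self) -> Optional[str]:
        """Return an entry that was never assigned before, or None if none is left."""
        for proxy in self._proxies:
            if proxy not in self._assigned:
                self._assigned.add(proxy)
                return proxy
        return None


def replace_on_failure(pool: ProxyPool, channels: Dict[str, str],
                       channel: str, succeeded: bool) -> str:
    """Keep the channel's entry on success; on failure swap in a new one."""
    if not succeeded:
        new_proxy = pool.assign_new()
        if new_proxy is not None:
            channels[channel] = new_proxy  # other channels are left untouched
    return channels[channel]


# Values taken from the example above.
pool = ProxyPool(["42.55.173.190:80", "53.34.219.40:8118", "121.31.100.15:8123"])
channels = {"A": "42.55.173.190:80", "B": "53.34.219.40:8118"}
for in_use in channels.values():
    pool.mark_assigned(in_use)
print(replace_on_failure(pool, channels, "A", succeeded=False))
```

In this sketch a failure on channel A replaces only channel A's entry; channel B keeps working with its own entry, which mirrors the point that the reassignment does not wait for channel B to return.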
In the embodiment corresponding to FIG. 2, first, the first web page corresponding to the preset first URL is accessed using the network identification information previously assigned to the identification channel. If the first web page is accessed successfully and the first URL is not a domain name, the first URL is parsed to obtain the corresponding domain name, and the network identification information is then used to access the home page of the first website corresponding to that domain name. If the first web page is accessed successfully and the first URL is a domain name, or if the home page is accessed successfully, the second web pages of the first website are traversed. Once the traversal is determined to have succeeded, the content of the second web pages is parsed to obtain the data to be crawled. Finally, if accessing the first web page fails, or accessing the home page fails, or traversing the second web pages fails, the Tornado asynchronous mechanism is used to assign new network identification information to the identification channel, and the process returns to step S10. In this way, as soon as one of the pieces of network identification information in use becomes invalid, a new one is assigned. Because the new network identification information comes from the identification information database, and every entry in that database is network identification information that can successfully access network resources, the stability of the network identification information is ensured and network resources can be accessed in a normal and orderly manner, which improves the stability and efficiency of data crawling.
In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 3, before step S10, that is, before the network identification information in the identification channel is used to access the first web page corresponding to the preset first URL, the data crawling method further includes:
S70: Obtain the network identification information on a second website from a web page corresponding to the second website, where the second website holds one or more pieces of network identification information;
Specifically, the IP addresses on the second website are extracted from the web page corresponding to the second website according to a preset extraction method, where the second website holds one or more IP addresses.
It should be noted that the preset extraction method may be copying or taking a screenshot, and the second website may be the Xici proxy website (西刺代理), a website dedicated to providing domestic and foreign IP addresses. The preset extraction method and the specific second website may be set according to actual application requirements and are not limited here.
S80: Use the network identification information on the second website to access a third web page corresponding to a preset second URL;
Specifically, the IP address of the Internet access device is first set to the IP address extracted from the second website, and the browser on that device is then used to access the third web page corresponding to the preset second URL.
It should be noted that the second URL may be http://www.xinhuanet.com/. The specific second URL may be set according to actual application requirements and is not limited here.
S90: If the third web page corresponding to the preset second URL is accessed successfully using the network identification information on the second website, save that network identification information to the identification information database;
Specifically, if the browser on the Internet access device whose IP address is set to the IP address extracted from the second website successfully accesses the third web page corresponding to the preset second URL, the extracted IP address is saved to the identification information database; if the access fails, the extracted IP address is saved to an invalid database.
It should be noted that the preset second URL is a URL that can be connected to normally.
In the embodiment corresponding to FIG. 3, network identification information is obtained from a website and used to access the web page corresponding to a preset URL; if the access succeeds, the network identification information is saved to the identification information database. Network identification information from around the world can thus be obtained over the Internet from websites that provide proxy network identification information, which makes obtaining network identification information more convenient.
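Steps S70 to S90 amount to probing each candidate and sorting it into one of two stores. The following Python sketch shows that sorting under stated assumptions: `validate_proxies` and the `probe` callback are hypothetical names, the two returned lists stand in for the identification information database and the invalid database, and the actual network probe is injected by the caller so that the logic can be exercised without network access.

```python
from typing import Callable, Iterable, List, Tuple

# Example second URL mentioned in the description.
DEFAULT_TEST_URL = "http://www.xinhuanet.com/"


def validate_proxies(
    candidates: Iterable[str],
    probe: Callable[[str, str], bool],
    test_url: str = DEFAULT_TEST_URL,
) -> Tuple[List[str], List[str]]:
    """Split candidates into (valid, invalid) according to `probe`.

    `probe(proxy, url)` is a caller-supplied function (an assumption of this
    sketch) that returns True when `url` could be fetched through `proxy`.
    """
    valid: List[str] = []
    invalid: List[str] = []
    for proxy in candidates:
        try:
            ok = probe(proxy, test_url)
        except Exception:
            ok = False  # any error during probing counts as a failed access
        (valid if ok else invalid).append(proxy)
    return valid, invalid


# Usage with a fake probe that only accepts one address.
good, bad = validate_proxies(
    ["219.149.46.151:3129", "10.0.0.1:80"],
    probe=lambda proxy, url: proxy.startswith("219."),
)
print(good, bad)
```

Keeping the failed candidates in a separate list, rather than discarding them, follows the description's use of an invalid database.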
In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 4, step S10, that is, using the network identification information in the identification channel to access the first web page corresponding to the preset first URL, specifically includes the following steps:
S101: Use the network identification information to send an HTTP request to the server corresponding to the preset first URL;
In the embodiments of the present application, an HTTP request is a request message sent from a client to a server.
Specifically, the IP address of the Internet access device is first set to the IP address in the identification channel, and the browser on that device is then used to send an HTTP request to the server corresponding to the preset first URL. The HTTP request includes identification information of a target resource, and this identification information uniquely identifies the target resource.
It should be noted that when the server receives the HTTP request, it verifies the identification information of the target resource in the request and, once the verification passes, returns the HTML file of the target resource to the sender. An HTML file is a file that can be read by a variety of web browsers and rendered as a web page conveying various kinds of information.
S102: If the HTML file returned by the server in response to the HTTP request is received, determine that accessing the first web page corresponding to the preset first URL using the network identification information in the identification channel has succeeded;
Specifically, if the HTML file of the target resource in the verified HTTP request is received from the server corresponding to the preset first URL, it is determined that accessing the first web page corresponding to the preset first URL using the IP address in the identification channel has succeeded; if no such HTML file is received, it is determined that the access has failed.
In the embodiment corresponding to FIG. 4, an HTTP request is sent to the server corresponding to the preset URL using the network identification information, and if the HTML file returned by the server is received, access to the web page corresponding to the preset URL is determined to be successful. The amount of web page content to be crawled can thus be learned by previewing, and the time needed to crawl it can be predicted from that amount, so that the completion time of the crawl is known in advance and the progress of data crawling can be ensured.
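The success test of steps S101 and S102 can be sketched with the Python standard library. This is an illustration only: `fetch_via_proxy` and `access_succeeded` are hypothetical names, the proxy is assumed to be given as a "host:port" string for plain HTTP, and "an HTML file was received" is approximated here by a response whose Content-Type mentions HTML.

```python
import http.client
import urllib.request
from typing import Callable, Optional


def fetch_via_proxy(url: str, proxy: str, timeout: float = 10.0) -> Optional[str]:
    """Send an HTTP GET through `proxy` ("host:port"); return the HTML body or None."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            if "html" not in content_type.lower():
                return None  # something came back, but not an HTML file
            return response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None


def access_succeeded(
    url: str,
    proxy: str,
    fetch: Callable[[str, str], Optional[str]] = fetch_via_proxy,
) -> bool:
    """S102: the access counts as successful only if an HTML file was received."""
    return fetch(url, proxy) is not None
```

The `fetch` parameter lets a caller substitute a fake fetcher, for example `access_succeeded(url, proxy, fetch=lambda u, p: "<html></html>")`, so the decision rule can be checked without touching the network.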
In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 5, step S40, that is, traversing the second web pages of the first website, specifically includes the following steps:
S401: Obtain each hyperlink tag in the HTML of the first website;
In the embodiments of the present application, HTML, the Hypertext Markup Language, refers to text within a page that can contain non-text elements such as images, links, and even music and programs. The a tag is an HTML tag; it is the hyperlink tag, used to link from one page to another. A hyperlink tag includes one or more link target attributes. The link target attribute is the href attribute, which specifies the URL of the hyperlink target. A URL, short for Uniform Resource Locator, is the address of a standard resource on the Internet and is a concise representation of the location of a resource available on the Internet and of the method for accessing it.
Specifically, each a tag is extracted from the HTML pages of the first website, where an a tag includes one or more href attributes.
S402: Extract all link target attributes from each hyperlink tag;
Specifically, each href attribute is extracted from each a tag in the HTML pages of the first website.
S403: Use the network identification information to traverse the second web page corresponding to each link target attribute;
Specifically, the IP address of the Internet access device is first set to the IP address in the identification channel, and the browser on that device is then used to traverse the second web page corresponding to each href attribute of each a tag in the HTML pages of the first website.
To better understand steps S401, S402, and S403, an example is given below:
For example, suppose the first website is Xinhuanet, one a tag in its HTML is <a href="http://mongolian.news.cn/" target="_blank" title=""></a>, the Internet access device is a personal computer, the identification channel is channel D, the IP address is "219.149.46.151, port 3129", and the browser is Internet Explorer (IE). First, <a href="http://mongolian.news.cn/" target="_blank" title=""></a> is obtained from the HTML page on Xinhuanet. Then http://mongolian.news.cn/ is extracted from that tag, and the IP address of the personal computer is set to "219.149.46.151, port 3129" from channel D. Finally, the IE browser on the personal computer is used to traverse the second web page corresponding to http://mongolian.news.cn/.
In the embodiment corresponding to FIG. 5, each a tag in the HTML of the website is obtained, all href attributes are extracted from the a tags, and the web page corresponding to each href attribute is traversed. Traversing every page makes it possible to browse the web page content without omission, which improves the completeness of the data to be crawled.
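Steps S401 and S402 can be sketched with Python's standard `html.parser` module. The `HrefCollector` class and `extract_links` function are hypothetical names for this illustration; resolving relative links against a base URL and removing duplicates are additions of the sketch rather than something the description specifies.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # S401: only hyperlink tags are of interest
            return
        for name, value in attrs:
            if name == "href" and value:  # S402: take the link target attribute
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return the link targets found in `html`, in page order, without duplicates."""
    collector = HrefCollector(base_url)
    collector.feed(html)
    return list(dict.fromkeys(collector.links))


# The a tag from the example above.
page = '<a href="http://mongolian.news.cn/" target="_blank" title=""></a>'
print(extract_links(page, "http://www.xinhuanet.com/"))
```

Step S403 would then request each returned URL through the proxy currently held by the identification channel.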
In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 6, step S50, that is, parsing the content of the second web page according to the preset second parsing method to obtain the data to be crawled, specifically includes the following steps:
S501: Remove the tag information of the second web page to obtain an XML document.
Specifically, the <html> tag and the </html> tag of the second web page are removed to obtain an XML document. The <html> and </html> tags are HTML tags, that is, Hypertext Markup Language tags.
S502: Parse the XML document to obtain the document object tree of the XML document.
In the embodiments of the present application, the document object tree is a tree built from the Document object. The Document object refers to the document of the web page in the browser window. The document object tree contains one or more pieces of text node information.
Specifically, the XML document is parsed to obtain the document object tree of the XML document, where the document object tree contains one or more pieces of DOM node information.
It should be noted that DOM node information refers to a DOM object in the XML document, and a DOM object is a collection of nodes or pieces of information organized in a hierarchical structure.
S503: Extract each piece of text node information from the document object tree.
In this embodiment, the text node information is DOM node information.
Specifically, each piece of DOM node information is extracted from the document object tree.
S504: Splice the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
Specifically, the pieces of DOM node information are spliced according to the preset splicing method to obtain the data to be crawled.
It should be noted that the preset splicing method may be splicing the data in top-to-bottom order or in left-to-right order.
To better understand steps S501, S502, S503, and S504, an example is given below:
For example, suppose the second web page is a predefined weather forecast page. Removing the <html> tag and the </html> tag of the weather forecast page gives <head><title>Shenzhen</title></head><body><h1>will have rain</h1><p>in the coming week</p></body>. Parsing this gives <title>Shenzhen</title>, <h1>will have rain</h1>, and <p>in the coming week</p>. Next, "Shenzhen", "will have rain", and "in the coming week" are extracted from these nodes and spliced in top-to-bottom order to obtain "Shenzhen will have rain in the coming week". Here <head>, </head>, <title>, </title>, <body>, </body>, <h1>, </h1>, <p>, and </p> are HTML tags.
In the embodiment corresponding to FIG. 6, the tag information of the web page is first removed to obtain an XML document; the XML document is then parsed to obtain its document object tree; next, each piece of text node information is extracted from the document object tree and the pieces are spliced to obtain the data to be crawled. By converting the XML document into a simple document object tree, loading that tree into memory, and then operating on it through easily handled DOM objects, DOM node information can be parsed simply and efficiently, which increases the speed of data crawling.
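Steps S501 to S504 can be sketched in Python on the weather forecast example. The description names JAXP, a Java API; this sketch swaps in Python's standard `xml.etree.ElementTree` instead, and `extract_text` is a hypothetical name. One further assumption is needed: once the <html> tags are removed, the example has two top-level elements (<head> and <body>), which is not a well-formed XML document on its own, so the sketch wraps the remainder in a synthetic <root> element before parsing. It only works on markup that is otherwise well-formed.

```python
import re
import xml.etree.ElementTree as ET


def extract_text(page: str, separator: str = " ") -> str:
    """Strip the <html> tags, parse the rest as XML, and splice the text nodes."""
    # S501: remove the opening and closing <html> tags.
    inner = re.sub(r"</?html\b[^>]*>", "", page, flags=re.IGNORECASE)
    # S502: parse into a tree; <root> is a wrapper added by this sketch.
    root = ET.fromstring(f"<root>{inner}</root>")
    # S503: collect text nodes in document (top-to-bottom) order.
    pieces = [text.strip() for text in root.itertext() if text.strip()]
    # S504: splice the pieces together.
    return separator.join(pieces)


page = (
    "<html><head><title>Shenzhen</title></head>"
    "<body><h1>will have rain</h1><p>in the coming week</p></body></html>"
)
print(extract_text(page))  # prints: Shenzhen will have rain in the coming week
```

Real-world HTML is frequently not well-formed XML, in which case `ET.fromstring` raises `ParseError`; a production crawler would more likely use a lenient HTML parser for this step.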
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In one embodiment, a data crawling apparatus is provided, and the data crawling apparatus corresponds one-to-one to the data crawling method in the above embodiments. As shown in FIG. 7, the data crawling apparatus includes a first access module 71, a first parsing module 72, a second access module 73, a traversal module 74, a second parsing module 75, an assignment module 76, a first obtaining module 77, a third access module 78, and a saving module 79. The functional modules are described in detail as follows:
The first access module 71 is configured to access the first web page corresponding to the preset first URL using the network identification information in the identification channel, where the network identification information in the identification channel is assigned in advance from the identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;
The first parsing module 72 is configured to, if the first web page corresponding to the first URL is accessed successfully using the network identification information in the identification channel and the first URL is not a domain name, parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;
The second access module 73 is configured to access the home page of the first website corresponding to the domain name using the network identification information, where the first website includes one or more second web pages and each second web page includes second web page content;
The traversal module 74 is configured to traverse the second web pages of the first website if the first web page corresponding to the first URL is accessed successfully using the network identification information in the identification channel and the first URL is a domain name, or if the home page of the first website corresponding to the domain name is accessed successfully using the network identification information;
The second parsing module 75 is configured to, if the second web pages of the first website are traversed successfully, parse the second web page content according to the preset second parsing method to obtain the data to be crawled.
The assignment module 76 is configured to, if accessing the first web page corresponding to the first URL with the network identification information fails, or accessing the home page of the first website corresponding to the domain name with the network identification information fails, or traversing the second web pages of the first website fails, use the Tornado asynchronous mechanism to assign new network identification information from the identification information database to the identification channel and trigger the first access module 71. The new network identification information is network identification information that has not previously been assigned to the identification channel.
Further, for use before the network identification information in the identification channel is used to access the first web page corresponding to the preset first URL, the data crawling apparatus further includes:
The first obtaining module 77, configured to obtain the network identification information on the second website from a web page corresponding to the second website, where the second website holds one or more pieces of network identification information;
The third access module 78, configured to access the third web page corresponding to the preset second URL using the network identification information on the second website;
The saving module 79, configured to save the network identification information on the second website to the identification information database if the third web page corresponding to the preset second URL is accessed successfully using that network identification information.
Further, the first access module 71 includes:
A sending submodule 711, configured to send an HTTP request to the server corresponding to the preset first URL using the network identification information;
A determining submodule 712, configured to determine, if the HTML file returned by the server in response to the HTTP request is received, that accessing the first web page corresponding to the preset first URL using the network identification information in the identification channel has succeeded.
Further, the traversal module 74 includes:
A tag obtaining submodule 741, configured to obtain each hyperlink tag in the HTML of the first website, where a hyperlink tag includes one or more link target attributes;
A first extraction submodule 742, configured to extract all link target attributes from each hyperlink tag;
A web page traversal submodule 743, configured to traverse the second web page corresponding to each link target attribute using the network identification information.
Further, the second parsing module 75 includes:
A removal submodule 751, configured to remove the tag information of the second web page to obtain an XML document;
A document parsing submodule 752, configured to parse the XML document to obtain the document object tree of the XML document, where the document object tree contains one or more pieces of text node information;
A second extraction submodule 753, configured to extract each piece of text node information from the document object tree;
A splicing submodule 754, configured to splice the pieces of text node information according to the preset splicing method to obtain the data to be crawled.
For the specific limitations of the data crawling apparatus, refer to the limitations of the data crawling method above; they are not repeated here. Each module of the data crawling apparatus may be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储数据爬取方法有关的数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种数据爬取方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium. The database of the computer equipment is used to store data related to the data crawling method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by a processor to implement a data scraping method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the data crawling method of the foregoing embodiments are implemented, for example, steps S10 to S60 shown in FIG. 2. Alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the data crawling device of the foregoing embodiments are implemented, for example, the functions of modules 71 to 79 shown in FIG. 7. To avoid repetition, details are not described here again.
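For illustration only, the control flow of the data crawling method (access the first URL, resolve it to the home page of its site, and switch to unused network identification information whenever an access fails) can be sketched in Python as follows. The sketch is synchronous and does not use the Tornado asynchronous mechanism named in the embodiments. The names `fetch`, `site_root`, and `crawl_with_rotation` are hypothetical, and treating the first parsing method as reducing a URL to its scheme and host is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit


def site_root(url: str) -> str:
    """Reduce a page URL to the root URL of its site (assumed parsing rule)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def crawl_with_rotation(
    first_url: str,
    identity_pool: Iterable[str],
    fetch: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Return the home page HTML fetched with the first identity that works.

    `fetch(url, identity)` is a hypothetical callable that returns the HTML
    text on success and None on failure (for example, a blocked identity).
    """
    unused = deque(identity_pool)        # identities not yet placed in the channel
    while unused:
        identity = unused.popleft()      # assign a new identity to the channel
        page = fetch(first_url, identity)
        if page is None:
            continue                     # access failed: rotate to the next identity
        home = site_root(first_url)
        if home != first_url:            # first URL is a page, not the site root
            page = fetch(home, identity)
            if page is None:
                continue                 # home page failed: rotate as well
        return page
    return None                          # every identity in the pool was used up
```

Passing `fetch` in as a parameter keeps the rotation logic independent of the HTTP client, so the same loop could be driven by an asynchronous client in a real deployment.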
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the data crawling method of the foregoing method embodiments, or implement the functions of the modules/units of the data crawling device of the foregoing device embodiments. To avoid repetition, details are not described here again.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the functional units and modules described above is used as an example. In practical applications, the above functions may be allocated to different functional units or modules as required; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
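As a non-limiting illustration of such a division, the screening of candidate network identification information for the identification information library can be kept separate from the unit that actually sends the HTTP request. The Python sketch below assumes that the network identification information is an HTTP proxy address and that a response with status 200 and an HTML content type counts as a successful access. The function names `fetch_html_via_proxy` and `build_identity_pool`, the timeout, and these success criteria are hypothetical and are not taken from the embodiments.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List


def fetch_html_via_proxy(url: str, proxy: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with an HTML response when reached via `proxy`."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    try:
        with opener.open(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return response.status == 200 and "html" in content_type.lower()
    except (OSError, http.client.HTTPException):
        # Covers URLError, HTTPError, timeouts, and malformed responses.
        return False


def build_identity_pool(
    candidates: Iterable[str],
    test_url: str,
    check: Callable[[str, str], bool] = fetch_html_via_proxy,
) -> List[str]:
    """Keep only the candidates that can successfully access `test_url`."""
    return [proxy for proxy in candidates if check(test_url, proxy)]
```

Because `check` is a parameter, the screening unit can be exercised with a stub instead of live network access, and the request-sending unit can be replaced without touching the screening logic.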
The foregoing embodiments are intended only to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.
Claims (20)
- A data crawling method, characterized in that the data crawling method comprises:
  accessing a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance from an identification information library, and the identification information library includes a plurality of pieces of network identification information capable of successfully accessing network resources;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel, and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL;
  accessing a home page of a first website corresponding to the domain name by using the network identification information, wherein the first website includes one or more second webpages, and each second webpage includes second webpage content;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel and the first URL is a domain name, or if the home page of the first website corresponding to the domain name is successfully accessed by using the network identification information, traversing each second webpage of the first website;
  if each second webpage of the first website is successfully traversed, parsing the second webpage content according to a preset second parsing method to obtain data to be crawled; and
  if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the home page of the first website corresponding to the domain name by using the network identification information fails, or traversing the second webpages of the first website fails, assigning new network identification information from the identification information library to the identification channel by using the Tornado asynchronous mechanism, and returning to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, wherein the new network identification information is network identification information that has not previously been assigned to the identification channel.
- The data crawling method according to claim 1, wherein before the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, the data crawling method further comprises:
  obtaining network identification information on a second website from a webpage corresponding to the second website, wherein one or more pieces of network identification information exist on the second website;
  accessing a third webpage corresponding to a preset second URL by using the network identification information on the second website; and
  if the third webpage corresponding to the preset second URL is successfully accessed by using the network identification information on the second website, saving the network identification information on the second website to the identification information library.
- The data crawling method according to claim 1, wherein accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel comprises:
  sending an HTTP request to a server corresponding to the preset first URL by using the network identification information; and
  if an HTML file returned by the server in response to the HTTP request is received, determining that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
- The data crawling method according to claim 1, wherein traversing each second webpage of the first website comprises:
  obtaining each hyperlink tag in the HTML of the first website, wherein each hyperlink tag includes one or more link target attributes;
  extracting all the link target attributes from each hyperlink tag; and
  traversing, by using the network identification information, the second webpage corresponding to each link target attribute.
- The data crawling method according to any one of claims 1 to 4, wherein parsing the second webpage content according to the preset second parsing method to obtain the data to be crawled comprises:
  removing tag information from the second webpage to obtain an XML document;
  parsing the XML document to obtain a document object tree of the XML document, wherein the document object tree contains one or more pieces of text node information;
  extracting each piece of text node information from the document object tree; and
  splicing the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
- A data crawling device, characterized in that the data crawling device comprises:
  a first access module, configured to access a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance from an identification information library, and the identification information library includes a plurality of pieces of network identification information capable of successfully accessing network resources;
  a first parsing module, configured to, if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel and the first URL is not a domain name, parse the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL;
  a second access module, configured to access a home page of a first website corresponding to the domain name by using the network identification information, wherein the first website includes one or more second webpages, and each second webpage includes second webpage content;
  a traversal module, configured to, if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel and the first URL is a domain name, or if the home page of the first website corresponding to the domain name is successfully accessed by using the network identification information, traverse each second webpage of the first website;
  a second parsing module, configured to, if each second webpage of the first website is successfully traversed, parse the second webpage content according to a preset second parsing method to obtain data to be crawled; and
  an assignment module, configured to, if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the home page of the first website corresponding to the domain name by using the network identification information fails, or traversing the second webpages of the first website fails, assign new network identification information from the identification information library to the identification channel by using the Tornado asynchronous mechanism, and trigger the first access module, wherein the new network identification information is network identification information that has not previously been assigned to the identification channel.
- The data crawling device according to claim 6, wherein the data crawling device further comprises:
  a first obtaining module, configured to obtain network identification information on a second website from a webpage corresponding to the second website, wherein one or more pieces of network identification information exist on the second website;
  a third access module, configured to access a third webpage corresponding to a preset second URL by using the network identification information on the second website; and
  a saving module, configured to, if the third webpage corresponding to the preset second URL is successfully accessed by using the network identification information on the second website, save the network identification information on the second website to the identification information library.
- The data crawling device according to claim 6, wherein the first access module comprises:
  a sending submodule, configured to send an HTTP request to a server corresponding to the preset first URL by using the network identification information; and
  a determining submodule, configured to, if an HTML file returned by the server in response to the HTTP request is received, determine that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
- The data crawling device according to claim 6, wherein the traversal module comprises:
  a tag obtaining submodule, configured to obtain each hyperlink tag in the HTML of the first website, wherein each hyperlink tag includes one or more link target attributes;
  a first extraction submodule, configured to extract all the link target attributes from each hyperlink tag; and
  a webpage traversal submodule, configured to traverse, by using the network identification information, the second webpage corresponding to each link target attribute.
- The data crawling device according to any one of claims 6 to 9, wherein the second parsing module comprises:
  a removing submodule, configured to remove tag information from the second webpage to obtain an XML document;
  a document parsing submodule, configured to parse the XML document to obtain a document object tree of the XML document, wherein the document object tree contains one or more pieces of text node information;
  a second extraction submodule, configured to extract each piece of text node information from the document object tree; and
  a splicing submodule, configured to splice the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
- A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the following steps:
  accessing a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance from an identification information library, and the identification information library includes a plurality of pieces of network identification information capable of successfully accessing network resources;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel, and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL;
  accessing a home page of a first website corresponding to the domain name by using the network identification information, wherein the first website includes one or more second webpages, and each second webpage includes second webpage content;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel and the first URL is a domain name, or if the home page of the first website corresponding to the domain name is successfully accessed by using the network identification information, traversing each second webpage of the first website;
  if each second webpage of the first website is successfully traversed, parsing the second webpage content according to a preset second parsing method to obtain data to be crawled; and
  if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the home page of the first website corresponding to the domain name by using the network identification information fails, or traversing the second webpages of the first website fails, assigning new network identification information from the identification information library to the identification channel by using the Tornado asynchronous mechanism, and returning to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, wherein the new network identification information is network identification information that has not previously been assigned to the identification channel.
- The computer device according to claim 11, wherein before the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, the processor, when executing the computer-readable instructions, further implements the following steps:
  obtaining network identification information on a second website from a webpage corresponding to the second website, wherein one or more pieces of network identification information exist on the second website;
  accessing a third webpage corresponding to a preset second URL by using the network identification information on the second website; and
  if the third webpage corresponding to the preset second URL is successfully accessed by using the network identification information on the second website, saving the network identification information on the second website to the identification information library.
- The computer device according to claim 11, wherein accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel comprises:
  sending an HTTP request to a server corresponding to the preset first URL by using the network identification information; and
  if an HTML file returned by the server in response to the HTTP request is received, determining that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
- The computer device according to claim 11, wherein traversing each second webpage of the first website comprises:
  obtaining each hyperlink tag in the HTML of the first website, wherein each hyperlink tag includes one or more link target attributes;
  extracting all the link target attributes from each hyperlink tag; and
  traversing, by using the network identification information, the second webpage corresponding to each link target attribute.
- The computer device according to any one of claims 11 to 14, wherein parsing the second webpage content according to the preset second parsing method to obtain the data to be crawled comprises:
  removing tag information from the second webpage to obtain an XML document;
  parsing the XML document to obtain a document object tree of the XML document, wherein the document object tree contains one or more pieces of text node information;
  extracting each piece of text node information from the document object tree; and
  splicing the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
- One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
  accessing a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance from an identification information library, and the identification information library includes a plurality of pieces of network identification information capable of successfully accessing network resources;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel, and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL;
  accessing a home page of a first website corresponding to the domain name by using the network identification information, wherein the first website includes one or more second webpages, and each second webpage includes second webpage content;
  if the first webpage corresponding to the first URL is successfully accessed by using the network identification information in the identification channel and the first URL is a domain name, or if the home page of the first website corresponding to the domain name is successfully accessed by using the network identification information, traversing each second webpage of the first website;
  if each second webpage of the first website is successfully traversed, parsing the second webpage content according to a preset second parsing method to obtain data to be crawled; and
  if accessing the first webpage corresponding to the first URL by using the network identification information fails, or accessing the home page of the first website corresponding to the domain name by using the network identification information fails, or traversing the second webpages of the first website fails, assigning new network identification information from the identification information library to the identification channel by using the Tornado asynchronous mechanism, and returning to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, wherein the new network identification information is network identification information that has not previously been assigned to the identification channel.
- The non-volatile readable storage medium according to claim 16, wherein before the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
  obtaining network identification information on a second website from a webpage corresponding to the second website, wherein one or more pieces of network identification information exist on the second website;
  accessing a third webpage corresponding to a preset second URL by using the network identification information on the second website; and
  if the third webpage corresponding to the preset second URL is successfully accessed by using the network identification information on the second website, saving the network identification information on the second website to the identification information library.
- The non-volatile readable storage medium according to claim 16, wherein accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel comprises:
  sending an HTTP request to a server corresponding to the preset first URL by using the network identification information; and
  if an HTML file returned by the server in response to the HTTP request is received, determining that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
- The non-volatile readable storage medium according to claim 16, wherein traversing each second webpage of the first website comprises:
  obtaining each hyperlink tag in the HTML of the first website, wherein each hyperlink tag includes one or more link target attributes;
  extracting all the link target attributes from each hyperlink tag; and
  traversing, by using the network identification information, the second webpage corresponding to each link target attribute.
- The non-volatile readable storage medium according to any one of claims 16 to 19, wherein parsing the second webpage content according to the preset second parsing method to obtain the data to be crawled comprises:
  removing tag information from the second webpage to obtain an XML document;
  parsing the XML document to obtain a document object tree of the XML document, wherein the document object tree contains one or more pieces of text node information;
  extracting each piece of text node information from the document object tree; and
  splicing the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
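By way of a non-limiting sketch, the link extraction recited in claims 4, 9, 14 and 19 and the text splicing recited in claims 5, 10, 15 and 20 might look as follows in Python. The sketch swaps in the standard library's event-based `html.parser` in place of building an XML document object tree, so it illustrates the intended result rather than the claimed parsing route. `PageScanner`, `scan_page`, the resolution of relative links against a base URL, and the newline separator are hypothetical choices.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collect link targets of <a> tags and the visible text of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.texts: List[str] = []
        self._skip_depth = 0             # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative link targets against the page URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.texts.append(text)      # one entry per non-empty text node


def scan_page(html: str, base_url: str, separator: str = "\n") -> Tuple[List[str], str]:
    """Return (link targets, spliced text) for a single HTML page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.links, separator.join(scanner.texts)
```

The returned link list supplies the second webpages to traverse, and the spliced text stands in for the data to be crawled; a different separator can be passed to model another preset splicing method.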
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810594254.9A CN108897788B (en) | 2018-06-11 | 2018-06-11 | Data crawling method and device, computer equipment and storage medium |
CN201810594254.9 | 2018-06-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019237547A1 true WO2019237547A1 (en) | 2019-12-19 |
Family
ID=64344878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/106397 WO2019237547A1 (en) | 2018-06-11 | 2018-09-19 | Data crawling method and apparatus, and computer device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108897788B (en) |
WO (1) | WO2019237547A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859867A (en) * | 2020-07-20 | 2020-10-30 | 广西美立方工程咨询有限公司 | Web data extraction system based on XML and XPath and use method thereof |
CN113656737A (en) * | 2021-08-20 | 2021-11-16 | 北京百度网讯科技有限公司 | Webpage content display method and device, electronic equipment and storage medium |
CN113806732A (en) * | 2020-06-16 | 2021-12-17 | 深信服科技股份有限公司 | Webpage tampering detection method, device, equipment and storage medium |
CN114143290A (en) * | 2021-11-19 | 2022-03-04 | 国家计算机网络与信息安全管理中心广东分中心 | System and method for constructing IP proxy pool for multi-website parallel crawling |
CN116361362A (en) * | 2023-05-30 | 2023-06-30 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110245281B (en) * | 2019-05-22 | 2023-07-21 | 中国平安人寿保险股份有限公司 | Internet asset information collection method and terminal equipment |
CN112685620A (en) * | 2020-12-31 | 2021-04-20 | 山东奥邦交通设施工程有限公司 | Bidding information processing method, system, readable storage medium and device |
CN113821705B (en) * | 2021-08-30 | 2024-02-20 | 湖南大学 | Webpage content acquisition method, terminal equipment and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294369A (en) * | 2015-05-15 | 2017-01-04 | 北京国双科技有限公司 | Web data acquisition methods and device |
US20170185678A1 (en) * | 2015-12-28 | 2017-06-29 | Le Holdings (Beijing) Co., Ltd. | Crawler system and method |
CN107105071A (en) * | 2017-05-05 | 2017-08-29 | 北京京东金融科技控股有限公司 | IP call methods and device, storage medium, electronic equipment |
CN108038218A (en) * | 2017-12-22 | 2018-05-15 | 联想(北京)有限公司 | A kind of distributed reptile method, electronic equipment and server |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561814B (en) * | 2009-05-08 | 2012-05-09 | 华中科技大学 | Topic crawler system based on social labels |
CN103164446A (en) * | 2011-12-14 | 2013-06-19 | 阿里巴巴集团控股有限公司 | Webpage request information response method and webpage request information response device |
CN107066576B (en) * | 2017-04-12 | 2019-11-12 | 成都四方伟业软件股份有限公司 | A kind of big data web crawlers paging selection method and system |
CN108062413B (en) * | 2017-12-30 | 2019-05-28 | 平安科技(深圳)有限公司 | Web data processing method, device, computer equipment and storage medium |
- 2018-06-11 CN CN201810594254.9A patent/CN108897788B/en active Active
- 2018-09-19 WO PCT/CN2018/106397 patent/WO2019237547A1/en active Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806732A (en) * | 2020-06-16 | 2021-12-17 | 深信服科技股份有限公司 | Webpage tampering detection method, device, equipment and storage medium |
CN113806732B (en) * | 2020-06-16 | 2023-11-03 | 深信服科技股份有限公司 | Webpage tampering detection method, device, equipment and storage medium |
CN111859867A (en) * | 2020-07-20 | 2020-10-30 | 广西美立方工程咨询有限公司 | Web data extraction system based on XML and XPath and use method thereof |
CN111859867B (en) * | 2020-07-20 | 2024-03-12 | 广西美立方工程咨询有限公司 | Web data extraction system based on XML and XPath and use method thereof |
CN113656737A (en) * | 2021-08-20 | 2021-11-16 | 北京百度网讯科技有限公司 | Webpage content display method and device, electronic equipment and storage medium |
CN113656737B (en) * | 2021-08-20 | 2024-05-14 | 北京百度网讯科技有限公司 | Webpage content display method and device, electronic equipment and storage medium |
CN114143290A (en) * | 2021-11-19 | 2022-03-04 | 国家计算机网络与信息安全管理中心广东分中心 | System and method for constructing IP proxy pool for multi-website parallel crawling |
CN114143290B (en) * | 2021-11-19 | 2024-01-30 | 国家计算机网络与信息安全管理中心广东分中心 | System and method for constructing IP proxy pool of multi-website parallel crawling |
CN116361362A (en) * | 2023-05-30 | 2023-06-30 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
CN116361362B (en) * | 2023-05-30 | 2023-08-11 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Also Published As
Publication number | Publication date |
---|---|
CN108897788B (en) | 2023-04-07 |
CN108897788A (en) | 2018-11-27 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2019237547A1 (en) | Data crawling method and apparatus, and computer device and storage medium | |
TWI670611B (en) | Web file sending method, webpage rendering method and device, webpage rendering system | |
US10015226B2 (en) | Methods for making AJAX web applications bookmarkable and crawlable and devices thereof | |
WO2016173200A1 (en) | Malicious website detection method and system | |
US8543713B2 (en) | Computing environment arranged to support predetermined URL patterns | |
CN112073405A (en) | Webpage data loading method and device, computer equipment and storage medium | |
US8387008B2 (en) | Method for sharing a function between web contents | |
CN110221871B (en) | Webpage acquisition method and device, computer equipment and storage medium | |
CN110855766A (en) | Method and device for accessing Web resources and proxy server | |
CN103577427A (en) | Browser kernel based web page crawling method and device and browser containing device | |
CN106126693A (en) | The sending method of the related data of a kind of webpage and device | |
CN109600385B (en) | Access control method and device | |
CN112637361B (en) | Page proxy method, device, electronic equipment and storage medium | |
CN105528369B (en) | Webpage code-transferring method, device and server | |
EP3896940A1 (en) | Resource description file processing, and page resource obtaining method and device | |
CN111460254B (en) | Webpage crawling method and device based on multithreading, storage medium and equipment | |
CN103347069A (en) | Method and device for realizing network access | |
CN111723314A (en) | Webpage display method and device, electronic equipment and computer readable storage medium | |
CN109561131A (en) | A kind of method and electronic equipment of the downloading of language based on programming excel data | |
CN107688650A (en) | A kind of web page generation method and device | |
CN111680247A (en) | Local calling method, device, equipment and storage medium of webpage character string | |
US8402367B1 (en) | Smart reload pages | |
CN113127788B (en) | Page processing method, object processing method, device and equipment | |
CN104679786A (en) | Form processing method and device | |
US20140237133A1 (en) | Page download control method, system and program for ie core browser |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18922236; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 18922236; Country of ref document: EP; Kind code of ref document: A1 |