WO2019237547A1 - Data crawling method and apparatus, and computer device and storage medium

Info

Publication number: WO2019237547A1
Application number: PCT/CN2018/106397
Authority: WO
Inventors: 蔡俊
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-06-11
Filing date: 2018-09-19
Publication date: 2019-12-19
Also published as: CN108897788B; CN108897788A

Abstract

A data crawling method and apparatus, a computer device, and a storage medium. The method comprises: accessing a first webpage corresponding to a first URL by using network identification information in an identification channel; if the access is successful and the first URL is not a domain name, parsing the first URL to obtain the corresponding domain name, and accessing the homepage of the first website that corresponds to the domain name; if the access to the first webpage is successful and the first URL is a domain name, or the access to the homepage of the first website is successful, traversing all second webpages of the first website; if the traversal is successful, parsing the content of the second webpages to obtain the data to be crawled; and if the access to the first webpage, the access to the homepage, or the traversal of the second webpages is unsuccessful, assigning new network identification information to the identification channel by using Tornado and returning to the step of accessing the first webpage by using the network identification information. The stability of data crawling is thereby improved.

Description

Data crawling method, device, computer equipment and storage medium

This application is based on a Chinese invention patent application filed on June 11, 2018 with application number 201810594254.9 and entitled "Data Crawling Method, Device, Computer Equipment, and Storage Medium" and claims its priority.

Technical field

The present application relates to the field of finance, and in particular, to a data crawling method, device, computer device, and storage medium.

Background technique

At present, in the financial industry, data information is becoming more and more important for financial companies. Financial companies usually need to crawl a large amount of effective information from the target website through the network.

The traditional information crawling method is to use a single IP address to crawl the target website frequently. Because the target website has an anti-crawling mechanism, the number of visits an IP address may make to the target website within a preset time period is limited. Once the number of visits within that period reaches the preset limit, crawling can resume only in the next preset time period, and the target website may even block the IP address as a malicious IP. As a result, the stability of information crawling is low.

Summary of the Invention

Based on this, in view of the above technical problem of low crawling stability, it is necessary to provide a data crawling method, device, computer equipment, and storage medium that can improve data crawling stability.

A data crawling method includes: using network identification information in an identification channel to access a first webpage corresponding to a preset first URL, wherein the network identification information in the identification channel is pre-assigned by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources; if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel, and the first URL is not a domain name, parsing the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL; using the network identification information to access the homepage of the first website corresponding to the domain name, wherein the first website includes more than one second webpage, and each second webpage includes second webpage content; if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information, traversing each second webpage of the first website; if each second webpage of the first website is traversed successfully, parsing the content of the second webpages according to a preset second parsing method to obtain the data that needs to be crawled; and if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, using the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel, and returning to the step of using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL, wherein the new network identification information refers to network identification information that has not been assigned to the identification channel.

A data crawling device includes: a first access module, configured to access a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources; a first parsing module, configured to parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is not a domain name; a second access module, configured to use the network identification information to access the homepage of the first website corresponding to the domain name, wherein the first website includes more than one second webpage, and each second webpage includes second webpage content; a traversal module, configured to traverse each second webpage of the first website if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or if the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information; a second parsing module, configured to parse the content of the second webpages according to a preset second parsing method to obtain the data that needs to be crawled if each second webpage of the first website is traversed successfully; and a dispatching module, configured to use the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel and to trigger the first access module if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, wherein the new network identification information refers to network identification information that has not been assigned to the identification channel.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the above data crawling method when executing the computer-readable instructions.

One or more non-volatile readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the data crawling method described above.

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below, and other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments of the application will be briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.

FIG. 1 is a schematic diagram of an application environment of a data crawling method according to an embodiment of the present application;

FIG. 2 is a flowchart of a data crawling method according to an embodiment of the present application;

FIG. 3 is a flowchart of an implementation of obtaining network identification information from a proxy website in a data crawling method provided by an embodiment of the present application;

FIG. 4 is a flowchart of an implementation of step S10 in a data crawling method provided by an embodiment of the present application;

FIG. 5 is a flowchart of an implementation of traversing each webpage in a data crawling method provided by an embodiment of the present application;

FIG. 6 is a flowchart of an implementation of parsing webpage content in a data crawling method provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a data crawling device according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.

Detailed description

In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.

The data crawling method provided in this application can be applied in the application environment shown in FIG. 1, where a computer device (the client) communicates with a server through a network. First, the server uses the network identification information in the identification channel, assigned in advance by the identification information database, to access the first webpage of the client corresponding to the preset first URL. If the server successfully accesses the first webpage of the client corresponding to the first URL using the network identification information in the identification channel, and the first URL is not a domain name, the server parses the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL. Then, the server uses the network identification information to access the client's homepage of the first website corresponding to the domain name. If the server successfully accesses the first webpage of the client corresponding to the first URL using the network identification information in the identification channel and the first URL is a domain name, or if the server successfully accesses the client's homepage of the first website corresponding to the domain name using the network identification information, the server traverses the second webpages of the client of the first website. Next, if the server determines that the second webpages of the client of the first website have been traversed successfully, it parses the content of the second webpages of the client according to a preset second parsing method to obtain the data that needs to be crawled. Finally, if the server fails to access the first webpage of the client corresponding to the first URL using the network identification information, or fails to access the client's homepage of the first website corresponding to the domain name using the network identification information, or fails to traverse the second webpages of the client of the first website, the server uses the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel, and returns to the step of using the network identification information in the identification channel to access the first webpage of the client corresponding to the preset first URL. The computer device may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server can be implemented as an independent server or as a server cluster composed of multiple servers.

In an embodiment, as shown in FIG. 2, a data crawling method is provided. The data crawling method is applied in the financial industry. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:

S10: Use the network identification information in the identification channel to access the first webpage corresponding to the preset first URL;

In the embodiment of the present application, the identification channel refers to a channel that temporarily stores network identification information that can successfully access network resources. The network identification information refers to the identification information of a machine in the network, that is, its IP address (Internet Protocol address). The identification information database refers to a database specifically used to store network identification information that can successfully access network resources. The network identification information in the identification channel is assigned in advance by the identification information database, which includes a plurality of pieces of network identification information that can successfully access network resources.

It should be noted that there may be multiple identification channels, and one identification channel has one piece of network identification information that can be used to successfully access network resources. The network identification information that can be used to successfully access network resources can be invalidated by external restrictions, that is, the network identification information that can be used to successfully access network resources can be blocked by a website and can no longer access the website.

Specifically, first, the IP address of an Internet access device is set to the IP address held in a channel that temporarily stores IP addresses that can successfully access network resources; then a browser on the Internet access device is used to access the first webpage corresponding to the preset first URL. The IP address in the channel is assigned in advance from the identification information database, which includes a plurality of IP addresses that can successfully access network resources.

It should be noted that the preset first URL can be http://www.xinhuanet.com/fortune/2018-02/08/c_129808453.html. The specific content of the preset first URL can be set according to actual application requirements, and is not limited here.

S20: If accessing the first webpage corresponding to the first URL using the network identification information in the identification channel is successful and the first URL is not a domain name, parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;

In the embodiments of this application, a domain name refers to the name of a computer or computer group on the Internet, consisting of a series of names separated by dots, used to identify the electronic location of the computer during data transmission. The Internet refers to a global network of computers connected to each other using a common language.

Specifically, if the browser on the Internet access device whose IP address is set to the IP address in the channel successfully accesses the first webpage corresponding to the first URL, and the first URL is not the name of a computer or computer group on the Internet composed of a series of names separated by dots (that is, it is not a domain name), the first URL is parsed according to the preset first parsing method to obtain the domain name corresponding to the first URL.

It should be noted that the preset first parsing method may be to directly extract the content between the double slash "//" and the first single slash "/" in a URL arranged in order from left to right. The specific content of the first analysis method can be set according to actual application requirements, and is not limited here.

In order to better understand step S20, an example is described below:

For example, suppose the Internet access device is a personal computer, the browser is Internet Explorer (IE), the identification channel is channel A, the IP address is "42.55.173.190, port 80", the first URL is http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, and the preset first parsing method is to directly extract the content between the double slash "//" and the first single slash "/" in a URL. If the IE browser on the personal computer whose IP address is "42.55.173.190, port 80" successfully accesses the first webpage corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, then, because this URL is not a domain name, the content between the double slash "//" and the first single slash "/" is extracted from left to right, giving news.163.com as the domain name corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html.
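The first parsing method described above can be illustrated with a minimal Python sketch. The function names `extract_domain` and `is_bare_domain` are hypothetical and do not appear in the application; in particular, treating a URL with no path, query, or fragment as "a domain name" is only one possible reading of the condition in step S20.

```python
from urllib.parse import urlsplit


def extract_domain(url: str) -> str:
    """Return the text between "//" and the first following "/" in a URL."""
    start = url.find("//")
    if start == -1:
        raise ValueError(f"no '//' found in URL: {url!r}")
    start += 2
    end = url.find("/", start)
    # If there is no further slash, everything after "//" is the domain name.
    return url[start:] if end == -1 else url[start:end]


def is_bare_domain(url: str) -> bool:
    """Assumed test for 'the first URL is a domain name': no path, query or fragment."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query and not parts.fragment


if __name__ == "__main__":
    first_url = "http://news.163.com/18/0130/12/D9DA7M9S000181BT.html"
    if not is_bare_domain(first_url):
        print(extract_domain(first_url))
```

With the URL from the example, the extracted value is news.163.com, which matches the result stated in the text.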

S30: Use the network identification information to access the homepage of the first website corresponding to the domain name, where the first website includes more than one second web page, and the second web page includes the second web page content;

Specifically, first, the IP address of the Internet access device is set to the IP address in the identification channel, and then the browser on the Internet access device is used to access the homepage of the first website corresponding to the domain name, where the first website includes more than one second webpage, and each second webpage includes second webpage content.

It should be noted that the first website may be the official website of NetEase News, and the second webpages include the first webpage. The specific content of the first website may be set according to actual application requirements, and is not limited here.

S40: If the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information, traverse each second webpage of the first website;

Specifically, if the browser on the Internet access device whose IP address is set to the IP address in the identification channel successfully accesses the first webpage corresponding to the first URL and the first URL is a domain name, or successfully accesses the homepage of the first website corresponding to the domain name, each second webpage of the first website is traversed.

S50: If each second webpage of the first website is traversed successfully, parse the content of the second webpages according to a preset second parsing method to obtain the data that needs to be crawled;

Specifically, if the browser on the Internet access device whose IP address is set to the IP address in the identification channel successfully traverses each second webpage of the first website, the content of the second webpages is parsed according to the preset second parsing method to obtain the data that needs to be crawled.

It should be noted that the preset second parsing method may be parsing a webpage using the JAXP tool, a tool for processing XML documents. JAXP (Java API for XML Processing) refers to a Java application programming interface for parsing and validating XML documents. An XML document is a markup language document that marks up electronic files to give them structure. Java is an object-oriented programming language. The specific content of the preset second parsing method may be set according to actual application requirements, and is not limited here.
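JAXP itself is a Java API. Purely as a language-neutral illustration of parsing structured page content, the following sketch uses Python's standard `xml.etree.ElementTree` module instead, which is a substitution and not the tool named in the application. The sample document, its element names, and the `parse_page` function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page content used only for illustration.
SAMPLE = """<page>
  <title>Example headline</title>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</page>"""


def parse_page(xml_text: str) -> dict:
    """Extract the title and paragraph texts from a well-formed XML page."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    # iter("p") walks the whole tree, so nesting depth does not matter.
    paragraphs = [p.text or "" for p in root.iter("p")]
    return {"title": title, "paragraphs": paragraphs}


if __name__ == "__main__":
    print(parse_page(SAMPLE))
```

Real webpages are often not well-formed XML, so a production crawler would likely need a tolerant HTML parser; that choice is outside what the text specifies.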

S60: If accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, the Tornado asynchronous mechanism is used to assign new network identification information in the identification information database to the identification channel, and the process returns to step S10. The new network identification information refers to network identification information that has not been assigned to the identification channel.

In the embodiment of the present application, Tornado is open-source web server software. The Web (World Wide Web) refers to a global, dynamically interactive, cross-platform distributed graphical information system based on hypertext and HTTP. HTTP (HyperText Transfer Protocol) is the most widely used network protocol on the Internet; it specifies in detail the rules for communication between a browser and a World Wide Web server. The Tornado asynchronous mechanism refers to a mechanism in which, when an asynchronous call is issued, the caller does not get the result immediately; after the component that actually handles the call completes, it informs the caller through status, notification, or callback.

Further, the Tornado asynchronous mechanism is implemented based on AsyncHTTPClient. AsyncHTTPClient refers to an asynchronous framework that uses a thread pool to process and send requests.

Specifically, if the browser on the Internet access device whose IP address is set to the IP address in the identification channel fails to access the first webpage corresponding to the first URL, or fails to access the homepage of the first website corresponding to the domain name, or fails to traverse each second webpage of the first website, the Tornado asynchronous mechanism based on AsyncHTTPClient is used to assign a new IP address in the identification information database to the identification channel, and the process returns to step S10. The new IP address refers to an IP address that has not been assigned to the identification channel.

In order to better understand step S60, an example is described below:

For example, suppose the Internet access device is a personal computer, the browser is Internet Explorer (IE), the IP addresses include "42.55.173.190, port 80" and "53.34.219.40, port 8118", the first URL is http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, the domain name is news.163.com, the new IP address is "121.31.100.15, port 8123", the identification information database is a first MySQL database, and the identification channels include channel A and channel B. If the IE browser on the Internet access device whose IP address is "42.55.173.190, port 80" in channel A fails to access the first webpage corresponding to http://news.163.com/18/0130/12/D9DA7M9S000181BT.html, or fails to access the homepage of the first website corresponding to news.163.com, or fails to traverse the second webpages of the first website, then, without needing to fall back on "53.34.219.40, port 8118" in channel B, the Tornado asynchronous mechanism based on AsyncHTTPClient is used to assign "121.31.100.15, port 8123" from the first MySQL database to channel A, and the process returns to step S10. Here "121.31.100.15, port 8123" is an IP address that has not previously been assigned to channel A.
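The reassignment loop of steps S10 and S60 can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: Python's standard `asyncio` stands in for the Tornado asynchronous mechanism, the identification information database is modelled as an in-memory list, and the names `IdentificationChannel`, `dispatch_new_proxy`, `crawl_with_rotation`, and the injected `fetch` coroutine are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# (proxy, url) -> page HTML, or None when the access fails. Injected by the caller.
Fetch = Callable[[str, str], Awaitable[Optional[str]]]


class IdentificationChannel:
    """Holds one proxy at a time and remembers every proxy it has been assigned."""

    def __init__(self, name: str, proxy: str) -> None:
        self.name = name
        self.proxy = proxy
        self.history = {proxy}


def dispatch_new_proxy(channel: IdentificationChannel, pool: List[str]) -> bool:
    """Assign the first pooled proxy that this channel has never held (S60)."""
    for proxy in pool:
        if proxy not in channel.history:
            channel.proxy = proxy
            channel.history.add(proxy)
            return True
    return False  # the pool has no unused proxy left for this channel


async def crawl_with_rotation(
    channel: IdentificationChannel, pool: List[str], url: str, fetch: Fetch
) -> Optional[str]:
    """Retry the access (S10) with a newly dispatched proxy after each failure."""
    while True:
        html = await fetch(channel.proxy, url)
        if html is not None:
            return html
        if not dispatch_new_proxy(channel, pool):
            return None


if __name__ == "__main__":
    async def fake_fetch(proxy: str, url: str) -> Optional[str]:
        # Stand-in for a real request: only the new address succeeds.
        return "<html>ok</html>" if proxy == "121.31.100.15:8123" else None

    channel_a = IdentificationChannel("A", "42.55.173.190:80")
    pool = ["42.55.173.190:80", "121.31.100.15:8123"]
    url = "http://news.163.com/18/0130/12/D9DA7M9S000181BT.html"
    print(asyncio.run(crawl_with_rotation(channel_a, pool, url, fake_fetch)))
```

Keeping a per-channel history is one simple way to honour the requirement that the new address has not been assigned to that channel before; a database-backed pool could record the same information in a table.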

In the embodiment corresponding to FIG. 2, first, the first webpage corresponding to the preset first URL is accessed by using the network identification information pre-assigned in the identification channel. If the first webpage is accessed successfully and the first URL is not a domain name, the first URL is parsed to obtain the corresponding domain name, and the network identification information is then used to access the homepage of the first website corresponding to the domain name. If the first webpage is accessed successfully and the first URL is a domain name, or the homepage is accessed successfully, each second webpage of the first website is traversed. Next, once each second webpage has been traversed successfully, the content of the second webpages is parsed to obtain the data to be crawled. Finally, if accessing the first webpage, accessing the homepage, or traversing the second webpages is unsuccessful, the Tornado asynchronous mechanism is used to assign new network identification information to the identification channel and the process returns to step S10. In this way, when a piece of network identification information in use becomes invalid, a new one is assigned immediately. Because the new network identification information comes from the identification information database, and every entry in that database is network identification information that can successfully access network resources, the stability of the network identification information is ensured, network resources can be accessed normally and in an orderly manner, and the stability and efficiency of data crawling are improved.

In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 3, before step S10, that is, before using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL, the data crawling method further includes:

S70: Obtain network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

Specifically, an IP address on the second website is extracted from a webpage corresponding to the second website according to a preset extraction manner, where the second website has more than one IP address.

It should be noted that the preset extraction method can be copying or screenshot, the second website can be the West Thorn proxy website, and the West Thorn proxy website refers to a website that provides domestic and international IP addresses. The preset extraction method and the specific content of the second website can be set according to actual application requirements, and there is no limitation here.

S80: Use the network identification information on the second website to access the third webpage corresponding to the preset second URL;

Specifically, first, the IP address of the Internet access device is set to the IP address extracted on the second website, and then the browser in the Internet access device is used to access the third webpage corresponding to the preset second URL.

It should be noted that the second URL may be http://www.xinhuanet.com/, and the specific content of the second URL may be set according to actual application requirements, and is not limited here.

S90: If the network identification information on the second website successfully accesses the third webpage corresponding to the preset second URL, save the network identification information on the second website to the identification information database;

Specifically, if the browser on the Internet access device whose IP address is set to the IP address extracted from the second website successfully accesses the third webpage corresponding to the preset second URL, the IP address extracted from the second website is stored in the identification information database; if the browser fails to access the third webpage corresponding to the preset second URL, the IP address extracted from the second website is saved to an invalid database.

It should be noted that the preset second URL is a URL that can be connected normally.
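Steps S70 to S90 amount to testing each candidate address against a known-reachable URL and filing it accordingly. A minimal sketch, assuming the candidates have already been extracted and the reachability test is supplied by the caller, might look like this. The function `sort_candidates` and the `can_access` callback are hypothetical names, and plain lists stand in for the identification information database and the invalid database.

```python
from typing import Callable, Iterable, List, Tuple


def sort_candidates(
    candidates: Iterable[str], can_access: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split candidate proxies into (valid, invalid) using a caller-supplied check.

    `can_access(proxy)` should return True when the preset second URL can be
    reached through that proxy; any exception it raises counts as a failure.
    """
    valid: List[str] = []
    invalid: List[str] = []
    for proxy in candidates:
        try:
            ok = bool(can_access(proxy))
        except Exception:
            ok = False
        (valid if ok else invalid).append(proxy)
    return valid, invalid


if __name__ == "__main__":
    # Stand-in check: pretend only one address is reachable.
    reachable = {"121.31.100.15:8123"}
    valid, invalid = sort_candidates(
        ["42.55.173.190:80", "121.31.100.15:8123"], lambda p: p in reachable
    )
    print("identification information database:", valid)
    print("invalid database:", invalid)
```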

In the embodiment corresponding to FIG. 3, network identification information is obtained from a website and used to access the webpage corresponding to a preset URL. If the access is successful, the network identification information is saved in the identification information database. In this way, network identification information from around the world can be obtained over the Internet from a website that provides proxy network identification information, which improves the convenience of obtaining network identification information.

In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 4, step S10, that is, using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL, specifically includes the following steps:

S101: Use network identification information to send an HTTP request to a server corresponding to a preset first URL;

In the embodiment of the present application, the HTTP request refers to a request message from a client to a server.

Specifically, first, the IP address of the Internet access device is set to the IP address in the identification channel, and then a browser on the Internet access device is used to send an HTTP request to the server corresponding to the preset first URL, where the HTTP request includes the identification information of a target resource, and the identification information uniquely identifies the target resource.

It should be noted that when the server receives the HTTP request, it verifies the identification information of the target resource in the HTTP request. After the verification is passed, the HTML file of the target resource is fed back to the sender. An HTML file is a file that can be read by a variety of web browsers to generate various types of information on a web page.

S102: If the HTML file fed back by the server according to the HTTP request is received, it is determined that access to the first webpage corresponding to the preset first URL using the network identification information in the identification channel is successful;

Specifically, if the HTML file of the target resource in the HTTP request, fed back after verification by the server corresponding to the preset first URL, is received, it is determined that access to the first webpage corresponding to the preset first URL using the IP address in the identification channel is successful; if that HTML file is not received, it is determined that access to the first webpage corresponding to the preset first URL using the IP address in the identification channel is unsuccessful.

In the embodiment corresponding to FIG. 4, an HTTP request is sent to the server corresponding to a preset URL by using network identification information. If an HTML file returned by the server is received, it is determined that access to the webpage corresponding to the preset URL is successful, so that the amount of web content that needs to be crawled is known in advance and the time required to crawl it can be predicted from that amount. Knowing in advance the time required to complete the data crawling helps ensure the progress of data crawling.
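Steps S101 and S102 can be sketched as follows. The sketch assumes that the network identification information acts as an HTTP proxy address ("host:port") and uses Python's standard `urllib` in place of a browser; the function names `build_proxy_opener`, `looks_like_html`, and `access_first_webpage` are hypothetical and are not taken from the application.

```python
import http.client
import urllib.request
from typing import Optional


def build_proxy_opener(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes requests through the proxy 'host:port'."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    return urllib.request.build_opener(handler)


def looks_like_html(content_type: str, body: bytes) -> bool:
    """Heuristic (an assumption): decide whether the response is an HTML file."""
    if "html" in content_type.lower():
        return True
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def access_first_webpage(url: str, opener, timeout: float = 10.0) -> Optional[bytes]:
    """S101: send the HTTP request; S102: succeed only if an HTML file comes back.

    Returns the HTML body on success and None on failure, so that the
    caller can decide whether new identification information is needed.
    """
    try:
        with opener.open(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            body = response.read()
    except (OSError, http.client.HTTPException):
        return None
    return body if looks_like_html(content_type, body) else None
```

A caller would pass `build_proxy_opener("219.149.46.151:3129")` as `opener`; accepting the opener as a parameter also allows a stub to be substituted when no network is available.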

In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 5, in step S40, traversing each second webpage of the first website specifically includes the following steps:

S401: Obtain each hyperlink tag of HTML in the first website;

In the embodiments of the present application, HTML (Hypertext Markup Language) describes pages whose text can contain pictures, links, and even non-text elements such as music and programs. The a tag is an HTML tag; it is a hyperlink tag used to link from one page to another. Hyperlink tags include more than one link target attribute. The link target attribute is the href attribute, which specifies the URL of the hyperlink target. A URL (Uniform Resource Locator) is the address of a standard resource on the Internet, and is a concise representation of the location and access method of a resource that can be obtained from the Internet.

Specifically, in the HTML page in the first website, each a tag is extracted, where one a tag includes more than one href attribute.

S402: Extract all link target attributes in each hyperlink tag;

Specifically, each href attribute is extracted from each a tag in the HTML page in the first website.

S403: Traverse the second webpage corresponding to each link target attribute by using the network identification information;

Specifically, first, the IP address of the Internet access device is set to the IP address in the identification channel, and then the browser in the Internet access device is used to traverse the second webpage corresponding to each href attribute of each a tag in the HTML page of the first website.

In order to better understand step S401, step S402, and step S403, an example is described below, and the specific expression is as follows:

For example, assume that the first website is Xinhuanet, an a tag in its HTML is <a href="http://mongolian.news.cn/" target="_blank" title=""></a>, the Internet access device is a personal computer, the identification channel is channel D, the IP address is "219.149.46.151, port 3129", and the browser is the IE browser. First, <a href="http://mongolian.news.cn/" target="_blank" title=""></a> is obtained from the HTML page of Xinhuanet; then http://mongolian.news.cn/ is extracted from that a tag; next, the IP address of the personal computer is set to "219.149.46.151, port 3129" in channel D; finally, the IE browser on the personal computer is used to traverse the second webpage corresponding to http://mongolian.news.cn/.

In the embodiment corresponding to FIG. 5, each a tag of the HTML in the website is obtained, all href attributes in the a tags are extracted, and the webpage corresponding to each href attribute is traversed. Traversing every webpage makes it possible to browse the webpage content without omission, which in turn improves the comprehensiveness of the data to be crawled.
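Steps S401 and S402 can be illustrated with a short sketch based on Python's standard `html.parser`. The class `HrefCollector` and the function `extract_link_targets` are hypothetical names; resolving relative links against a base URL and dropping duplicates are added assumptions that the application does not describe.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (steps S401 and S402)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                if absolute not in self.hrefs:  # assumption: skip duplicates
                    self.hrefs.append(absolute)


def extract_link_targets(html_text: str, base_url: str) -> List[str]:
    """Return the link targets found in html_text, in document order."""
    collector = HrefCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.hrefs
```

Step S403 would then request each returned URL with the network identification information currently held in the identification channel.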

In one embodiment, the data crawling method is applied in the financial industry. As shown in FIG. 6, in step S50, parsing the content of the second webpage according to a preset second parsing method to obtain the data to be crawled specifically includes the following steps:

S501: Remove the tag information of the second webpage to obtain an XML document.

Specifically, the <html> tag information and </html> tag information of the second webpage are removed to obtain an XML document. The <html> tag information and </html> tag information are HTML tags, that is, markup tags of the Hypertext Markup Language.

S502: Parse the XML document to obtain a document object tree in the XML document.

In the embodiment of the present application, the document object tree refers to a tree constructed by a Document object. The Document object refers to the document of the web page in the browser window. The document object tree contains more than one text node information.

Specifically, the XML document is parsed to obtain a document object tree in the XML document, where the document object tree includes more than one DOM node information.

It should be noted that DOM node information refers to a DOM object in an XML document, and a DOM object refers to a collection of nodes or information pieces organized in a hierarchical structure.

S503: Extract each text node information in the document object tree.

In this embodiment, the text node information is DOM node information.

Specifically, each DOM node information in the document object tree is extracted.

S504: The information of each text node is spliced according to a preset splicing method to obtain data to be scraped.

Specifically, the DOM node information is spliced according to a preset splicing method to obtain data to be crawled.

It should be noted that, according to the preset splicing method, the data information can be spliced in the order from top to bottom or the data information can be spliced in the order from left to right.

In order to better understand step S501, step S502, step S503, and step S504, an example is described below, and the specific expression is as follows:

For example, if the second webpage is a predefined weather forecast webpage, the <html> tag information and </html> tag information of the weather webpage are removed to obtain <head><title>Shenzhen</title></head><body><h1>will have rain</h1><p>in the coming week</p></body>. Parsing <head><title>Shenzhen</title></head><body><h1>will have rain</h1><p>in the coming week</p></body> yields <title>Shenzhen</title>, <h1>will have rain</h1>, and <p>in the coming week</p>. Next, "Shenzhen", "will have rain", and "in the coming week" are extracted from <title>Shenzhen</title>, <h1>will have rain</h1>, and <p>in the coming week</p>, and are spliced in order from top to bottom to obtain "Shenzhen will have rain in the coming week", where <head>, </head>, <title>, </title>, <body>, </body>, <h1>, </h1>, <p>, and </p> are HTML tags.

In the embodiment corresponding to FIG. 6, first, an XML document is obtained by removing the tag information of a webpage, and then the XML document is parsed to obtain the document object tree of the XML document. Next, each text node information in the document object tree is extracted, and the text node information is spliced to obtain the data that needs to be crawled. By converting the XML document into a simple document object tree, loading that tree into memory, and then operating on the easy-to-use DOM objects, the DOM node information can be parsed simply and efficiently, thereby improving the speed of data crawling.
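Steps S501 to S504 can be sketched with Python's standard `xml.etree.ElementTree`. The function names are hypothetical, and the sketch makes one assumption that the application does not state: after the <html> and </html> tags are removed, the remaining markup has two top-level elements, so it is wrapped in a synthetic <root> element to make it well-formed XML before parsing.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def strip_html_wrapper(page: str) -> str:
    """S501: remove the <html> and </html> tag information."""
    return re.sub(r"</?html\b[^>]*>", "", page, flags=re.IGNORECASE).strip()


def parse_document_tree(fragment: str) -> ET.Element:
    """S502: parse the remaining markup into a tree.

    Assumption: the fragment is wrapped in a synthetic <root> element,
    because <head>...</head><body>...</body> alone is not well-formed XML.
    """
    return ET.fromstring("<root>" + fragment + "</root>")


def text_nodes(tree: ET.Element) -> List[str]:
    """S503: collect non-empty text nodes in document (top-to-bottom) order."""
    return [text.strip() for text in tree.itertext() if text.strip()]


def extract_crawled_text(page: str, separator: str = " ") -> str:
    """S504: splice the text nodes in top-to-bottom order."""
    tree = parse_document_tree(strip_html_wrapper(page))
    return separator.join(text_nodes(tree))
```

Applied to the weather example above, `extract_crawled_text` returns "Shenzhen will have rain in the coming week". Real-world pages are often not well-formed XML, in which case `ET.fromstring` raises `ParseError`; a tolerant HTML parser would be needed there.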

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

In one embodiment, a data crawling device is provided, and the data crawling device corresponds one-to-one to the data crawling method in the above embodiments. As shown in FIG. 7, the data crawling device includes a first access module 71, a first parsing module 72, a second access module 73, a traversal module 74, a second parsing module 75, an assignment module 76, a first obtaining module 77, a third access module 78, and a saving module 79. The detailed description of each functional module is as follows:

The first access module 71 is configured to access the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, where the network identification information in the identification channel is allocated in advance by the identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;

The first parsing module 72 is configured to: if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel, and the first URL is a non-domain name, parse the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;

The second access module 73 is configured to access the homepage of the first website corresponding to the domain name by using the network identification information, wherein the first website includes more than one second webpage, and the second webpage includes second webpage content;

The traversal module 74 is configured to traverse each second webpage of the first website if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or if the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information;

The second parsing module 75 is configured to parse the content of the second webpage according to a preset second parsing method if the traversal of each second webpage of the first website is successful, to obtain data that needs to be crawled.

The assignment module 76 is configured to: if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, use the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel and trigger the first access module 71. The new network identification information refers to network identification information that has not been assigned to the identification channel.
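The reassign-and-retry behaviour of module 76 can be sketched as follows. The application relies on the Tornado asynchronous mechanism; to stay self-contained, this sketch substitutes Python's standard `asyncio` and is therefore only an approximation of the idea, not the application's implementation. `IdentificationChannel`, `crawl_with_reassignment`, and the `attempt` coroutine are hypothetical names, and a plain list stands in for the identification information database.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class IdentificationChannel:
    """Holds the identification information currently in use."""

    def __init__(self, database: List[str]) -> None:
        # Entries that have not yet been assigned to this channel.
        self._unassigned = list(database)
        self.current: Optional[str] = None

    def assign_new(self) -> bool:
        """Assign not-yet-used identification information; False if none is left."""
        if not self._unassigned:
            self.current = None
            return False
        self.current = self._unassigned.pop(0)
        return True


async def crawl_with_reassignment(
    channel: IdentificationChannel,
    attempt: Callable[[str], Awaitable[Optional[str]]],
) -> Optional[str]:
    """Run 'attempt' with fresh identification information until it succeeds.

    'attempt' stands for the whole access/traverse/parse sequence and
    returns None when any of those steps is unsuccessful.
    """
    while channel.assign_new():
        result = await attempt(channel.current)
        if result is not None:
            return result
    return None  # every entry in the database has been tried
```

Because each entry is removed from the unassigned pool when it is handed out, an entry that has already failed is never assigned to the channel again, which matches the definition of new network identification information given above.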

Further, before the network identification information in the identification channel is used to access the first webpage corresponding to the preset first web address, the data crawling device further includes:

A first obtaining module 77, configured to obtain network identification information on a second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

A third access module 78, configured to access the third webpage corresponding to the preset second web address by using the network identification information on the second website;

The saving module 79 is configured to save the network identification information on the second website to the identification information database if the network identification information on the second website successfully accesses the third webpage corresponding to the preset second URL.

Further, the first access module 71 includes:

A sending sub-module 711, configured to send an HTTP request to a server corresponding to a preset first URL by using network identification information;

A determining sub-module 712, configured to determine, if the HTML file fed back by the server according to the HTTP request is received, that the first webpage corresponding to the preset first URL is successfully accessed using the network identification information in the identification channel.

Further, the traversal module 74 includes:

An acquisition tag submodule 741, configured to acquire each hyperlink tag of HTML in the first website, where the hyperlink tag includes more than one link target attribute;

A first extraction submodule 742, configured to extract all link target attributes in each hyperlink tag;

The web page traversing submodule 743 is configured to traverse the second web page corresponding to each link target attribute by using the network identification information.

Further, the second parsing module 75 includes:

A removal sub-module 751, configured to remove tag information of the second webpage to obtain an XML document;

The parsing document sub-module 752 is used to parse the XML document to obtain a document object tree in the XML document, where the document object tree contains more than one text node information;

A second extraction submodule 753, configured to extract information of each text node in the document object tree;

The splicing sub-module 754 is configured to splice each text node information according to a preset splicing method to obtain data to be scraped.

For the specific limitation of the data crawling device, refer to the foregoing limitation on the data crawling method, and details are not described herein again. Each module in the above data crawling device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor in the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device is used to store data related to the data crawling method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a data crawling method.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the data crawling method of the foregoing embodiments are implemented, for example, steps S10 to S60 shown in FIG. 2. Alternatively, when the processor executes the computer-readable instructions, the functions of the modules / units of the data crawling device in the foregoing embodiment are implemented, for example, the functions of modules 71 to 79 shown in FIG. 7. To avoid repetition, details are not repeated here.

In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the data crawling method in the foregoing method embodiment, or implement the functions of each module / unit of the data crawling device in the foregoing device embodiment. To avoid repetition, details are not repeated here.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by computer-readable instructions instructing related hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to describe the technical solutions of the present application, and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should be included within the protection scope of this application.

Claims

A data crawling method, characterized in that the data crawling method includes:

Using the network identification information in an identification channel to access a first webpage corresponding to a preset first URL, wherein the network identification information in the identification channel is assigned in advance by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;

If access to the first webpage corresponding to the first URL using the network identification information in the identification channel is successful, and the first URL is a non-domain name, parsing the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;

Using the network identification information to access a homepage of a first website corresponding to the domain name, wherein the first website includes more than one second web page, and the second web page includes second web page content;

If accessing the first webpage corresponding to the first URL using the network identification information in the identification channel is successful and the first URL is a domain name, or accessing the homepage of the first website corresponding to the domain name using the network identification information is successful, traversing each second webpage of the first website;

If it is successful to traverse each second webpage of the first website, analyze the content of the second webpage according to a preset second parsing method to obtain data that needs to be crawled;

If using the network identification information to access the first webpage corresponding to the first URL is unsuccessful, or using the network identification information to access the homepage of the first website corresponding to the domain name is unsuccessful, or traversing each second webpage of the first website is unsuccessful, using the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel, and returning to the step of using the network identification information in the identification channel to access the first webpage corresponding to the preset first URL, wherein the new network identification information refers to network identification information that has not been assigned to the identification channel.
The data crawling method according to claim 1, wherein before the step of using the network identification information in the identification channel to access a first webpage corresponding to a preset first URL, the data crawling method further comprises:

Obtaining network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

Using the network identification information on the second website to access a third webpage corresponding to a preset second URL;

If the access to the third webpage corresponding to the preset second web address using the network identification information on the second website is successful, the network identification information on the second website is stored in the identification information database.
The data crawling method according to claim 1, wherein the accessing the first webpage corresponding to the preset first web address by using the network identification information in the identification channel comprises:

Sending an HTTP request to the server corresponding to the preset first URL by using the network identification information;

If the HTML file fed back by the server according to the HTTP request is received, it is determined that the first webpage corresponding to the preset first web address is successfully accessed using the network identification information in the identification channel.
The data crawling method according to claim 1, wherein traversing each second webpage of the first website comprises:

Acquiring each hyperlink tag of HTML in the first website, wherein the hyperlink tag includes more than one link target attribute;

Extracting all of the link target attributes in each hyperlink tag;

The network identification information is used to traverse a second webpage corresponding to each of the link target attributes.
The data crawling method according to any one of claims 1 to 4, wherein the parsing the content of the second webpage according to a preset second parsing method to obtain data to be crawled comprises:

Removing the tag information of the second webpage to obtain an XML document;

Parse the XML document to obtain a document object tree in the XML document, where the document object tree includes more than one text node information;

Extracting information of each text node in the document object tree;

The information of each text node is stitched according to a preset stitching method to obtain data to be scraped.
A data crawling device, characterized in that the data crawling device includes:

A first access module, configured to access a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is pre-assigned by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;

A first parsing module, configured to: if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel, and the first URL is a non-domain name, parse the first URL according to a preset first parsing method to obtain a domain name corresponding to the first URL;

A second access module, configured to access the homepage of the first website corresponding to the domain name by using the network identification information, wherein the first website includes more than one second webpage, and the second webpage includes second webpage content;

A traversal module, configured to traverse each second webpage of the first website if the first webpage corresponding to the first URL is successfully accessed using the network identification information in the identification channel and the first URL is a domain name, or if the homepage of the first website corresponding to the domain name is successfully accessed using the network identification information;

A second parsing module, configured to parse the content of the second webpage according to a preset second parsing method if the traversal of each second webpage of the first website is successful, to obtain data that needs to be crawled;

An assignment module, configured to: if accessing the first webpage corresponding to the first URL using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, use the Tornado asynchronous mechanism to assign new network identification information in the identification information database to the identification channel and trigger the first access module, wherein the new network identification information refers to network identification information that has not been assigned to the identification channel.
The data crawling device according to claim 6, wherein the data crawling device further comprises:

A first obtaining module, configured to obtain network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

A third access module, configured to use the network identification information on the second website to access a third webpage corresponding to a preset second URL;

A saving module, configured to save the network identification information on the second website to the identification information database if the third webpage corresponding to the preset second URL is successfully accessed using the network identification information on the second website.
The data crawling device according to claim 6, wherein the first access module comprises:

A sending submodule, configured to use the network identification information to send an HTTP request to a server corresponding to the preset first URL;

A determining submodule is configured to, if the HTML file fed back by the server according to the HTTP request is received, determine that the first webpage corresponding to the preset first URL is successfully accessed using the network identification information in the identification channel.
The data crawling device according to claim 6, wherein the traversal module comprises:

An acquisition tag submodule, configured to acquire each hyperlink tag of the HTML in the first website, wherein the hyperlink tag includes more than one link target attribute;

A first extraction submodule, configured to extract all the link target attributes in each hyperlink tag;

A web page traversing submodule is configured to use the network identification information to traverse a second web page corresponding to each of the link target attributes.
The data crawling device according to any one of claims 6 to 9, wherein the second parsing module comprises:

A removing sub-module for removing tag information of the second webpage to obtain an XML document;

The parsing document submodule is used to parse the XML document to obtain a document object tree in the XML document, where the document object tree includes more than one text node information;

A second extraction submodule, configured to extract information of each text node in the document object tree;

The splicing sub-module is configured to splice the respective text node information according to a preset splicing method to obtain data to be scraped.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:

Using the network identification information in an identification channel to access a first webpage corresponding to a preset first URL, wherein the network identification information in the identification channel is assigned in advance by an identification information database, and the identification information database includes multiple pieces of network identification information that can successfully access network resources;

If access to the first webpage corresponding to the first URL using the network identification information in the identification channel is successful, and the first URL is a non-domain name, parsing the first URL according to a preset first parsing method to obtain the domain name corresponding to the first URL;

Using the network identification information to access a homepage of a first website corresponding to the domain name, wherein the first website includes more than one second web page, and the second web page includes second web page content;

If accessing the first webpage corresponding to the first URL using the network identification information in the identification channel is successful and the first URL is a domain name, or accessing the homepage of the first website corresponding to the domain name using the network identification information is successful, traversing each second webpage of the first website;

If it is successful to traverse each second webpage of the first website, analyze the content of the second webpage according to a preset second parsing method to obtain data that needs to be crawled;

If it is unsuccessful to use the network identification information to access the first webpage corresponding to the first URL, or to use the network identification information to access the homepage of the first website corresponding to the domain name, or to traverse the first website If each of the second webpages is unsuccessful, the Tornado asynchronous mechanism is used to assign new network identification information in the identification information database to the identification channel, and return to execute the access preset using the network identification information in the identification channel In the step of the first webpage corresponding to the first web address, the new network identification information refers to network identification information that has not been assigned to the identification channel.
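As a non-limiting illustration of the reassignment step above, the following Python sketch tries each not-yet-assigned piece of network identification information in turn until an access succeeds. It assumes, for illustration only, that the network identification information is a proxy address string. It uses a plain synchronous loop rather than the Tornado asynchronous mechanism named in the claim, and `crawl_with_rotation`, `proxy_pool`, and `fetch` are hypothetical names that do not appear in the application.

```python
from typing import Callable, Iterable, Optional


def crawl_with_rotation(
    url: str,
    proxy_pool: Iterable[str],
    fetch: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Try each unused entry of the pool until one access succeeds.

    `fetch(url, proxy)` is a hypothetical callable that returns the page
    body on success and None on failure.
    """
    used = set()
    for proxy in proxy_pool:
        if proxy in used:
            # "New" identification information must not have been
            # assigned to the channel before, so skip repeats.
            continue
        used.add(proxy)
        body = fetch(url, proxy)
        if body is not None:
            return body
    # Every entry in the pool failed.
    return None
```

Injecting `fetch` keeps the rotation logic separate from the network access, so either part may be replaced (for example, by an asynchronous client) without changing the other.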
The computer device according to claim 11, wherein before the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, the processor further implements the following steps when executing the computer-readable instructions:

Obtaining network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

Using the network identification information on the second website to access a third webpage corresponding to a preset second URL;

If accessing the third webpage corresponding to the preset second URL by using the network identification information on the second website is successful, storing the network identification information on the second website in the identification information database.
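By way of example only, the screening of candidate network identification information described above might be sketched as follows. The `probe` callable stands in for the access to the third webpage and is a hypothetical parameter, as is the function name. The returned list stands in for the identification information database, whose actual storage form is not specified here.

```python
from typing import Callable, Iterable, List


def collect_working_candidates(
    candidates: Iterable[str],
    probe: Callable[[str], bool],
) -> List[str]:
    """Keep only the candidates for which the test access succeeds.

    `probe(candidate)` is a hypothetical callable that returns True when
    the third webpage could be accessed using that candidate.
    """
    working: List[str] = []
    for candidate in candidates:
        # Store each successful candidate once, preserving first-seen order.
        if candidate not in working and probe(candidate):
            working.append(candidate)
    return working
```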
The computer device according to claim 11, wherein the accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel comprises:

Sending an HTTP request to the server corresponding to the preset first URL by using the network identification information;

If an HTML file fed back by the server in response to the HTTP request is received, determining that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
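One possible way to realize the success determination above is sketched below using the Python standard library. The sketch assumes that the network identification information is an HTTP proxy address and that "an HTML file is received" can be judged from the Content-Type header or the start of the response body. Both are illustrative assumptions, and `looks_like_html` and `fetch_html` are hypothetical names.

```python
import http.client
import urllib.request
from typing import Optional


def looks_like_html(content_type: str, body: bytes) -> bool:
    """Heuristic check that a response carries an HTML document."""
    if "html" in content_type.lower():
        return True
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def fetch_html(url: str, proxy: str, timeout: float = 10.0) -> Optional[bytes]:
    """Send an HTTP request through `proxy`; return the body if it is HTML.

    Returns None when the request fails or the response is not HTML, which
    the caller may treat as an unsuccessful access.
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    try:
        with opener.open(url, timeout=timeout) as response:
            body = response.read()
            content_type = response.headers.get("Content-Type", "")
    except (OSError, http.client.HTTPException, ValueError):
        return None
    return body if looks_like_html(content_type, body) else None
```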
The computer device according to claim 11, wherein the traversing each second webpage of the first website comprises:

Acquiring each hyperlink tag of HTML in the first website, wherein the hyperlink tag includes more than one link target attribute;

Extracting all of the link target attributes in each hyperlink tag;

Traversing, by using the network identification information, the second webpage corresponding to each of the link target attributes.
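For illustration, the acquisition of hyperlink tags and extraction of link target attributes might be sketched with the standard `html.parser` module as follows. The sketch assumes that the hyperlink tag is the HTML `a` element and that the link target attribute is its `href` attribute. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag as an absolute URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative targets against the page URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html: str, base_url: str) -> List[str]:
    """Return the distinct link targets found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Each returned URL would then be accessed in turn by using the network identification information to complete the traversal.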
The computer device according to any one of claims 11 to 14, wherein the parsing the content of the second webpage according to a preset second parsing method to obtain data to be crawled comprises:

Removing the tag information of the second webpage to obtain an XML document;

Parsing the XML document to obtain a document object tree of the XML document, wherein the document object tree includes more than one piece of text node information;

Extracting information of each text node in the document object tree;

Splicing the pieces of text node information according to a preset splicing method to obtain the data to be crawled.
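As a minimal sketch of the last three steps above, the following Python code parses an XML document into a document object tree with `xml.dom.minidom`, collects the text nodes, and splices them. It starts from an already well-formed XML document. Joining the stripped text with a separator is only one assumed example of a preset splicing method, and `collect_text_nodes` and `splice_text` are hypothetical names.

```python
from typing import List
from xml.dom import minidom


def collect_text_nodes(node) -> List[str]:
    """Walk the document object tree and return non-empty text node data."""
    texts: List[str] = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            stripped = child.data.strip()
            if stripped:
                texts.append(stripped)
        else:
            # Descend into elements; childless nodes contribute nothing.
            texts.extend(collect_text_nodes(child))
    return texts


def splice_text(xml_source: str, separator: str = " ") -> str:
    """Parse `xml_source` and join its text nodes in document order."""
    document = minidom.parseString(xml_source)
    return separator.join(collect_text_nodes(document))
```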
One or more non-volatile readable storage media storing computer-readable instructions, characterized in that when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the following steps:

Accessing a first webpage corresponding to a preset first URL by using network identification information in an identification channel, wherein the network identification information in the identification channel is assigned in advance from an identification information database, and the identification information database includes multiple pieces of network identification information capable of successfully accessing network resources;

If accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel is successful, and the first URL is a non-domain name, performing a first parsing on the first URL to obtain the domain name corresponding to the first URL;

Using the network identification information to access a homepage of a first website corresponding to the domain name, wherein the first website includes more than one second web page, and the second web page includes second web page content;

If accessing the first webpage corresponding to the first URL by using the network identification information in the identification channel is successful and the first URL is a domain name, or if accessing the homepage of the first website corresponding to the domain name by using the network identification information is successful, traversing each second webpage of the first website;

If traversing each second webpage of the first website is successful, parsing the content of the second webpage according to a preset second parsing method to obtain the data that needs to be crawled;

If accessing the first webpage corresponding to the first URL by using the network identification information is unsuccessful, or accessing the homepage of the first website corresponding to the domain name by using the network identification information is unsuccessful, or traversing each second webpage of the first website is unsuccessful, assigning new network identification information from the identification information database to the identification channel by using the Tornado asynchronous mechanism, and returning to the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, wherein the new network identification information refers to network identification information that has not yet been assigned to the identification channel.
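As one possible reading of the domain-name step above, the following sketch treats a "non-domain name" URL as one that points to a specific page rather than to the site root, and derives the homepage address from its host part with `urllib.parse.urlsplit`. This interpretation and the names `is_bare_domain` and `homepage_for` are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def is_bare_domain(url: str) -> bool:
    """Return True when the URL has no path, query, or fragment beyond the root."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def homepage_for(url: str) -> str:
    """Build the homepage URL from the scheme and host of `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"
```

Under this reading, a URL for which `is_bare_domain` returns False would first be reduced with `homepage_for`, and the resulting homepage would be accessed before the second webpages are traversed.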
The non-volatile readable storage medium according to claim 16, wherein before the step of accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel, the one or more processors further perform the following steps when the computer-readable instructions are executed by the one or more processors:

Obtaining network identification information on the second website from a webpage corresponding to the second website, where more than one network identification information exists on the second website;

Using the network identification information on the second website to access a third webpage corresponding to a preset second URL;

If accessing the third webpage corresponding to the preset second URL by using the network identification information on the second website is successful, storing the network identification information on the second website in the identification information database.
The non-volatile readable storage medium according to claim 16, wherein the accessing the first webpage corresponding to the preset first URL by using the network identification information in the identification channel comprises:

Sending an HTTP request to the server corresponding to the preset first URL by using the network identification information;

If an HTML file fed back by the server in response to the HTTP request is received, determining that the first webpage corresponding to the preset first URL is successfully accessed by using the network identification information in the identification channel.
The non-volatile readable storage medium according to claim 16, wherein the traversing each second webpage of the first website comprises:

Acquiring each hyperlink tag of HTML in the first website, wherein the hyperlink tag includes more than one link target attribute;

Extracting all of the link target attributes in each hyperlink tag;

Traversing, by using the network identification information, the second webpage corresponding to each of the link target attributes.
The non-volatile readable storage medium according to any one of claims 16 to 19, wherein the parsing the content of the second webpage according to a preset second parsing method to obtain the data that needs to be crawled comprises:

Removing the tag information of the second webpage to obtain an XML document;

Parsing the XML document to obtain a document object tree of the XML document, wherein the document object tree includes more than one piece of text node information;

Extracting information of each text node in the document object tree;

Splicing the pieces of text node information according to a preset splicing method to obtain the data to be crawled.