CN113515682A - Data crawling method and device, computer equipment and storage medium - Google Patents

Data crawling method and device, computer equipment and storage medium

Info

Publication number
CN113515682A
CN113515682A
Authority
CN
China
Prior art keywords
page
current page
data
extraction
target
Prior art date
Legal status
Pending
Application number
CN202110544655.5A
Other languages
Chinese (zh)
Inventor
贾波涛
Current Assignee
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110544655.5A priority Critical patent/CN113515682A/en
Publication of CN113515682A publication Critical patent/CN113515682A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The application provides a data crawling method, a data crawling device, a computer device and a storage medium. The method comprises the following steps: reading and parsing a crawler configuration file to obtain the uniform resource locator of an initial target page, which serves as the current page; acquiring the target data of the previous page of the current page, and generating and sending an access request carrying that data according to the uniform resource locator of the current page; extracting the target data of the current page from the response data of the access request; judging whether the current page has a next page; if it does, acquiring the uniform resource locator of the next page, taking the next page as the current page, and repeating the above steps until the current page has no next page; and obtaining the final target data from the target data of the final page. With this method and device, each target page can be drilled down to automatically, the page data of every level can be extracted and parsed, and the data output can be automatically merged and integrated.

Description

Data crawling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data crawling method and apparatus, a computer device, and a storage medium.
Background
At present, crawlers based on the Scrapy framework commonly use one of two crawling approaches.
The first is a crawler written by inheriting the Spider class. Its advantage is that data from different pages can be spliced together, but when drilling down through a website a crawl function must be written by hand for every call: if the website has several hierarchical levels of pages, an extraction function must be written for each level, and a large number of extraction functions have to be written to splice and extract the data, so automatic data grabbing and splicing obviously cannot be achieved.
The second is a crawler based on inheriting CrawlSpider. Its advantages are automatic drill-down, automatic extraction of the links of each level of page, and final acquisition of the data of the final page; however, it can only parse the last layer of pages, so the target data of the intermediate levels cannot be obtained.
Disclosure of Invention
The technical problem addressed is that, in the prior art, automatic page drill-down and crawling of the target data of all pages to be crawled cannot be achieved simultaneously in the data crawling process. The application provides a data crawling method, a data crawling device, a computer device and a storage medium, which are mainly used to automatically drill down to each target page during data crawling, extract and parse the page data of each level, and at the same time automatically merge and integrate the data output.
In order to achieve the above object, the present application provides a data crawling method, including:
reading and analyzing a crawler configuration file to obtain website crawling parameters;
acquiring a uniform resource locator of an initial target page of a target site from website crawling parameters, and taking the initial target page as a current page;
acquiring target data crawled by a previous page of a current page, generating an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sending the access request to a server;
receiving response data returned by the server after responding to the access request of the current page, and extracting target data of the current page from the response data;
judging whether the current page has a next page or not according to the website crawling parameters;
if the current page has a next page, acquiring the uniform resource locator of the next page, taking the next page as the current page, and looping from the step of acquiring the target data crawled by the previous page of the current page through the step of judging whether the current page has a next page, until the current page has no next page;
and if the current page does not have the next page, performing data processing on the target data obtained by the current page to obtain final target data.
In addition, in order to achieve the above object, the present application also provides a data crawling apparatus, including:
the loading module is used for reading and analyzing the crawler configuration file so as to obtain website crawling parameters;
the target positioning module is used for acquiring a uniform resource locator of an initial target page of a target site from the website crawling parameters and taking the initial target page as a current page;
the request module is used for acquiring target data crawled by a previous page of a current page, generating an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sending the access request to a server;
the target data extraction module is used for receiving response data returned after the server responds to the access request of the current page and extracting the target data of the current page from the response data;
the judging module is used for judging whether the current page has a next page or not according to the website crawling parameters;
the circulation module is used for acquiring the uniform resource locator of the next page if the current page has a next page, taking the next page as the current page, and looping from the request module through the judging module until the current page has no next page;
and the ending module is used for carrying out data processing on the target data obtained by the current page to obtain final target data if the next page does not exist in the current page.
To achieve the above object, the present application further provides a computer device, which includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, wherein the processor executes the computer readable instructions to perform the steps of the data crawling method according to any one of the above methods.
To achieve the above object, the present application further provides a computer readable storage medium, on which computer readable instructions are stored, and when the computer readable instructions are executed by a processor, the processor is caused to execute the steps of the data crawling method according to any one of the above items.
According to the data crawling method and device, the computer device and the storage medium, the starting target page of a target site and its uniform resource locator are defined through the crawler configuration file, which further defines whether each page has a previous page and/or a next page, thereby enabling automatic drill-down. The target data crawled from a previous page is passed to the next page through the access request, so the target data is handed down recursively from one level of target page to the next, and the finally obtained target data comprises the target data of all target pages. On the premise of automatic crawler configuration, the method realizes automatic drill-down through generic Uniform Resource Locator (URL) extraction, can automatically parse and extract the data of arbitrarily complex pages, avoids the sub-page parsing that previously had to be finished with hand-written code, and achieves integrated data crawling. It remedies the functional shortcomings of crawling in the prior art and breaks the past limitation that only the last layer of web pages could be parsed: on the premise of automatic link extraction and drill-down, it also guarantees that the data of every level of page can be captured. This greatly enriches the functionality of the crawler; hand-written code is no longer needed, the crawling of each web page can be completed through configuration alone, and the application scenarios of most crawlers are satisfied. It plays an important role in solving the problem of enterprise-level data crawling and greatly saves time and labor cost.
Drawings
FIG. 1 is a diagram of an application environment of a data crawling method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a data crawling method according to an embodiment of the present application;
FIG. 3 is a block diagram of a data crawling apparatus according to an embodiment of the present application;
fig. 4 is a block diagram of an internal structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is an application environment diagram of a data crawling method in an embodiment of the present application. Referring to fig. 1, the data crawling method is applied to a data crawling system. The data crawling system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
The terminal 110 reads and analyzes the crawler configuration file to obtain website crawling parameters; the terminal 110 acquires a uniform resource locator of an initial target page of a target site from the website crawling parameters, and takes the initial target page as a current page; the terminal 110 acquires target data crawled by a previous page of a current page, generates an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sends the access request to the server 120 by the terminal 110; the terminal 110 receives response data returned after the server 120 responds to the access request of the current page, and extracts target data of the current page from the response data; the terminal 110 judges whether the current page has a next page according to the website crawling parameter; if the current page has the next page, the terminal 110 obtains the uniform resource locator of the next page, takes the next page as the current page, and circularly obtains target data crawled by the previous page of the current page until judging whether the current page has the next page or not until the current page does not have the next page; if the current page does not have the next page, the terminal 110 performs data processing on the target data obtained by the current page to obtain final target data.
Fig. 2 is a schematic flow chart of a data crawling method according to an embodiment of the present application. Referring to fig. 2, the method is applied to a terminal. Based on the Scrapy framework, a crawler file created and written in advance by an engineer is installed in the terminal, and the terminal implements the following steps S100 to S700 when running the crawler file.
S100: and reading and analyzing the crawler configuration file to obtain the website crawling parameters.
Specifically, a crawler file is a set of computer readable instructions that automatically capture internet information according to certain rules. After the Scrapy project of the crawler file is created, the crawler file further comprises a start class written according to the technical scheme of the present application.
The crawler configuration file is a file written by engineering personnel according to the actual application scenario and the crawling purpose, and is used to define crawling information (such as rules and parameters). The crawler configuration file of the present application contains the website crawling parameters. The website crawling parameters define the initial target page and the other target pages to be crawled on the target website (target site), the hierarchical relations between the target pages (which pages branch from which, and which are above or below which), the Uniform Resource Locator (URL) of the initial target page, the target-data-related parameters of each target page, and the uniform-resource-locator-related parameters of each of the other target pages. The crawler configuration file may be modified by the engineer.
The target data related parameters may specifically include, but are not limited to, an extraction field of the data to be extracted, an extraction path corresponding to an extraction value of the extraction field, and the like. The target data-related parameters correspond to defining which data to extract from the target page.
The parameters related to the uniform resource locator of the target page may specifically include, but are not limited to, an extraction path of the uniform resource locator of the target page.
A Uniform Resource Locator (URL) is used to locate a resource. It is the WWW uniform resource locator and refers to the address of a web page on the network.
In a specific embodiment, the crawler configuration file may be in JSON format, or in another available format such as an Excel format.
The terminal reads and analyzes the crawler configuration file by operating the crawler file so as to transmit the website crawling parameters into the starting class of the crawler file, and the crawler file can call related data from the website crawling parameters in the crawling data process.
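As an illustration only (the application does not prescribe a concrete schema), the following Python sketch shows how such a JSON crawler configuration file might be read and parsed into website crawling parameters; the key names "start_url", "pages", "fields" and "next_page_xpath" are assumptions made for the example.

    import json

    def load_crawl_config(path="crawler_config.json"):
        """Read and parse the crawler configuration file into website crawling parameters."""
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        # An illustrative configuration might look like:
        # {
        #   "start_url": "https://example.com/list",
        #   "pages": [
        #     {"fields": {"title": "//h1/text()"}, "next_page_xpath": "//a[@class='detail']"},
        #     {"fields": {"content": "//div[@id='body']//text()"}}
        #   ]
        # }
        return config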
S200: and acquiring a uniform resource locator of an initial target page of the target site from the website crawling parameters, and taking the initial target page as a current page.
Specifically, in practical applications data generally needs to be crawled from a plurality of pages of a target site; that is, there is at least one target page, and usually more than one.
The starting target page may be the home page of the target site or website, or a user-defined page within the site or website. The starting target page is the first page from which data is crawled; the subsequent target pages are reached by paging onwards from this first page according to the page structure of the target site or website.
The uniform resource locator of the starting target page is configured in advance in the crawler configuration file. Of course, it may also be provided by the engineer (e.g., via input means) while the crawler file is running.
S300: the method comprises the steps of obtaining target data crawled by a previous page of a current page, generating an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sending the access request to a server.
Specifically, in order to ensure that the target data of each level of target page can be passed to the next level, i.e., that the next level of target page inherits the target data of the previous level, the access request of the current page carries the target data of the previous page. More specifically, the target data of the previous page of the current page is assigned to the relevant parameter of the access request, so that the access request of the current page carries it. The access request of the current page is then sent to the server. The access request is a page access request and may be denoted request.
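A minimal sketch of step S300, assuming the crawler file is a Scrapy spider: the target data already crawled from the previous page is attached to the access request through the meta dictionary, so that the callback handling the current page inherits it. The names build_request, prev_item and the "item" meta key are illustrative assumptions.

    import scrapy

    def build_request(current_url, prev_item, parse_callback):
        """Generate the access request of the current page carrying the previous page's target data."""
        return scrapy.Request(
            url=current_url,
            callback=parse_callback,         # parses the response of the current page (step S400)
            meta={"item": dict(prev_item)},  # target data of the previous page travels with the request
            dont_filter=True,                # disable Scrapy's duplicate filter for this request
        )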
S400: and receiving response data returned by the server after responding to the access request of the current page, and extracting the target data of the current page from the response data.
Specifically, after the server responds to the page access request corresponding to the current page, the response data of the access request is obtained; this response data is the response body (response), and it is returned to the terminal. Since the access request of the current page in step S300 carries the target data of the previous page, the response data (response) of the current page also carries the target data of the previous page of the current page.
Because the target data related parameters of each target page are defined or configured in the crawler configuration file, after the terminal receives the returned response data response, the target data of the current page can be extracted from the response data response corresponding to the current page through the crawler file according to the target data related parameters of the current page in the website crawling parameters. Because the response data of the current page carries the target data of the previous page, the target data of the current page containing the target data of the previous page can be extracted from the response data of the current page. That is, the target data of the current page includes the target data of the previous page.
S500: and judging whether the current page has a next page or not according to the website crawling parameters.
Specifically, because the crawler configuration file configures at least one target page, the starting target page has no previous page and has a next page; the last page has no next page and has a previous page; among other target pages, there may be more than one next page, there may be no next page, and there may be only one next page.
Specifically, the relevant parameters of the next page of the current page can be obtained according to the website crawling parameters, and if the relevant parameters of the next page of the current page do not exist in the website crawling parameters, it is determined that the next page does not exist in the current page; and if the related parameters of the next page of the current page exist in the website crawling parameters, judging that the next page exists in the current page.
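A simple sketch of the judgment in step S500, under the assumption that the website crawling parameters are held as an ordered list of per-page configurations (the names page_configs, level and next_page_xpath are illustrative): the current page has a next page only if a next-page entry is configured for it.

    def has_next_page(page_configs, level):
        """Return True if the website crawling parameters define a next page below this level."""
        current = page_configs[level]
        return level + 1 < len(page_configs) and bool(current.get("next_page_xpath"))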
S600: if the current page has the next page, acquiring the uniform resource locator of the next page, taking the next page as the current page, and repeating the steps S300-S500 until the next page does not exist in the current page.
Specifically, if the current page has a next page, the crawler file needs to further crawl target data of the next page of the current page, so that the uniform resource locator of the next page of the current page needs to be acquired, then the next page of the current page is taken as the current page, and the steps S300 to S500 are executed in a loop, and when the current page does not have the next page, the current page is the last target page.
The uniform resource locator of the next page may be preconfigured in the crawler configuration file by the engineering personnel or may be currently generated.
As the drill-down proceeds, the current page changes. In effect, the crawler drills down page by page according to the structure of the target site, traversing each target page and acquiring its target data.
S700: and if the current page does not have the next page, performing data processing on the target data obtained by the current page to obtain final target data.
Specifically, if the current page has no next page, the current page is the last target page, and the target data of the previous page is transmitted into the target data of the current page in the recursive manner, so that all the target data of all the target pages are included in the target data of the last target page. Therefore, the final target data can be obtained by performing data processing on the target data of the current page.
The data processing may include a stitching merge process. For example, the target data of the last page may be merged by a yield function. Because the target data obtained from the last target page may be messy, the data needs to be spliced and merged, so that the final target data has certain logicality and orderliness, and has better readability and reference value. The resulting final target data may be stored in the database via the pipeline of the crawler file.
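As a hedged illustration of how the spliced and merged final target data could reach storage through the pipeline of the crawler file, the following Scrapy item pipeline sketch orders the fields and collects the records; a real implementation would write to the configured database instead of an in-memory list.

    class StorePipeline:
        """Minimal item pipeline sketch: receives the merged final target data and stores it."""

        def open_spider(self, spider):
            self.rows = []                                      # stand-in for a real database connection

        def process_item(self, item, spider):
            ordered = {key: item[key] for key in sorted(item)}  # splice/merge into an ordered record
            self.rows.append(ordered)
            return item

        def close_spider(self, spider):
            spider.logger.info("stored %d final records", len(self.rows))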
The application executes steps S300-S600 in a loop, automatically drilling down to the next page and crawling the target data of each target page of the target site. The loop repeats until the last page has been drilled down to; the final target data then comprises the target data crawled from all target pages and is complete data that has been spliced and integrated.
This embodiment overcomes the inherent shortcomings of the Scrapy framework: on the premise of automatic configuration, Scrapy can perform generic Uniform Resource Locator (URL) extraction, realize automatic drill-down, and automatically parse and extract the data of any page, solving the problem of sub-page parsing that previously had to be finished with hand-written code and achieving integrated data crawling. It remedies the functional shortcomings of Scrapy when crawling with the Spider class or CrawlSpider, and breaks the past limitation that CrawlSpider could only parse the last layer of web pages: on the premise of automatic link extraction and drill-down, it also guarantees that the data of every layer of page can be captured, greatly enriching Scrapy's functionality. Hand-written code is no longer needed as it is when using the Spider class; the crawling of each target web page can be completed by configuring the crawler file, satisfying the application scenarios of most crawlers. This plays an important role in solving the problem of enterprise-level data crawling and greatly saves time and labor cost.
In one embodiment, the step S400 of extracting the target data of the current page from the response data specifically includes the following steps:
and acquiring an extraction field and an extraction value path of the current page from the website crawling parameters.
Specifically, the extraction path corresponding to the extraction value of an extraction field is the extraction value path, which is used to extract the value corresponding to that field. The website crawling parameters comprise the extraction fields of the data to be extracted on each target page and the extraction value path of each extraction field. The website crawling parameters tell the crawler file which target data to crawl from each target page, so that the crawler file visits and crawls the target data of each target page in turn according to the layout or structure of the website or site, and the integrated final target data is obtained.
The extraction value path is specifically an XPath path. XPath (XML Path Language) is a language for finding information in XML documents. Although originally intended for searching XML documents, it applies equally to searching HTML documents.
A target page may be configured with 0, 1 or multiple extraction fields. All extraction fields of the same target page are stored in one dictionary; if a target page does not need data extracted, the number of configured extraction fields is 0, and an empty dictionary is configured for that page as a placeholder so that target pages and dictionaries remain in one-to-one correspondence. Extraction fields are stored in the dictionary as key-value pairs.
The crawler file can judge whether the current page needs to extract data or not according to whether the extraction field and the extraction value path of the current page exist in the website crawling parameters or not. The extraction field and the extraction value path of the target page without data extraction are all null.
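A one-line sketch of that judgment, assuming each target page's extraction fields are held in the per-page dictionary described above (the name fields is illustrative): an empty placeholder dictionary means the page only needs drilling down, not data extraction.

    def page_needs_extraction(fields: dict) -> bool:
        """An empty placeholder dictionary means no data is extracted from this page."""
        return bool(fields)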
And extracting the target data of the current page from the response data of the current page according to the extraction field and the extraction value path of the current page.
Specifically, if the current page needs to extract data, the crawler file extracts the target data of the current page from the response data response of the current page according to the extraction field of the current page and the extraction value path of the extraction field. And the target data of the current page comprises the target data of the previous page.
In this embodiment, the target pages to be crawled can be freely defined through the crawler configuration file, and the data to be crawled on each target page is defined through the extraction value paths. This is very flexible and simple: the requirements can be modified according to the actual situation, and for a different website or site, modifying the configuration file settings is enough to automatically drill down and crawl the target data of all target pages, so the applicability is wide.
Among the large number of web pages on a web site, some pages contain multiple pieces of parallel data, such as tables. For example, a table of the scores of the students in a class contains the scores of many students, and those scores are parallel data.
If the scores of all students in the table are to be extracted, the score of each student could be set as an extraction field when the crawler configuration file is written.
This can be configured manually when there is only a small amount of data, but it is particularly troublesome when the amount of data to be extracted is large. Therefore, the score of any one student can be used as the extraction field, and the extraction value path of that student's score can be configured in the crawler configuration file, which effectively reduces the workload of writing the configuration file. The extraction field here may be the score field of one of the students, and the extraction value path is likewise the extraction value path of that student's score field.
On this basis, the extracting of the target data of the current page from the response data of the current page according to the extracting field and the extracting value path of the current page specifically includes the following steps:
if the data to be extracted corresponding to any extraction field of the current page comprises single data, extracting an extraction value corresponding to each extraction field from the response data of the current page according to the extraction field and the extraction value path of the current page, and taking all the extracted extraction values and the corresponding extraction fields as target data of the current page.
Specifically, the current page may include a plurality of extraction fields, and if all the extraction fields include a single piece of data, the extraction value of each extraction field is extracted from the response data of the current page according to the extraction value path, and all the extracted extraction values and the corresponding extraction fields are the target data of the current page.
And if the data to be extracted corresponding to at least one extraction field in the extraction field of the current page comprises a plurality of pieces of data, taking the extraction field comprising the plurality of pieces of data as a target field.
Specifically, which extracted fields to-be-extracted data in the extracted fields of the current page include multiple pieces of data can be judged according to the value range of the extracted fields in the website crawling parameter. Each extraction field containing a plurality of pieces of data is a target field.
And obtaining a parent extraction path of the extraction value path of the target field by traversing.
Extracting an extraction value corresponding to each extraction field from response data of the current page according to the extraction field of the current page, the extraction value paths of other extraction fields and the father extraction path, and taking all the extracted extraction values and the corresponding extraction fields as target data of the current page.
Specifically, the target field is one of the subfields arbitrarily selected by an engineer when writing a configuration file, and the extracted value path is also an extracted value path of the subfield. For example, the score of a certain student and the score extracted value path. The student's performance is actually a subfield of the parent field of performance. Finding the parent extraction path of the target field can locate all the subfields belonging to the same parent extraction path as the subfield, and extract the extraction values of all the subfields. At this time, the target data of the current page includes each other extraction field and the corresponding extraction value, and all subfields and their corresponding extraction values that belong to the same parent extraction path as the target field. Wherein both the sub-field and the target field belong to the extracted field. And a plurality of subfields are arranged under the same parent extraction path, one subfield is used as a target field, the parent extraction path can be obtained according to the extraction value path of the target field, and the extraction values of other subfields can be extracted according to the parent extraction path. By the method and the device, writing of the configuration file can be reduced, and simultaneously, all the extraction values of the fields containing a plurality of pieces of data and a single piece of data are extracted to obtain the complete target data of the current page.
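The following sketch illustrates the parent-path idea under stated assumptions: the configured extraction value path points at one of several parallel sub-fields (for example one student's score in a table), and generalizing it to a parent-style path matches every sibling, so all parallel values can be taken with a single XPath. The example table structure, the /tr[n] convention and the regular expression are illustrative simplifications; a real implementation would traverse the path as described above.

    import re

    def parent_extraction_path(value_xpath: str) -> str:
        """Generalize a single sub-field XPath by dropping its numeric row predicate."""
        # e.g. //table[@id="scores"]/tr[3]/td[2]/text()  ->  //table[@id="scores"]/tr/td[2]/text()
        return re.sub(r"/tr\[\d+\]", "/tr", value_xpath)

    def extract_parallel_values(response, value_xpath):
        """Extract the value of every sub-field that shares the same parent extraction path."""
        return response.xpath(parent_extraction_path(value_xpath)).getall()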
The specific content of the extracted value is determined according to the actual application scene, for example, if the extracted field is the student score, the extracted value is the numerical value of the student score; if the extraction field is the praise number, the extraction value is the praise number value; and if the extracted field is the book name, extracting the value as the book name and the like.
In one embodiment, the website crawling parameters further comprise: a data extraction range corresponding to an extraction field that contains a plurality of pieces of data.
Specifically, if only the scores of some of the students in the table of the above example are to be extracted, the scores to be extracted can be specified according to actual needs when the crawler configuration file is written. For example, the configuration of the current page in the crawler configuration file may specify that only scores in the range 90-100 are extracted. That is, a data extraction range is set for the extraction value of the extraction field, so it is not necessary to set an extraction field and an extraction value path for every value within the range. This effectively reduces the workload of the configuration file, and data in any range can be extracted. The extraction field here may be, for example, the score field of one of the students, and the extraction value path is likewise the extraction value path of that student's score field.
On this basis, the above-mentioned extracting the target data of the current page from the response data of the current page according to the extracted field and the extracted value path of the current page specifically includes the following steps:
and judging whether each extraction field of the current page contains single data or multiple pieces of data according to whether each extraction field of the current page in the website crawling parameters has a data extraction range.
The extracting value corresponding to each extracting field is extracted from the response data of the current page according to the extracting field of the current page, the extracting value path of other extracting fields and the father extracting path, and all the extracted extracting values and the corresponding extracting fields are used as the target data of the current page, including:
extracting the corresponding extraction value of each other extraction field from the response data of the current page according to the other extraction fields except the target field of the current page and the corresponding extraction value paths,
extracting all data in the value range of the target field as the extraction value of the target field according to the target field of the current page and the corresponding father extraction path,
and taking all extracted values and corresponding extracted fields of the current page as target data of the current page.
Specifically, the data to be extracted in the extracted fields except the target field all include single data, so that the single data corresponding to the extracted fields can be extracted according to the extracted value path without acquiring a parent extracted path of the extracted value path.
The target fields are all extraction fields containing a plurality of pieces of data, at this time, a parent extraction path of each target field needs to be found, and values corresponding to subfields in a value range under the parent extraction path need to be extracted according to actual needs because the parent extraction path may have values corresponding to a plurality of subfields. At this time, the target data of the current page includes other extraction fields, corresponding extraction values, subfields within a value range, and corresponding extraction values. Other data which are not in the data extraction range do not need to be extracted, so that the workload is reduced, the data extraction speed is increased, and the interference of irrelevant data on the final target data is reduced. The target field may or may not be within the range.
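A hedged sketch of applying a data extraction range: every value found under the parent extraction path is read, and only those inside the configured range are kept. The numeric bounds, the float conversion and the function name are assumptions for the example.

    def extract_in_range(response, parent_xpath, lo=90.0, hi=100.0):
        """Keep only the parallel values under the parent path that fall inside the extraction range."""
        kept = []
        for raw in response.xpath(parent_xpath).getall():
            try:
                number = float(raw.strip())
            except ValueError:
                continue                      # skip cells that are not numeric scores
            if lo <= number <= hi:
                kept.append(number)
        return kept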
Steps S300-S400 implement transferring the target data of the previous page to the target data of the current page, so that the target data of the current page carries the target data of the previous page. The specific implementation process is as follows:
Target data extraction for a target page can use Scrapy's built-in ItemLoader. The ItemLoader dynamically loads, from the crawler configuration file, the fields of the current target page and the extraction value path corresponding to each field, extracts the target data from the response data (response) of the current page, and fills it into an Item. The ItemLoader container provided by Scrapy can be configured with extraction rules for each field of an Item; the raw data is parsed by functions and assigned to the Item fields, which is very convenient. An Item provides a container to hold the scraped data, while an ItemLoader provides the mechanism for filling that container. Although Items can be populated directly through their built-in dictionary-like API, Item Loaders provide a more convenient API that parses the raw data and assigns values to the Item. ItemLoader is a flexible and efficient mechanism that is easier to extend and override for different source formats (HTML, XML, etc.) and easier to maintain, especially when the parsing rules are particularly complex and numerous.
Specifically, an ItemLoader includes an input processor and an output processor.
After the input processor of the ItemLoader receives the response data (response) and extracts data (through the add_xpath(), add_css() or add_value() methods), the results of the input processor are collected and saved in the ItemLoader (but not yet assigned to the Item).
After all the data has been collected, the ItemLoader calls the output processor to process the previously collected data and then calls ItemLoader.load_item(); the result of the output processor is the final value assigned to the Item.
The attributes of an Item are fields in dictionary format ({'key': 'value'}); a dictionary is a mutable container that can store objects of any type. The crawler file transfers the target data in the Item into meta. The meta attribute is also a dictionary. The meta parameter of an access request is used to pass information to the next function, and that information may be of any type, such as a number, string, list, dictionary or method. When a multi-level request is constructed, some data of the current level needs to be stored in the next request for later use; the target data can then be stored in the meta parameter in dictionary form. When the crawler file sends the next access request carrying meta to the server, the next response data (response) returned by the server also carries the meta tag, that is, the meta is passed along with the next response data. The crawler file then lets the target data extracted from the next response data inherit the meta data of the next access request, and in this way the target data is passed down and inherited level by level through meta. Therefore, the target data obtained from the last layer of page contains the target data extracted from all the previous layers, realizing automatic splicing and integration of the crawled data.
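The hand-off described above can be pictured with the following minimal Scrapy spider sketch, which uses ItemLoader to fill the Item and meta to pass it down; the item class, field names, start URL and XPath expressions are illustrative assumptions, not taken from the application.

    import scrapy
    from scrapy.loader import ItemLoader

    class PageItem(scrapy.Item):
        title = scrapy.Field()
        content = scrapy.Field()

    class DrillDownSpider(scrapy.Spider):
        name = "drill_down_sketch"
        start_urls = ["https://example.com/list"]

        def parse(self, response):
            inherited = response.meta.get("item", {})          # target data of the previous page
            loader = ItemLoader(item=PageItem(), response=response)
            if inherited.get("title"):
                loader.add_value("title", inherited["title"])  # re-fill the inherited data
            loader.add_xpath("content", "//div[@class='detail']//text()")
            item = loader.load_item()

            next_href = response.xpath("//a[@rel='next']/@href").get()
            if next_href:                                      # drill down, carrying the data in meta
                yield scrapy.Request(response.urljoin(next_href),
                                     callback=self.parse,
                                     meta={"item": dict(item)})
            else:                                              # last page: the Item now holds all levels
                yield item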
According to the method and the device, automatic drilling down to each target page is realized, data of each level page are extracted and analyzed, and meanwhile, data output can be automatically merged and integrated.
When a website is crawled, the data to be crawled is typically not all on one page; each page contains part of the data together with links to other pages. For example, only the article title, the article URL and the author name can be obtained on a list page; to obtain the detailed content of an article and its comments, the crawler must go to the article's detail page. This requires drilling down to the next page to acquire the data. To drill down to the next page, however, the access request of the next page must be sent to the server, and when that access request is sent, the Uniform Resource Locator (URL) of the next page must first be acquired.
In one embodiment, step S600 specifically includes the following steps:
if the next page does not have preset association with the current page, acquiring a page path of the next page from the website crawling parameters;
and extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
In particular, the website crawling parameters are also used to direct the crawler file how to obtain the URL of each target page. A Uniform Resource Locator (URL) is an access address of a web page, and the URL of the web page is obtained to request a server to access the web page. The page path is specifically an xpath path, which is a path for accessing a page. And acquiring the URL of the page according to the page path.
The page paths corresponding to all target pages are stored in a first list; each target page is also provided with a dictionary, and the dictionaries corresponding to all target pages are stored in a second list. The first list and the second list are of equal length and their stored entries correspond one to one, that is, each target page corresponds to exactly one dictionary and each page path corresponds to exactly one dictionary.
The preset association includes that the page types are the same. For example, all the content displayed on one page is displayed on a plurality of page turning pages, and in this case, the page turning pages all belong to the same type of page. More specifically, for example, a post has many reply contents thereunder, and a page cannot display all reply contents, so that pages need to be turned to obtain all reply contents thereunder, and the page-turning pages are the same type of pages.
If the current page and the next page do not belong to the same type of page, the engineering staff can set a page path of the next page in the crawler configuration file, and can extract a Uniform Resource Locator (URL) of the next page from the response data of the current page according to the path of the next page. There may be multiple next pages of the current page, and thus, Uniform Resource Locators (URLs) for the multiple next pages may be extracted.
In one embodiment, step S600 further includes the following steps:
if the next page is in preset association with the current page, taking the page path of the current page as the page path of the next page;
and extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
Specifically, if the current page and the next page belong to the same type of page, the page path of the current page is also the page path of the next page. And extracting the uniform resource locator of the next page from the response data of the current page according to the page path of the next page. At this time, the next page is the page turning page of the current page. For example, if the target data is to be extracted from the entire reply content under an individual post, a Uniform Resource Locator (URL) for each page turned page needs to be obtained.
In a specific embodiment, the uniform resource locator of the next page is extracted according to the page path of the next page and the response data of the current page, and the specific implementation manner may be as follows:
in this embodiment, a built-in function LinkExtrator is used to automatically extract the URL of the next page, so as to automatically drill down to the next page. LinkExtractor is well suited for whole-site crawling, uses LinkExtractor to extract links on the website, LinkExtractor: connection extractor: helps us extract the specified links from the response object.
The method comprises the following specific steps:
The LinkExtractor is first imported (for example, from scrapy.linkextractors import LinkExtractor). A LinkExtractor object is then instantiated; when the object is instantiated, the crawler configuration file is read and parsed to obtain the various parameters and the specified extraction rules.
A LinkExtractor object is created, and the extraction rules are described through its constructor parameters; here an XPath selector expression is passed to the restrict_xpaths parameter. That is, the page path of the next page is obtained from the crawler configuration file and passed to restrict_xpaths.
The extract_links method of the LinkExtractor object is then called and the response data (response) of the current page is passed in; links are extracted from the page contained in the response data according to the extraction rules described when the object was created, and a list is returned in which the extracted links appear one by one.
The link object has two attributes, url: extracted link, text: textual description of the link. Therefore, the link object contains the URL of the next page.
By the method, the Uniform Resource Locator (URL) of the next page of the current page can be extracted from the response data of the current page according to the page path of the next page.
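Put together, the LinkExtractor steps above correspond to a sketch like the following (the XPath comes from the crawler configuration file; the function name is illustrative):

    from scrapy.linkextractors import LinkExtractor

    def next_page_urls(response, next_page_xpath):
        """Extract the uniform resource locators of the next page(s) from the current response."""
        extractor = LinkExtractor(restrict_xpaths=next_page_xpath)  # extraction rule from the configuration
        links = extractor.extract_links(response)                   # list of Link objects
        return [link.url for link in links]                         # link.url / link.text are the two attributes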
In one embodiment, the crawler configuration file also configures crawler setting parameters. The crawler setting parameters include storage information, the database address, the port and the storage table. The data crawling method further comprises the following steps:
and crawling the operating environment according to the crawler setting parameter configuration data in the crawler configuration file.
The operation environment is the operation environment of the data crawling system. Of course, the runtime environment may also be preconfigured in the crawler file. And crawler setting parameters are configured in the crawler configuration file, so that engineering personnel can conveniently modify the configuration of the operating environment.
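As a hedged sketch only, the crawler setting parameters could be mapped onto Scrapy settings so that the pipeline knows where to store the final target data; the setting keys and configuration field names below are assumptions made for the example, not part of the application.

    import json

    def build_custom_settings(config_path="crawler_config.json"):
        """Turn the crawler setting parameters into a Scrapy custom_settings dictionary."""
        with open(config_path, encoding="utf-8") as f:
            settings_cfg = json.load(f).get("settings", {})
        return {
            "DB_HOST": settings_cfg.get("database_address", "localhost"),
            "DB_PORT": settings_cfg.get("port", 3306),
            "DB_TABLE": settings_cfg.get("storage_table", "crawl_results"),
        }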
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
FIG. 3 is a block diagram of a data crawling apparatus according to an embodiment of the present application; the data crawling apparatus includes:
the loading module 100 is configured to read and parse a crawler configuration file to obtain a website crawling parameter;
the target positioning module 200 is configured to obtain a uniform resource locator of an initial target page of a target site from the website crawling parameters, and use the initial target page as a current page;
the request module 300 is configured to obtain target data crawled from a previous page of a current page, generate an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and send the access request to a server.
The target data extraction module 400 is configured to receive response data returned by the server after responding to the access request of the current page, and extract target data of the current page from the response data, where the target data of the current page includes target data of a previous page;
the judging module 500 is configured to judge whether a next page exists in a current page according to the website crawling parameter;
a circulation module 600, configured to, if the current page has a next page, obtain the uniform resource locator of the next page, take the next page as the current page, and loop from the request module through the judging module until the current page has no next page;
the ending module 700 is configured to, if there is no next page in the current page, take the target data obtained by the current page as final target data.
In one embodiment, the target data extraction module 400 specifically includes:
the first searching module is used for acquiring an extraction field and an extraction value path of a current page from the website crawling parameters;
and the first extraction module is used for extracting the target data of the current page from the response data of the current page according to the extraction field and the extraction value path of the current page.
In one embodiment, the first extraction module specifically includes:
the first extraction unit is used for extracting an extraction value corresponding to each extraction field from the response data of the current page according to the extraction field and the extraction value path of the current page if the data to be extracted corresponding to any extraction field of the current page contains a single piece of data, and taking all the extraction fields and the extracted extraction values as the target data of the current page.
In one embodiment, the first extracting module further includes:
the screening unit is used for taking the extraction field containing a plurality of pieces of data as a target field if the data to be extracted corresponding to at least one extraction field in the extraction field of the current page contains a plurality of pieces of data;
the traversal unit is used for acquiring a parent extraction path of the extraction value path of the target field through traversal;
and the second extraction unit is used for extracting an extraction value corresponding to each extraction field from the response data of the current page according to the extraction field of the current page, the extraction value paths of other extraction fields and the father extraction path, and taking all the extraction fields and the extracted extraction values as target data of the current page.
In one embodiment, the first extracting module further includes:
and the judging unit is used for judging whether each extraction field of the current page contains a single piece of data or a plurality of pieces of data according to whether each extraction field of the current page in the website crawling parameters has a data extraction range.
In one embodiment, the second extraction unit specifically includes:
a first sub-extraction unit, configured to extract, according to other extraction fields except the target field of the current page and corresponding extraction value paths, an extraction value corresponding to each other extraction field from the response data of the current page,
a second sub-extraction unit, which is used for extracting all data in the value range of the target field as the extraction value of the target field according to the target field of the current page and the corresponding father extraction path,
and the integration unit is used for taking all the extracted fields and the extracted values of the current page as target data of the current page.
In one embodiment, the circulation module 600 specifically includes:
the second searching module is used for acquiring a page path of the next page from the website crawling parameters if the preset association does not exist between the next page and the current page;
and the second extraction module is used for extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
In one embodiment, the circulation module 600 further includes:
the third searching module is used for taking the page path of the current page as the page path of the next page if the preset association exists between the next page and the current page;
and the third extraction module is used for extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
The specific definition of the data crawling means can be referred to the definition of the data crawling method in the foregoing, and is not described in detail herein. The modules in the data crawling apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 4 is a block diagram of an internal structure of a computer device according to an embodiment of the present application. The computer device may specifically be the terminal 110 in fig. 1. As shown in fig. 4, the computer apparatus includes a processor, a memory, a network interface, an input device, a display screen, and a database connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory includes a storage medium and an internal memory. The storage medium may be a nonvolatile storage medium or a volatile storage medium. The storage medium stores an operating system and may also store computer readable instructions that, when executed by the processor, may cause the processor to implement a data crawling method. The internal memory provides an environment for the operating system and the execution of computer-readable instructions (computer programs) in the storage medium. The internal memory may also have stored therein computer-readable instructions that, when executed by the processor, cause the processor to perform a data crawling method. The network interface of the computer device is used for communicating with an external server through a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, the steps of the data crawling method in the above embodiments are implemented, for example, steps S100 to S700 shown in fig. 2 as well as extensions of the method and related steps. Alternatively, when executing the computer program, the processor implements the functions of the modules/units of the data crawling apparatus in the above embodiments, such as the functions of the modules 100 to 700 shown in fig. 3. To avoid repetition, further description is omitted here.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor. The processor is the control center of the computer device and connects the various parts of the whole computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the computer device (such as audio data or video data).
The memory may be integrated in the processor or may be provided separately from the processor.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the data crawling method in the above embodiments, such as steps S100 to S700 shown in fig. 2, as well as extensions of the method and related steps. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the data crawling apparatus in the above embodiments, such as the functions of modules 100 to 700 shown in fig. 3. To avoid repetition, details are not described here again.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods in the above embodiments may be implemented by instructing relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the instructions may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments. From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application may be embodied, substantially or in part, in the form of a software product, which is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application and is not intended to limit its scope. All equivalent structural or process changes made using the contents of the specification and drawings of the present application, or direct or indirect applications in other related technical fields, fall within the scope of protection of the present application.

Claims (10)

1. A data crawling method, the method comprising:
reading and analyzing a crawler configuration file to obtain website crawling parameters;
acquiring a uniform resource locator of an initial target page of a target site from the website crawling parameters, and taking the initial target page as a current page;
acquiring target data crawled by a previous page of the current page, generating an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sending the access request to a server;
receiving response data returned by the server after responding to the access request of the current page, and extracting target data of the current page from the response data;
judging whether the current page has a next page or not according to the website crawling parameters;
if the current page has a next page, acquiring a uniform resource locator of the next page, taking the next page as the current page, and repeating the steps from acquiring the target data crawled by the previous page of the current page to judging whether the current page has a next page, until the current page does not have a next page;
and if the current page does not have a next page, performing data processing on the target data obtained from the current page to obtain final target data.
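By way of illustration only (this sketch is not part of the claims), the drill-down loop of claim 1 might be realized along the following lines in Python; the configuration schema, the key names (start_url, pages, fields, next_page_path), the dotted-path notation, and the use of JSON response data are assumptions introduced for the sketch, not requirements of the method.

```python
import json

import requests  # assumed HTTP client; any library able to send the access request works


def get_by_path(data, path):
    """Walk a dotted extraction value path such as 'result.list.0.id' through parsed JSON data."""
    for key in path.split("."):
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data


def crawl(config_file):
    """Hypothetical drill-down crawl driven by a crawler configuration file."""
    with open(config_file, encoding="utf-8") as f:
        params = json.load(f)                        # website crawling parameters

    url = params["start_url"]                        # URL of the initial target page
    previous_data = {}                               # target data crawled from the previous page

    for level in params["pages"]:                    # one rule set per page level
        # the access request carries the target data of the previous page
        response = requests.get(url, params=previous_data, timeout=10)
        response.raise_for_status()
        response_data = response.json()

        # extract the target data of the current page and merge in the previous page's data
        current_data = dict(previous_data)
        for field, value_path in level["fields"].items():
            current_data[field] = get_by_path(response_data, value_path)

        next_path = level.get("next_page_path")
        if next_path is None:                        # no next page: current data is final
            return current_data
        url = get_by_path(response_data, next_path)  # the next page becomes the current page
        previous_data = current_data

    return previous_data  # fallback if every configured level declared a next page
```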
2. The method of claim 1, wherein extracting the target data of the current page from the response data comprises:
acquiring an extraction field and an extraction value path of the current page from the website crawling parameters;
and extracting the target data of the current page from the response data of the current page according to the extraction field and the extraction value path of the current page.
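As a purely illustrative aside (not part of the claims), the extraction field / extraction value path pairs of claim 2 could be carried in the crawler configuration file roughly as follows; the key names and the dotted-path notation are assumptions for the sketch.

```python
# Hypothetical per-page entry inside the crawler configuration file:
# each extraction field maps to an extraction value path into the response data.
page_level = {
    "fields": {
        "title":        "result.detail.title",        # a single piece of data
        "publish_date": "result.detail.publishDate",   # a single piece of data
        "attachments":  "result.detail.fileList[*]",   # a plurality of pieces of data (claims 3-5)
    },
    "next_page_path": "result.detail.nextUrl",         # where to find the next page's URL, if any
}
```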
3. The method according to claim 2, wherein the extracting the target data of the current page from the response data of the current page according to the extracted field and the extracted value path of the current page comprises:
if the data to be extracted corresponding to any extraction field of the current page comprises a single piece of data, extracting an extraction value corresponding to each extraction field from the response data of the current page according to the extraction field and the extraction value path of the current page, and taking all the extraction fields and the extracted extraction values as target data of the current page.
4. The method according to claim 3, wherein the extracting target data of the current page from the response data of the current page according to the extracted field and the extracted value path of the current page further comprises:
if the data to be extracted corresponding to at least one extraction field in the extraction fields of the current page comprises a plurality of pieces of data, taking the extraction fields comprising the plurality of pieces of data as target fields;
acquiring a parent extraction path of the extraction value path of the target field through traversal;
and extracting an extraction value corresponding to each extraction field from the response data of the current page according to the extraction field of the current page, the extraction value paths of other extraction fields and the father extraction path, and taking all the extracted extraction values and the corresponding extraction fields as target data of the current page.
5. The method according to claim 4, wherein the extracting target data of the current page from the response data of the current page according to the extracted field and the extracted value path of the current page further comprises:
judging whether each extraction field of the current page contains a single piece of data or a plurality of pieces of data according to whether each extraction field of the current page in the website crawling parameters has a data extraction range;
extracting an extraction value corresponding to each extraction field from response data of the current page according to the extraction field of the current page, the extraction value paths of other extraction fields and the parent extraction path, and taking all the extracted extraction values and the corresponding extraction fields as target data of the current page, including:
extracting the extraction value corresponding to each other extraction field from the response data of the current page according to the other extraction fields except the target field of the current page and the corresponding extraction value paths,
extracting all data in a value range corresponding to the target field as an extraction value of the target field according to the target field of the current page and the corresponding father extraction path,
and taking all extracted values and corresponding extracted fields of the current page as target data of the current page.
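Still purely for illustration, claims 3–5 distinguish extraction fields holding a single piece of data from target fields whose data extraction range covers a plurality of pieces of data, the latter being read through a parent extraction path. A minimal sketch under those assumptions follows; the `[*]` range notation and all helper names are invented for the example.

```python
RANGE_MARKER = "[*]"   # hypothetical notation marking a data extraction range


def get_by_path(data, path):
    """Walk a dotted extraction value path through already-parsed response data."""
    for key in path.split("."):
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data


def extract_page(response_data, fields):
    """Fields whose path ends with the range marker hold a plurality of pieces of data."""
    target_data = {}
    for field, value_path in fields.items():
        if value_path.endswith(RANGE_MARKER):
            # target field: read the parent extraction path and keep every value in range
            parent_path = value_path[: -len(RANGE_MARKER)].rstrip(".")
            target_data[field] = list(get_by_path(response_data, parent_path))
        else:
            # ordinary field: a single piece of data at its extraction value path
            target_data[field] = get_by_path(response_data, value_path)
    return target_data


# Example: 'attachments' is a target field containing a plurality of pieces of data.
data = {"result": {"detail": {"title": "t", "fileList": ["a.pdf", "b.pdf"]}}}
fields = {"title": "result.detail.title", "attachments": "result.detail.fileList[*]"}
print(extract_page(data, fields))  # {'title': 't', 'attachments': ['a.pdf', 'b.pdf']}
```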
6. The method according to any of claims 1-5, wherein said obtaining the uniform resource locator of the next page comprises:
if the next page does not have preset association with the current page, acquiring a page path of the next page from the website crawling parameters;
and extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
7. The method of claim 6, wherein obtaining the uniform resource locator of the next page further comprises:
if the next page and the current page have preset association, taking the page path of the current page as the page path of the next page;
and extracting the uniform resource locator of the next page according to the page path of the next page and the response data of the current page.
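As an illustrative sketch of claims 6 and 7 (again, not part of the claims), the next page's uniform resource locator can be resolved from the current page's response data, reusing the current page path when a preset association exists; the `associated` flag and key names are hypothetical.

```python
def get_by_path(data, path):
    """Walk a dotted page path through the current page's parsed response data."""
    for key in path.split("."):
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data


def next_page_url(response_data, level, current_page_path):
    """Hypothetical resolution of the next page's uniform resource locator.

    level["associated"] marks a preset association between next and current page
    (for example, plain pagination), in which case the current page path is reused.
    """
    if level.get("associated"):
        page_path = current_page_path            # claim 7: reuse the current page's page path
    else:
        page_path = level["next_page_path"]      # claim 6: path comes from the crawling parameters
    return get_by_path(response_data, page_path)
```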
8. A data crawling apparatus, characterized in that the apparatus comprises:
the loading module is used for reading and analyzing the crawler configuration file so as to obtain website crawling parameters;
the target positioning module is used for acquiring a uniform resource locator of an initial target page of a target site from the website crawling parameters and taking the initial target page as a current page;
the request module is used for acquiring target data crawled by a previous page of the current page, generating an access request carrying the target data of the previous page according to a uniform resource locator of the current page, and sending the access request to a server;
the target data extraction module is used for receiving response data returned after the server responds to the access request of the current page and extracting the target data of the current page from the response data, wherein the target data of the current page comprises the target data of the previous page;
the judging module is used for judging whether the current page has a next page or not according to the website crawling parameters;
the circulation module is used for acquiring the uniform resource locator of the next page if the current page has a next page, taking the next page as the current page, and looping through the request module to the judging module until the current page does not have a next page;
and the ending module is used for carrying out data processing on the target data obtained by the current page to obtain final target data if the next page does not exist in the current page.
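Finally, as a rough illustration of the module division in claim 8 rather than a definitive implementation, the apparatus might be skeletonized as follows; all class and method names are invented for the sketch.

```python
class DataCrawlingApparatus:
    """Illustrative skeleton mirroring the logical modules of claim 8."""

    def load(self, config_file):            # loading module
        """Read and parse the crawler configuration file into website crawling parameters."""

    def locate_target(self, params):        # target positioning module
        """Return the URL of the initial target page and take it as the current page."""

    def request(self, url, previous_data):  # request module
        """Send an access request carrying the previous page's target data to the server."""

    def extract(self, response_data):       # target data extraction module
        """Extract the current page's target data (including the previous page's data)."""

    def has_next(self, params, response):   # judging module
        """Decide from the crawling parameters whether the current page has a next page."""

    def loop(self):                         # circulation module
        """Repeat request/extract/judge until no next page remains."""

    def finish(self, target_data):          # ending module
        """Process the final page's target data into the final target data."""
```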
9. A computer device comprising a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, wherein the processor when executing the computer readable instructions performs the steps of the method of any one of claims 1-7.
10. A computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the method of any one of claims 1-7.
CN202110544655.5A 2021-05-19 2021-05-19 Data crawling method and device, computer equipment and storage medium Pending CN113515682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110544655.5A CN113515682A (en) 2021-05-19 2021-05-19 Data crawling method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110544655.5A CN113515682A (en) 2021-05-19 2021-05-19 Data crawling method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113515682A true CN113515682A (en) 2021-10-19

Family

ID=78064430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110544655.5A Pending CN113515682A (en) 2021-05-19 2021-05-19 Data crawling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113515682A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017190641A1 (en) * 2016-05-03 2017-11-09 北京京东尚科信息技术有限公司 Crawler interception method and device, server terminal and computer readable medium
CN110472126A (en) * 2018-05-10 2019-11-19 中国移动通信集团浙江有限公司 A kind of acquisition methods of page data, device and equipment
CN111814024A (en) * 2020-08-14 2020-10-23 北京斗米优聘科技发展有限公司 Distributed data acquisition method, system and storage medium
CN111797297A (en) * 2020-09-09 2020-10-20 平安国际智慧城市科技股份有限公司 Page data processing method and device, computer equipment and storage medium
CN112800305A (en) * 2021-01-12 2021-05-14 厦门渊亭信息科技有限公司 Knowledge graph data extraction method and device based on web crawler

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩前进: "Design and Implementation of a Web Online Crawler" ("Web在线爬虫的设计与实现"), 《软件》 (Software), vol. 39, no. 9, pages 86-92 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836450A (en) * 2021-11-30 2021-12-24 垒知科技集团四川有限公司 Data interface generation method for acquiring XPATH based on visual operation

Similar Documents

Publication Publication Date Title
FI118311B (en) Procedure, data processing apparatus, computer software product and arrangements for processing electronic data
US9910992B2 (en) Presentation of user interface elements based on rules
CN109684607B (en) JSON data analysis method and device, computer equipment and storage medium
CN104956362A (en) Analyzing structure of web application
CN111737692B (en) Application program risk detection method and device, equipment and storage medium
CN107122187A (en) A kind of operation system OS accesses design method
White JavaScript programmer's reference
Oh et al. Web app restructuring based on shadow DOMs to improve maintainability
CN113515682A (en) Data crawling method and device, computer equipment and storage medium
Negrino et al. JavaScript: Visual Quickstart Guide
JP4890051B2 (en) Browser screen display device and program therefor
US20130254157A1 (en) Computer-implemented methods and systems for associating files with cells of a collaborative spreadsheet
Lathkar Building Web Apps with Python and Flask: Learn to Develop and Deploy Responsive RESTful Web Applications Using Flask Framework (English Edition)
CN111831277B (en) Virtual data generation method, system, device and computer readable storage medium
Rathinam Analysis and Comparison of Different Frontend Frameworks
CN113626108A (en) Method, device, equipment and storage medium for assisted configuration of webpack
Powers JavaScript Cookbook: Programming the Web
CN109669799B (en) Error reporting control method, device and storage medium
Smith et al. JavaScript: Visual QuickStart Guide
Kaczmarek et al. Harvesting deep web data through produser involvement
Miller et al. Rewriting the web with Chickenfoot
CN112732254B (en) Webpage development method, webpage development device, computer equipment and storage medium
Chernetskyi Web application for organizing and administrating of language courses
Joshi et al. Blazor
Shryock et al. Best practices for distributing and deploying US Geological Survey Shiny applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination