WO2019019673A1

WO2019019673A1 - Website data crawling method and apparatus, computer device and readable storage medium

Info

Publication number: WO2019019673A1
Application number: PCT/CN2018/080126
Authority: WO
Inventors: 李江华; 李武奇
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2017-07-26
Filing date: 2018-03-23
Publication date: 2019-01-31
Also published as: CN107729344B; CN107729344A

Abstract

Disclosed is a website data crawling method, comprising: acquiring a data identifier and a generation date of website data to be crawled; acquiring a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling and outputting the website data to be crawled with a generation date earlier than the generation date of the locally stored website data; and when the format of the crawled website data to be crawled is identical to the format of the locally stored website data, outputting the locally stored website data with a generation date identical to the generation date of the website data to be crawled.

Description

Website data crawling method, device, computer device and readable storage medium

Cross-reference to related applications

This application claims priority to Chinese Patent Application No. 201710620026X, filed with the Chinese Patent Office on July 26, 2017 and entitled "Website data crawling method, device, computer equipment and readable storage medium", the entire content of which is incorporated herein by reference.

Technical field

The application relates to a website data crawling method, device, computer device and readable storage medium.

Background

Crawling technology obtains and parses web page information through a URL link address, extracts all URL link addresses contained in it, then obtains further web page information through the extracted URL link addresses, and repeats this cycle.

However, the inventors realized that traditional crawling techniques crawl all of the data at once and must return the result immediately. Because the amount of data to crawl is large and the crawl takes a long time, the crawled data is output and displayed slowly.

Summary of the invention

According to various embodiments disclosed herein, a website data crawling method, apparatus, computer device, and readable storage medium are provided.

A website data crawling method, including:

Obtaining a data identifier and a generation date of website data to be crawled;

Obtaining a generation date of locally stored website data corresponding to the data identifier;

When the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;

Outputting the crawled website data to be crawled whose generation date is before the generation date of the locally stored website data;

Comparing the format of the crawled website data to be crawled with the format of the locally stored website data; and

When the format of the crawled website data to be crawled is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.

A website data crawling device, comprising:

an obtaining module, configured to obtain a data identifier and a generation date of website data to be crawled, and to obtain a generation date of locally stored website data corresponding to the data identifier;

a crawling module, configured to: when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data;

a first output module, configured to output the crawled website data to be crawled whose generation date is before the generation date of the locally stored website data;

a comparison module, configured to compare the format of the crawled website data to be crawled with the format of the locally stored website data; and

a second output module, configured to: when the format of the crawled website data to be crawled is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.

A computer device comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:

One or more non-transitory computer readable instruction storage media storing computer readable instructions, when executed by one or more processors, cause one or more processors to perform the steps of:

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the present invention will be apparent from the description, drawings and claims.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without any creative work.

FIG. 1 is an application environment diagram of a website data crawling method according to one or more embodiments.

FIG. 2 is a flow diagram of a website data crawling method in accordance with one or more embodiments.

FIG. 3 is a timing diagram of a website data crawling method in accordance with one or more embodiments.

FIG. 4 is a flow diagram of a segmented crawling step in accordance with one or more embodiments.

FIG. 5 is a flow chart of step S210 in the embodiment shown in FIG. 2.

FIG. 6 is a block diagram of a website data crawling device in accordance with one or more embodiments.

FIG. 7 is a block diagram of a crawler terminal in accordance with one or more embodiments.

Detailed description

In order to make the technical solutions and advantages of the present application more clear, the present application will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the application and are not intended to be limiting.

Referring to FIG. 1, FIG. 1 is an application environment diagram of a website data crawling method according to an embodiment. The environment includes a server of a target website and a crawler terminal connected over the Internet. The crawler terminal may include a URL crawling end, an INFO crawling end, a Format crawling end, and a database, and the database may include application data and an index of the search engine (identifiers of the target websites). In a first crawl, first, the operator selects the target websites to be crawled and enters them into the station source table sitelist; the URL crawling end then reads the station source table sitelist, stores it in a map, and formulates regular-expression parsing rules for the sites in the station source table. Second, according to the formulated parsing rules, the URL crawling end crawls the corresponding URL list. Third, the INFO crawling end reads each URL and its corresponding XPath rule from the URL list in the database (XPath, the XML Path Language, is a language used to locate parts of an XML document), crawls the web page corresponding to each URL, extracts the valuable resources according to the XPath rule, and stores the extracted resources in the original data table originalresource. Finally, the Format crawling end extracts data from the original data table originalresource in the database, further regularizes and aggregates it, and stores the result in the regular content table.
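The extraction performed by the INFO crawling end can be sketched as follows. This is only an illustration under stated assumptions: the page content, the XPath rule, and the function name extract_resources are hypothetical, and the page is assumed to be well-formed XML so that the limited XPath subset of Python's standard xml.etree.ElementTree module is sufficient.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical page content; a real crawler would download this from the URL.
PAGE = """<html><body><table>
<tr class="bill"><td>2017-07-01</td><td>12.50</td></tr>
<tr class="bill"><td>2017-07-02</td><td>8.00</td></tr>
</table></body></html>"""


def extract_resources(page_source: str, xpath_rule: str) -> List[List[str]]:
    """Apply one XPath rule to a page and return the cell texts of each match."""
    root = ET.fromstring(page_source)
    # Each matched element is a table row; its children are the cells.
    return [[cell.text for cell in row] for row in root.findall(xpath_rule)]


# Assumed rule: every table row marked with class "bill".
rows = extract_resources(PAGE, ".//tr[@class='bill']")
```

In this sketch the extracted rows would then be written to the original data table; storage is omitted because the source does not describe the table schema.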

Referring to FIG. 2, in one embodiment, a website data crawling method is provided. This embodiment is described by taking as an example the application of the method to the crawler terminal in the application environment of FIG. 1. The crawler terminal runs website data crawling readable instructions and implements the website data crawling method through those instructions. The method specifically includes the following steps:

S202: Obtain a data identifier and a date of generation of the website data to be crawled.

Specifically, the website data to be crawled is the data displayed in the webpage, which may be billing data, shopping record data, test data, etc., and is not limited herein.

The data identifier of the website data to be crawled is an identifier that uniquely determines that website data. It may be determined from, for example, the URL address of the website to which the website data belongs and the user name. For example, when the website data to be crawled is billing data, the data identifier may be generated from the website URL address, the user name, and the billing identifier. When the website data to be crawled is a shopping record, the data identifier may be generated from the website URL address, the seller name, and the buyer account.
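One possible way to derive such an identifier is sketched below. The source does not specify how the parts are combined, so hashing the joined parts with SHA-256 is an assumption, and the function name make_data_identifier and the sample values are hypothetical.

```python
import hashlib


def make_data_identifier(site_url: str, user_name: str, *extra_parts: str) -> str:
    """Combine the parts that determine a data set into one stable identifier."""
    # Assumption: the unit separator character does not occur in any part.
    raw = "\x1f".join((site_url, user_name, *extra_parts))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Billing data: website URL address, user name, and billing identifier.
bill_id = make_data_identifier("https://bank.example/bills", "alice", "bill-001")
# Shopping record: website URL address, seller name, and buyer account.
order_id = make_data_identifier("https://shop.example/orders", "seller-a", "buyer-b")
```

The same inputs always yield the same identifier, which is what allows the crawler terminal to look up previously stored data for the same request.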

The generation date of the website data to be crawled is the date to which the website data relates. It may be a specific day, month or year, or a date range, for example from June 1 to September 1. For example, when the website data to be crawled is billing data, the generation date is the billing date. When the website data to be crawled is shopping record data, the generation date is the date on which the order was placed; when multiple shopping records are involved, there may be multiple generation dates.

S204: Obtain the generation date of the locally stored website data corresponding to the data identifier.

Specifically, the crawler terminal stores the crawled website data locally during the previous crawl. For example, if the billing data from July 1 to August 1 was crawled last time and the current crawl requires the billing data from June 1 to September 1, then, because the billing data from July 1 to August 1 is already stored locally, the crawler terminal does not need to crawl that billing data again.

S206: When the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data.

Specifically, the generation date of the website data to be crawled being different from the generation date of the locally stored website data means that the date ranges involved are different. In the above example, the generation date of the website data to be crawled is June 1 to September 1, while the generation date of the locally stored website data is July 1 to August 1. Because the billing data from August 2 to September 1 is not stored locally, the crawler terminal can first crawl the billing data from August 2 to September 1, that is, the website data to be crawled whose generation date is before the generation date of the locally stored website data. As this example shows, "before" here refers to the more recent part of the requested range, which is crawled first.
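The division of the requested date range in this example can be sketched as follows. The sketch assumes inclusive date ranges and the year 2017 (the source gives no year); the names Segments and split_request are hypothetical. The field newer corresponds to the part that the text calls "before" the locally stored data and that is crawled first.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional, Tuple

DateRange = Tuple[date, date]  # inclusive (start, end)


class Segments(NamedTuple):
    newer: Optional[DateRange]    # more recent than the local data, crawled first
    overlap: Optional[DateRange]  # same dates as the local data
    older: Optional[DateRange]    # older than the local data, crawled last


def split_request(requested: DateRange, local: DateRange) -> Segments:
    """Split the requested range relative to the locally stored range."""
    req_start, req_end = requested
    loc_start, loc_end = local
    one_day = timedelta(days=1)
    newer = (max(req_start, loc_end + one_day), req_end) if req_end > loc_end else None
    older = (req_start, min(req_end, loc_start - one_day)) if req_start < loc_start else None
    ov_start, ov_end = max(req_start, loc_start), min(req_end, loc_end)
    overlap = (ov_start, ov_end) if ov_start <= ov_end else None
    return Segments(newer, overlap, older)


# Requested June 1 to September 1; July 1 to August 1 is stored locally.
parts = split_request(
    (date(2017, 6, 1), date(2017, 9, 1)),
    (date(2017, 7, 1), date(2017, 8, 1)),
)
```

Here parts.newer covers August 2 to September 1, parts.overlap covers July 1 to August 1, and parts.older covers June 1 to June 30.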

S208: Output the crawled website data to be crawled whose generation date is before the generation date of the locally stored website data.

Specifically, on the one hand, the crawler terminal can crawl, through a first thread, the website data to be crawled whose generation date is before the generation date of the locally stored website data, and display the crawled data to the user in real time, which keeps the data display fast and improves the user experience. On the other hand, the crawler terminal can compare, through a second thread, the format of the newly crawled website data with the format of the locally stored website data. For example, because the amount of website data to be crawled whose generation date is before the generation date of the locally stored website data may be large, the crawler terminal can crawl it in stages. It can first crawl the website data from August 25 to September 1; once that data has been crawled, the second thread is triggered to compare whether the format of the website data from August 25 to September 1 is the same as the format of the locally stored website data from July 1 to August 1, while the first thread continues to crawl the website data from August 2 to August 25.
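The cooperation of the two threads can be sketched as follows. This is a minimal illustration, not the implementation described in the source: fetch_segment is a hypothetical stand-in for the real crawl, the field names are invented, and a threading.Event is one assumed way of triggering the second thread once the first stage has been crawled.

```python
import threading
from typing import Dict, List, Optional, Sequence, Set, Tuple

Record = Dict[str, str]


def fetch_segment(segment: str) -> List[Record]:
    """Hypothetical stand-in for crawling one date segment of the target site."""
    return [{"name": "n", "payee": "p", "payment_time": segment, "payment_amount": "0"}]


def crawl_and_compare(
    segments: Sequence[str], local_fields: Set[str]
) -> Tuple[List[Record], Optional[bool]]:
    displayed: List[Record] = []    # stands in for data shown to the user
    first_batch: List[Record] = []
    first_batch_ready = threading.Event()
    result: Dict[str, Optional[bool]] = {"format_same": None}

    def crawl_worker() -> None:     # first thread: crawl and display stage by stage
        try:
            for index, segment in enumerate(segments):
                rows = fetch_segment(segment)
                displayed.extend(rows)
                if index == 0:
                    first_batch.extend(rows)
                    first_batch_ready.set()  # trigger the comparison
        finally:
            first_batch_ready.set()          # never leave the second thread waiting

    def compare_worker() -> None:   # second thread: compare formats
        first_batch_ready.wait()
        if first_batch:
            result["format_same"] = set(first_batch[0]) == local_fields

    threads = [threading.Thread(target=crawl_worker), threading.Thread(target=compare_worker)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return displayed, result["format_same"]


shown, same = crawl_and_compare(
    ["Aug 25 - Sep 1", "Aug 2 - Aug 25"],
    {"name", "payee", "payment_time", "payment_amount"},
)
```

The comparison starts as soon as the first stage is available, so the first thread is not blocked while the formats are checked.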

S210: Compare the format of the crawled website data to be crawled with the format of the locally stored website data.

Specifically, the format of the website data to be crawled refers to its display format. For example, the data may be displayed in a table that includes five fields. Comparing the format of the website data to be crawled with the format of the locally stored website data determines whether the locally stored website data is dirty data: only when the format of the website data to be crawled on the target website is consistent with the format of the locally stored website data is the locally stored website data determined to be valid data, which can then be output and displayed directly for the user to view.

S212: When the format of the crawled website data to be crawled is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.

Specifically, when the format of the crawled website data to be crawled is the same as the format of the locally stored website data, the website to which the website data to be crawled belongs has not changed and its data format is unchanged, so the locally stored website data can be output directly. This reduces the amount of data the crawler terminal crawls and thereby improves the output display speed of the crawled data.

With the above website data crawling method, device, computer device and readable storage medium, before the data to be crawled is crawled, the locally stored website data is first obtained according to the data identifier. When the generation date of the locally stored website data differs from that of the website data to be crawled, the part of the data whose generation date is before that of the locally stored website data is crawled and output first. When the format of the crawled data is the same as the format of the locally stored website data, the website data to be crawled that has the same generation date as the locally stored website data no longer needs to be crawled; instead, the locally stored website data is output directly, which reduces the amount of data crawled and thereby improves the output display speed of the crawled data.

In one embodiment, the website data crawling method may further include: when the format of the crawled website data to be crawled is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and outputting the crawled website data to be crawled whose generation date is the same as the generation date of the locally stored website data.

In this embodiment, the format of the crawled website data is first compared with the format of the locally stored website data, and when the two formats differ, crawling continues with the website data to be crawled whose generation date is the same as that of the locally stored website data. The user can thus view the displayed website data in real time, and data is crawled only as needed, which improves crawling efficiency.

In one embodiment, the website data crawling method may further include: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data; and outputting the crawled website data to be crawled whose generation date is after the generation date of the locally stored website data.

In this embodiment, when the website data to be crawled includes both website data whose generation date is after the generation date of the locally stored website data and website data whose generation date is before it, the website data whose generation date is before the generation date of the locally stored website data is crawled first, and the website data whose generation date is after it is crawled afterwards. Because the website data is crawled in segments, the user can view the displayed website data in real time and crawling efficiency is improved.

Referring to FIG. 3, FIG. 3 is a sequence diagram of a website data crawling method according to an embodiment, in which the website data crawling method proceeds as follows.

First, the user terminal sends a crawl request to the crawler terminal, for example a request to crawl the billing data from June 1 to September 1. The crawler terminal first queries the local database for stored billing data. If the billing data stored in the local database covers July 1 to August 1, the crawler terminal first crawls the billing data from August 2 to September 1 from the billing page and returns the crawled billing data to the user terminal through the first thread.

The crawler terminal then compares, through the second thread, the format of the crawled billing data with the format of the locally stored billing data. If the two formats differ, the billing data stored in the local database is marked as dirty data, the billing data from July 1 to August 1 is crawled again, and the crawled billing data is sent to the user terminal. If the two formats are the same, the billing data stored in the local database is sent directly to the user terminal; that is, the billing data from July 1 to August 1 no longer needs to be crawled again.

Finally, the crawler terminal determines whether all of the billing data to be crawled has been crawled, that is, whether any uncrawled billing data remains, such as the billing data from June 1 to June 30 in this embodiment. If so, it continues to crawl the billing data from June 1 to June 30 and returns the crawled billing data to the user terminal.
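The sequence above can be sketched as a generator that yields each batch in the order the user terminal would receive it. This is a simplified, single-threaded illustration: the function serve_request, the string segment labels, and the crawl callback are hypothetical, and treating the local data as unusable when nothing newer was crawled is an assumption the source does not address.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, str]


def serve_request(
    newer: Optional[str],
    overlap: Optional[str],
    older: Optional[str],
    local_rows: List[Record],
    crawl: Callable[[str], List[Record]],
) -> Iterator[Tuple[str, List[Record]]]:
    """Yield (source, rows) batches in the order they are sent to the user."""
    # 1. Crawl and return the segment that is not stored locally and comes first.
    crawled_newer = crawl(newer) if newer else []
    if crawled_newer:
        yield ("crawled", crawled_newer)
    # 2. Decide whether the locally stored data can be sent directly.
    if overlap:
        usable = bool(crawled_newer and local_rows) and (
            set(crawled_newer[0]) == set(local_rows[0])
        )
        if usable:
            yield ("local", local_rows)
        else:
            yield ("crawled", crawl(overlap))  # local copy is treated as dirty
    # 3. Crawl whatever remains uncrawled.
    if older:
        yield ("crawled", crawl(older))


def fake_crawl(segment: str) -> List[Record]:
    """Hypothetical crawl returning one bill per segment."""
    return [{"payee": "shop", "amount": "1.00", "period": segment}]


local = [{"payee": "shop", "amount": "2.00", "period": "Jul 1 - Aug 1"}]
steps = list(
    serve_request("Aug 2 - Sep 1", "Jul 1 - Aug 1", "Jun 1 - Jun 30", local, fake_crawl)
)
sources = [source for source, _ in steps]
```

With matching fields, only the first and last segments are crawled and the middle segment is served from local storage.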

In the above embodiment, the website data to be crawled is divided into three parts: website data whose generation date is before the generation date of the locally stored website data, website data whose generation date is the same as that of the locally stored website data, and website data whose generation date is after that of the locally stored website data. The crawler terminal first crawls the website data whose generation date is before the generation date of the locally stored website data, that is, the billing data from August 2 to September 1. It then compares the format of the crawled website data with the format of the locally stored website data to determine whether the locally stored website data can be used directly, that is, whether the locally stored website data is dirty data. When the format of the website data on the target website changes, the format of the locally stored website data differs from that of the website data to be crawled; in particular, when a field is added to the website data to be crawled, the locally stored website data lacks certain information. The format of the locally stored website data therefore needs to be checked before that data is used directly. When the two formats are the same, the locally stored website data is sent directly to the user terminal for display. When there is also website data to be crawled whose generation date is after the generation date of the locally stored website data, crawling continues with that data and the crawled website data is sent to the user terminal. This reduces the amount of data crawled and thereby improves the output display speed of the crawled data.

In one embodiment, referring to FIG. 4, FIG. 4 is a flowchart of a segmented crawling step in an embodiment. The website data crawling method further includes a segmented crawling step, which can be applied when crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data, the website data to be crawled whose generation date is the same as the generation date of the locally stored website data, and the website data to be crawled whose generation date is after the generation date of the locally stored website data. This embodiment is described by taking the website data to be crawled whose generation date is the same as that of the locally stored website data as an example. The segmented crawling step may include:

S402: When the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, crawl that website data segment by segment in sequence.

Specifically, the preset length is a length of website data to be crawled, where one piece of data counts as one unit of length. For example, if a bill stores 10 pieces of data, its data length is 10. The preset length is set according to the amount of data that the crawler terminal can read at one time or the amount of data that the web interface of the user terminal can display at one time. For example, the preset length can be set to 10, 12 or 15, and is not limited here.

The above example is continued here. Suppose the billing data whose generation date is the same as that of the locally stored website data covers July 1 to August 1 and contains 35 pieces of data. The crawler terminal crawls in order of generation date, starting with the most recent data: it first crawls the 10 pieces of billing data from July 25 to August 1, then the 10 pieces from July 15 to July 24, then the 10 pieces from July 5 to July 14, and finally the 5 pieces from July 1 to July 4.

S404: Output, segment by segment, the crawled website data to be crawled whose generation date is the same as the generation date of the locally stored website data.

Specifically, the crawler terminal outputs billing data as soon as it has been crawled. For example, when the crawler terminal has crawled the 10 pieces of billing data from July 25 to August 1, it sends those 10 pieces to the user terminal for display. It then crawls the 10 pieces of billing data from July 15 to July 24 and sends them to the user terminal for display, and so on, until the crawl is complete. Alternatively, the crawler terminal can crawl the 10 pieces of billing data from July 25 to August 1 through one thread and send them to the user terminal for display through another thread, while the original thread continues to crawl the 10 pieces of billing data from July 15 to July 24. When the original thread has crawled the 10 pieces of billing data from July 15 to July 24, the other thread sends them to the user terminal for display, and so on, until the crawl is complete.
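The segmentation in steps S402 and S404 can be sketched with a generator that hands out one segment at a time, so that each segment can be displayed before the next one is crawled. The sketch assumes the record keys are already known and ordered from newest to oldest; the name iter_segments and the placeholder record labels are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_segments(records_newest_first: Sequence[T], preset_length: int) -> Iterator[List[T]]:
    """Yield consecutive segments of at most preset_length records."""
    if preset_length <= 0:
        raise ValueError("preset_length must be positive")
    for start in range(0, len(records_newest_first), preset_length):
        yield list(records_newest_first[start:start + preset_length])


# 35 pieces of billing data with a preset length of 10, as in the example above.
bills = [f"bill-{n}" for n in range(35)]
segment_sizes = [len(segment) for segment in iter_segments(bills, 10)]
```

For the 35 pieces in the example this produces four segments of 10, 10, 10 and 5 pieces.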

In the above embodiment, segmented crawling is adopted: the website data to be crawled is crawled while the website data already crawled is sent to the user terminal for display, which balances user experience and crawling efficiency.

In one embodiment, referring to FIG. 5, FIG. 5 is a flowchart of step S210 in the embodiment shown in FIG. 2. Step S210, that is, the step of comparing the format of the crawled website data to be crawled with the format of the locally stored website data, may include:

S502: Match the field of the crawled website data to be crawled with the field of the locally stored website data.

Specifically, the fields of the website data to be crawled are the items of content that the website data contains. For example, a piece of billing data may involve a name, a payee, a payment time, a payment amount, and so on. The fields of the crawled website data are matched against the fields of the locally stored website data. For example, if the fields of the website data to be crawled are name, payee, payment time, payment amount, and reason, while the fields of the locally stored website data are name, payee, payment time, and payment amount, the fields of the crawled website data are considered not to match the fields of the locally stored website data. That is, unless the two sets of fields are identical, the fields are considered not to match.

S504: When the field of the crawled website data that is crawled matches the field of the locally stored website data, the format of the crawled website data that is crawled is the same as the format of the locally stored website data.

S506: When the field of the crawled website data that is crawled does not match the field of the locally stored website data, the format of the crawled website data that is crawled is different from the format of the locally stored website data.

Specifically, when the fields of the crawled website data match the fields of the locally stored website data, that is, when the two sets of fields are exactly the same, the locally stored website data is usable data. The locally stored website data is therefore sent directly to the user terminal for display, and the website data does not need to be crawled again. When the fields of the crawled website data do not match the fields of the locally stored website data, that is, when the two sets of fields are not exactly the same, the locally stored website data is dirty data. The crawler terminal therefore needs to crawl the data to be crawled and send the crawled website data to the user terminal for display.

In the above embodiment, whether the format of the crawled website data is the same as the format of the locally stored website data is determined by checking whether the fields of the crawled website data match the fields of the locally stored website data, so the judgment logic is simple.
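The field matching of steps S502 to S506 can be sketched as a comparison of field-name sets. Representing each record as a dictionary and ignoring field order are assumptions; the function name same_format and the sample values are hypothetical.

```python
from typing import Dict


def same_format(crawled_record: Dict[str, str], local_record: Dict[str, str]) -> bool:
    """Formats are the same only if both records have exactly the same fields."""
    return set(crawled_record) == set(local_record)


local_bill = {"name": "A", "payee": "B", "payment_time": "07-01", "payment_amount": "10"}
# The target website has added a "reason" field since the local copy was stored.
crawled_bill = {**local_bill, "payment_time": "08-02", "reason": "rent"}

local_is_dirty = not same_format(crawled_bill, local_bill)
```

In this example the added field makes the locally stored bill dirty data, so the corresponding date range would be crawled again.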

It should be understood that although the steps in the flowcharts of FIGS. 2-5 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-5 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

Referring to FIG. 6, FIG. 6 is a block diagram of a website data crawling device in an embodiment, where the website data crawling device includes:

The obtaining module 100 is configured to obtain a data identifier and a generation date of the website data to be crawled, and to obtain the generation date of the locally stored website data corresponding to the data identifier.

The crawling module 200 is configured to: when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data.

The first output module 300 is configured to output the crawled website data to be crawled whose generation date is before the generation date of the locally stored website data.

The comparison module 400 is configured to compare the format of the crawled website data to be crawled with the format of the locally stored website data.

The second output module 500 is configured to: when the format of the crawled website data to be crawled is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.

In one embodiment, the crawling module 200 can also be configured to, when the format of the crawled website data to be crawled is different from the format of the locally stored website data, continue to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data.

The first output module 300 is further configured to output the crawled website data to be crawled whose generation date is the same as the generation date of the locally stored website data.

In one embodiment, the crawling module 200 can also be configured to, when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continue to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data.

The second output module 500 is further configured to output the crawled website data to be crawled whose generation date is after the generation date of the locally stored website data.

In one embodiment, the crawling module 200 can also be configured to, when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, crawl that website data segment by segment in sequence; the crawled website data to be crawled whose generation date is the same as the generation date of the locally stored website data is output segment by segment.

In one embodiment, the comparison module 400 is further configured to match the fields of the crawled website data to be crawled with the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.

For the specific definition of the website data crawling device, refer to the above definition of the website data crawling method, and details are not described herein again. The various modules in the above website data crawling device may be implemented in whole or in part by software, hardware, and combinations thereof. Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a crawler terminal, and its internal structure diagram may be as shown in FIG. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a website data crawling method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen; a button, trackball, or touchpad provided on the casing of the computer device; or an external keyboard, trackpad, or mouse.

It will be understood by those skilled in the art that the structure shown in FIG. 7 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, may combine some components, or may have a different arrangement of components.

A computer device comprises a memory and one or more processors, the memory storing computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: obtaining a data identifier and a generation date of website data to be crawled; obtaining a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data; outputting the crawled website data whose generation date is before the generation date of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
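The steps above can be sketched, under stated assumptions, as a single function. The sketch assumes that both the website and the local store can be modeled as dictionaries mapping a generation date to a record, that the "generation date" of each side is its most recent date, and that formats are compared by field names. The function name `crawl_once` and the sample records are hypothetical, and this is one possible reading of the steps rather than the implementation of the application.

```python
from datetime import date


def crawl_once(remote, local):
    """Run one pass of the flow; both arguments map generation date -> record."""
    local_date = max(local)    # generation date of the locally stored data
    remote_date = max(remote)  # generation date of the data to be crawled
    output = []
    if remote_date == local_date:
        return output  # the dates agree, so this sketch crawls nothing
    # Crawl and output the records generated before the local date.
    earlier = [remote[d] for d in sorted(remote) if d < local_date]
    output.extend(earlier)
    # Compare formats by field names (an assumption of this sketch).
    local_fields = set(local[local_date])
    if earlier and all(set(r) == local_fields for r in earlier):
        # Same format: the local copy for the shared date is reused.
        output.append(local[local_date])
    return output


remote = {
    date(2017, 7, 24): {"date": "2017-07-24", "value": 1.01},
    date(2017, 7, 25): {"date": "2017-07-25", "value": 1.02},
    date(2017, 7, 26): {"date": "2017-07-26", "value": 1.03},
}
local = {date(2017, 7, 25): {"date": "2017-07-25", "value": 1.02}}
result = crawl_once(remote, local)
print(result)
```

In this reading, the earlier data serves as a format probe: when it matches the local format, the local copy is reused instead of being crawled again.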

In one embodiment, the processor, when executing the computer readable instructions, may further implement the following steps: when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
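The branch on the format comparison might be sketched as follows. The sketch assumes dictionaries keyed by generation date for both the website and the local store; `same_date_record` and the sample field names are hypothetical.

```python
def same_date_record(remote, local, local_date, formats_match):
    """Pick the source of the record generated on local_date."""
    if formats_match:
        return local[local_date]  # formats agree: the local copy is reused
    return remote[local_date]     # formats differ: the record is crawled again


# Hypothetical case in which the website has renamed its fields.
remote = {"2017-07-25": {"day": "2017-07-25", "nav": 1.02}}
local = {"2017-07-25": {"date": "2017-07-25", "value": 1.02}}
reused = same_date_record(remote, local, "2017-07-25", formats_match=True)
recrawled = same_date_record(remote, local, "2017-07-25", formats_match=False)
print(reused, recrawled)
```

Crawling the same-date data again when the formats differ keeps the output in the website's current format.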

In one embodiment, the processor, when executing the computer readable instructions, may further implement the following steps: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and outputting the crawled website data whose generation date is after the generation date of the locally stored website data.

In one embodiment, the processor, when executing the computer readable instructions, may further implement the following steps: when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawling that website data in segments; and outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.

In one embodiment, the processor, when executing the computer readable instructions, may further implement the following steps: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.

One or more non-volatile storage media store computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining a data identifier and a generation date of website data to be crawled; obtaining a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data; outputting the crawled website data whose generation date is before the generation date of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.

In one embodiment, the computer readable instructions, when executed by the processor, may further implement the following steps: when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.

In one embodiment, the computer readable instructions, when executed by the processor, may further implement the following steps: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and outputting the crawled website data whose generation date is after the generation date of the locally stored website data.

In one embodiment, the computer readable instructions, when executed by the processor, may further implement the following steps: when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawling that website data in segments; and outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.

In one embodiment, the computer readable instructions, when executed by the processor, may further implement the following steps: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.

One of ordinary skill in the art can understand that all or part of the processes of the above embodiments can be completed by computer readable instructions, which can be stored in a non-volatile computer readable storage medium and which, when executed, may include the flows of the embodiments of the methods described above. Any reference to a memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above-mentioned embodiments merely illustrate several implementations of the present application. Although they are described in specific detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make a number of variations and modifications without departing from the spirit and scope of the present application. Therefore, the scope of protection of the invention should be determined by the appended claims.

Claims

A website data crawling method, comprising:

obtaining a data identifier and a generation date of website data to be crawled;

obtaining a generation date of locally stored website data corresponding to the data identifier;

when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;

outputting the crawled website data whose generation date is before the generation date of the locally stored website data;

comparing the format of the crawled website data with the format of the locally stored website data; and

when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
The method according to claim 1, further comprising:

when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and

outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The method according to claim 2, further comprising:

when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and

outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
The method according to claim 2, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data comprises:

when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawling that website data in segments; and

outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The method according to claim 1, wherein the comparing the format of the crawled website data with the format of the locally stored website data comprises:

matching the fields of the crawled website data against the fields of the locally stored website data;

when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and

when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
A website data crawling device, comprising:

an obtaining module, configured to obtain a data identifier and a generation date of website data to be crawled, and to obtain a generation date of locally stored website data corresponding to the data identifier;

a crawling module, configured to, when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data;

a first output module, configured to output the crawled website data whose generation date is before the generation date of the locally stored website data;

a comparison module, configured to compare the format of the crawled website data with the format of the locally stored website data; and

a second output module, configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
The device according to claim 6, wherein the crawling module is further configured to, when the format of the crawled website data is different from the format of the locally stored website data, continue crawling the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and

the first output module is further configured to output the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The device according to claim 7, wherein the crawling module is further configured to, when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continue crawling that website data; and

the second output module is further configured to output the crawled website data whose generation date is after the generation date of the locally stored website data.
The device according to claim 7, wherein the crawling module is further configured to, when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawl that website data in segments, and to output, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The device according to claim 6, wherein the comparison module is further configured to match the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
A computer device, comprising a memory and one or more processors, the memory storing computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

obtaining a data identifier and a generation date of website data to be crawled;

obtaining a generation date of locally stored website data corresponding to the data identifier;

when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;

outputting the crawled website data whose generation date is before the generation date of the locally stored website data;

comparing the format of the crawled website data with the format of the locally stored website data; and

when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
The computer device according to claim 11, wherein the one or more processors, when executing the computer readable instructions, further perform the following steps:

when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and

outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The computer device according to claim 12, wherein the one or more processors, when executing the computer readable instructions, further perform the following steps:

when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and

outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
The computer device according to claim 12, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data, as performed when the one or more processors execute the computer readable instructions, comprises:

when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawling that website data in segments; and

outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The computer device according to claim 11, wherein the comparing the format of the crawled website data with the format of the locally stored website data, as performed when the one or more processors execute the computer readable instructions, comprises:

matching the fields of the crawled website data against the fields of the locally stored website data;

when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and

when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
One or more non-transitory computer readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

obtaining a data identifier and a generation date of website data to be crawled;

obtaining a generation date of locally stored website data corresponding to the data identifier;

when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;

outputting the crawled website data whose generation date is before the generation date of the locally stored website data;

comparing the format of the crawled website data with the format of the locally stored website data; and

when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
The storage medium according to claim 16, wherein the computer readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:

when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and

outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The storage medium according to claim 17, wherein the computer readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:

when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and

outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
The storage medium according to claim 17, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data, as performed when the computer readable instructions are executed by the one or more processors, comprises:

when the length of the website data to be crawled whose generation date is the same as the generation date of the locally stored website data is greater than a preset length, sequentially crawling that website data in segments; and

outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
The storage medium according to claim 16, wherein the comparing the format of the crawled website data with the format of the locally stored website data, as performed when the computer readable instructions are executed by the one or more processors, comprises:

matching the fields of the crawled website data against the fields of the locally stored website data;

when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and

when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.