WO2019019673A1 - Website data crawling method and apparatus, computer device and readable storage medium - Google Patents

Website data crawling method and apparatus, computer device and readable storage medium

Info

Publication number
WO2019019673A1
Authority
WO
WIPO (PCT)
Prior art keywords
crawled
website data
date
data
website
Prior art date
Application number
PCT/CN2018/080126
Other languages
French (fr)
Chinese (zh)
Inventor
李江华
李武奇
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2019019673A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The application relates to a website data crawling method and apparatus, a computer device, and a readable storage medium.
  • Crawling technology acquires and parses webpage information through a URL link address, extracts all URL link addresses contained in it, obtains further webpage information through the extracted addresses, and repeats this loop.
  • A website data crawling method and apparatus, a computer device, and a readable storage medium are provided.
  • A website data crawling method, including:
  • outputting the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • A website data crawling apparatus, comprising:
  • an obtaining module configured to acquire a data identifier and a generation date of the website data to be crawled, and to obtain the generation date of the locally stored website data corresponding to the data identifier;
  • a crawling module configured to, when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before that of the locally stored website data;
  • a first output module configured to output the crawled website data whose generation date is before that of the locally stored website data;
  • a comparison module configured to compare the format of the crawled website data with the format of the locally stored website data; and
  • a second output module configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
  • outputting the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • One or more non-transitory computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • outputting the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • FIG. 1 is an application environment diagram of a website data crawling method according to one or more embodiments.
  • FIG. 2 is a flow diagram of a website data crawling method in accordance with one or more embodiments.
  • FIG. 3 is a timing diagram of a website data crawling method in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of a segmentation crawling step in accordance with one or more embodiments.
  • FIG. 5 is a flow chart of step S210 in the embodiment shown in FIG. 2.
  • FIG. 6 is a block diagram of a website data crawler in accordance with one or more embodiments.
  • FIG. 7 is a block diagram of a crawler terminal in accordance with one or more embodiments.
  • FIG. 1 is an application environment diagram of a website data crawling method according to an embodiment. The environment includes a server of a target website and a crawler terminal on the Internet. The crawler terminal may include a URL crawler, an INFO crawler, a Format crawler, and a database.
  • The database may hold application data and the search engine's index (the identity of the target website).
  • The operator selects the target website to be crawled and enters it into the station source table sitelist. The URL crawler then reads sitelist, stores it in a map, and formulates regular-expression parsing rules for the sites in the table.
  • The URL crawler crawls the corresponding URL list.
  • The INFO crawler reads each URL and its corresponding XPath rule from the URL list in the database (XPath, the XML Path Language, is a language for locating parts of an XML document), crawls the web page at each URL, extracts the valuable resources according to the XPath rule, and stores them in the original data table originalresource.
  • The Format crawler reads data from the original data table originalresource, regularizes and aggregates it further, and finally stores it in the regular content table.
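  • The three crawler stages described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory PAGES dictionary stands in for the target website, the URLs, the XPath rule, and the function names url_crawler, info_crawler, and format_crawler are hypothetical, and Python's xml.etree.ElementTree is used because it supports a limited subset of XPath on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the target website: URL -> well-formed page markup.
PAGES = {
    "https://site.example/bill/1": "<html><body><span class='amount'> 12.50 </span></body></html>",
    "https://site.example/bill/2": "<html><body><span class='amount'>7.00</span></body></html>",
}

# Station source table: site -> XPath rule for the resource worth keeping (assumed rule).
sitelist = {"https://site.example": ".//span[@class='amount']"}


def url_crawler(sites, pages):
    """Build the URL list: each page URL paired with its site's XPath rule."""
    return [(url, rule)
            for site, rule in sites.items()
            for url in sorted(pages) if url.startswith(site)]


def info_crawler(url_list, pages):
    """Load each page and extract raw resources according to its XPath rule."""
    originalresource = []
    for url, rule in url_list:
        root = ET.fromstring(pages[url])
        for node in root.findall(rule):
            originalresource.append({"url": url, "raw": node.text})
    return originalresource


def format_crawler(originalresource):
    """Regularize the raw values into rows of the regular content table."""
    return [{"url": row["url"], "amount": float(row["raw"].strip())}
            for row in originalresource]


regular_content = format_crawler(info_crawler(url_crawler(sitelist, PAGES), PAGES))
```

  • Keeping the stages as separate functions mirrors the separation of the URL list, the original data table, and the regular content table in the description; a real crawler would replace PAGES with network requests and a full XPath engine.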
  • A website data crawling method is provided.
  • This embodiment is illustrated by applying the method to the crawler terminal in the application environment of FIG. 1.
  • The crawler terminal runs website data crawling readable instructions, through which the website data crawling method is implemented.
  • The method specifically includes the following steps:
  • S202: Obtain a data identifier and a generation date of the website data to be crawled.
  • The website data to be crawled is the data displayed in a webpage, which may be billing data, shopping record data, test data, and so on; it is not limited herein.
  • The data identifier of the website data to be crawled is an identifier that uniquely determines that data. It may be determined by the URL address of the website to which the data belongs, the user name, and the like. For example, when the website data to be crawled is billing data, the data identifier may be generated from the website URL address, the user name, and the billing identifier. When the website data to be crawled is a shopping record, the data identifier may be generated from the website URL address, the seller name, and the buyer account.
  • The generation date of the website data to be crawled is the date covered by the crawl. It may be a specific day, month, or year, or a date range, for example from June 1 to September 1.
  • For billing data, the generation date of the website data to be crawled is the billing date.
  • For shopping records, the generation date is the date the order was placed; when multiple shopping records are involved, there may be multiple generation dates.
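  • One possible way to represent the data identifier and the generation date from S202 is sketched below. The source does not say how the identifier is generated, so hashing the joined fields with SHA-256 is an assumption, as are the names make_data_id and DateRange, the example URL and user name, and the year 2017.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateRange:
    """Inclusive range of generation dates covered by a crawl."""
    start: date
    end: date


def make_data_id(site_url: str, user_name: str, record_key: str) -> str:
    """Derive a stable identifier from the fields that determine the data.

    The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    """
    key = "\x1f".join((site_url, user_name, record_key))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Billing data: website URL + user name + billing identifier (all values hypothetical).
bill_id = make_data_id("https://bank.example/bills", "alice", "bill-2017")
requested = DateRange(date(2017, 6, 1), date(2017, 9, 1))
```

  • Because the identifier depends only on its inputs, the same request always maps to the same locally stored entry, which is what S204 relies on.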
  • S204: Obtain the generation date of the locally stored website data corresponding to the data identifier.
  • The crawler terminal stores crawled website data locally during each crawl. For example, if the previous crawl fetched the billing data from July 1 to August 1 and the current crawl needs the billing data from June 1 to September 1, the billing data from July 1 to August 1 is already stored locally, so the crawler terminal does not need to crawl it again.
  • The generation date of the website data to be crawled being different from the generation date of the locally stored website data means that the date ranges involved are different.
  • For example, the generation date of the website data to be crawled is June 1 to September 1, while the generation date of the locally stored website data is July 1 to August 1. Because the billing data from August 2 to September 1 is not stored locally, the crawler terminal can first crawl the billing data from August 2 to September 1, that is, the website data to be crawled whose generation date is before that of the locally stored website data ("before" here denoting the more recent part of the range, which is crawled first).
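  • The date-range comparison in this example can be sketched as follows. The sketch assumes, as the example suggests, that the "before" part is the more recent one (August 2 to September 1) and the "after" part is the older one; the names split_request and Split, the handling of non-overlapping ranges, and the year 2017 are hypothetical.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional, Tuple

Range = Tuple[date, date]  # inclusive (start, end)


class Split(NamedTuple):
    newer: Optional[Range]    # the "before" part in the text: crawled first
    overlap: Optional[Range]  # same generation dates as the local data
    older: Optional[Range]    # the "after" part in the text: crawled last


def split_request(requested: Range, local: Range) -> Split:
    """Divide the requested range relative to the locally stored range."""
    req_start, req_end = requested
    loc_start, loc_end = local
    one_day = timedelta(days=1)
    if loc_end < req_start or loc_start > req_end:
        # Assumption: with no overlap, nothing local is reusable, so crawl everything.
        return Split(newer=requested, overlap=None, older=None)
    ov_start, ov_end = max(req_start, loc_start), min(req_end, loc_end)
    newer = (ov_end + one_day, req_end) if ov_end < req_end else None
    older = (req_start, ov_start - one_day) if req_start < ov_start else None
    return Split(newer, (ov_start, ov_end), older)


parts = split_request((date(2017, 6, 1), date(2017, 9, 1)),
                      (date(2017, 7, 1), date(2017, 8, 1)))
```

  • For the ranges in the example, parts.newer covers August 2 to September 1, parts.overlap covers July 1 to August 1, and parts.older covers June 1 to June 30.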
  • S208: Output the crawled website data whose generation date is before that of the locally stored website data.
  • The crawler terminal can use a first thread to crawl the website data whose generation date is before that of the locally stored website data and display the crawled data to the user in real time, which keeps the display fast and improves the user experience.
  • The crawler terminal can use a second thread to compare the format of the newly crawled website data with the format of the locally stored website data. For example, because the amount of website data whose generation date is before that of the locally stored website data may be large, the crawler terminal can crawl it in stages: it can first crawl the website data from August 25 to September 1. Once that data has been crawled, the second thread is triggered to check whether its format is the same as the format of the website data stored locally for July 1 to August 1, while the first thread continues to crawl the website data from August 2 to August 25.
  • S210: Compare the format of the crawled website data with the format of the locally stored website data.
  • The format of the website data to be crawled refers to its display format. For example, the data may be displayed in a table containing five fields; the check compares the format of the crawled website data with that of the locally stored website data.
  • When the format of the crawled website data is the same as the format of the locally stored website data, the website to which the data belongs has not changed and its data format is unchanged, so the locally stored website data can be output directly. This reduces the amount of data the crawler terminal has to crawl and thereby speeds up the output and display of the crawled data.
  • With the above website data crawling method, apparatus, computer device, and readable storage medium, the locally stored website data is first obtained according to the data identifier before the data to be crawled is crawled. When the generation date of the locally stored website data differs from that of the website data to be crawled, the part of the data whose generation date is before that of the locally stored data is crawled and output first. When the format of the crawled data is the same as the format of the locally stored website data, the website data covering the same dates as the locally stored website data no longer needs to be crawled; the locally stored website data is output directly instead, which reduces the amount of data crawled and thereby speeds up the output and display of the crawled data.
  • The website data crawling method may further include: when the format of the crawled website data differs from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as that of the locally stored website data; and outputting the crawled website data whose generation date is the same as that of the locally stored website data.
  • In this embodiment, the format of the website data already crawled is first compared with the format of the locally stored website data. When the two formats differ, the crawler continues with the website data whose generation date is the same as that of the locally stored website data, so that the user can view the crawled website data in real time while crawling proceeds as needed, which improves crawling efficiency.
  • The website data crawling method may further include: when there is website data to be crawled whose generation date is after that of the locally stored website data, continuing to crawl that website data.
  • When the website data to be crawled includes both data whose generation date is after that of the locally stored website data and data whose generation date is before it, the data whose generation date is before that of the locally stored website data is crawled first, and the data whose generation date is after it is crawled afterwards. Crawling the website data in segments this way lets the user view the crawled website data in real time and improves crawling efficiency.
  • FIG. 3 is a sequence diagram of a website data crawling method according to an embodiment, in which the method includes:
  • The user terminal sends a crawl request to the crawler terminal, for example to crawl the billing data from June 1 to September 1. The crawler terminal first queries the billing data stored in the local database. If the stored billing data covers July 1 to August 1, the crawler terminal first crawls the billing data from August 2 to September 1 from the billing page and returns the crawled billing data to the user terminal through the first thread.
  • The crawler terminal uses the second thread to compare the format of the crawled billing data with the format of the locally stored billing data. If the formats differ, the billing data stored in the local database is marked as dirty data, the billing data from July 1 to August 1 is crawled again, and the crawled billing data is sent to the user terminal. If the formats are the same, the billing data stored in the local database is sent directly to the user terminal; that is, the billing data from July 1 to August 1 does not need to be crawled again.
  • The crawler terminal then determines whether all the billing data to be crawled has been crawled, that is, whether any uncrawled billing data remains, such as the billing data from June 1 to June 30 in this embodiment. If so, it continues to crawl the billing data from June 1 to June 30 and returns the crawled billing data to the user terminal.
  • In this embodiment, the website data to be crawled is divided into three parts: data whose generation date is before that of the locally stored website data, data whose generation date is the same as that of the locally stored website data, and data whose generation date is after that of the locally stored website data. The crawler terminal first crawls the data whose generation date is before that of the locally stored website data, that is, the billing data from August 2 to September 1. It then determines whether the locally stored website data can be used directly by comparing the format of the crawled website data with that of the locally stored website data.
  • If the format has changed, the locally stored website data may lack certain information, so the format of the locally stored website data must be checked before that data is used directly.
  • When the formats are the same, the locally stored website data is sent directly to the user terminal for display. When website data to be crawled still remains, the crawler terminal continues to crawl it and sends the crawled website data to the user terminal. This reduces the amount of data crawled and thereby speeds up the output and display of the crawled data.
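  • The overall flow of this embodiment can be sketched as below. It is a simplified, single-threaded illustration under several assumptions that are not in the source: one record per day, a non-empty local store whose dates lie strictly inside the requested range, "format" modeled as the set of field names, and a hypothetical fetch callback standing in for crawling the billing page.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def days_desc(start, end):
    """Yield each date from end down to start (newest first)."""
    day = end
    while day >= start:
        yield day
        day -= ONE_DAY


def crawl(requested, local_store, fetch):
    """Return (output_records, fetched_count, local_was_dirty)."""
    req_start, req_end = requested
    loc_start, loc_end = min(local_store), max(local_store)

    # Step 1: crawl and output the part that is more recent than the local data.
    output = [fetch(day) for day in days_desc(loc_end + ONE_DAY, req_end)]
    fetched = len(output)

    # Step 2: compare the format (field names) of fresh data and local data.
    dirty = set(output[0]) != set(local_store[loc_end])

    # Step 3: reuse the local data if the format is unchanged, else crawl it again.
    if dirty:
        middle = [fetch(day) for day in days_desc(loc_start, loc_end)]
        fetched += len(middle)
    else:
        middle = [local_store[day] for day in days_desc(loc_start, loc_end)]

    # Step 4: crawl whatever remains on the older side of the local data.
    older = [fetch(day) for day in days_desc(req_start, loc_start - ONE_DAY)]
    fetched += len(older)
    return output + middle + older, fetched, dirty


def fetch(day):
    """Hypothetical crawler for one day's bill."""
    return {"date": day, "payee": "shop", "amount": 1.0}


local = {d: fetch(d) for d in days_desc(date(2017, 7, 1), date(2017, 8, 1))}
records, fetched, dirty = crawl((date(2017, 6, 1), date(2017, 9, 1)), local, fetch)
```

  • With an unchanged format, only the 61 days outside July 1 to August 1 are fetched while all 93 days are output; if the local records had different fields, the whole range would be crawled again.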
  • FIG. 4 is a flowchart of a segment crawling step in an embodiment.
  • The website data crawling method further includes a segment crawling step, which can be used to crawl the website data whose generation date is before that of the locally stored website data, the website data whose generation date is the same as that of the locally stored website data, and the website data whose generation date is after it.
  • This embodiment is described using the website data whose generation date is the same as that of the locally stored website data as an example.
  • The segment crawling step may include:
  • The preset length refers to a length of the website data to be crawled, where one piece of data counts as one length. For billing data, for example, if a bill stores 10 pieces of data, the data length is 10.
  • The preset length is set according to the amount of data the crawler terminal can read at one time or the amount of data the web interface of the user terminal can display at one time.
  • The preset length may be set to 10, 12, 15, and so on; it is not limited herein.
  • For example, the billing data whose generation date is the same as that of the locally stored website data runs from July 1 to August 1 and contains 35 pieces of data. The crawler terminal crawls by generation date, starting with the most recent data: it first crawls the 10 bill records from July 25 to August 1, then the 10 from July 15 to July 24, then the 10 from July 5 to July 14, and finally the 5 from July 1 to July 4.
  • S404: Output, segment by segment, the crawled website data whose generation date is the same as that of the locally stored website data.
  • The billing data is output as it is crawled. For example, when the crawler terminal has crawled the 10 bill records from July 25 to August 1, it sends them to the user terminal for display; it then crawls the 10 bill records from July 15 to July 24 and sends them to the user terminal for display, and so on, until crawling is complete.
  • The crawler terminal can also crawl the 10 bill records from July 25 to August 1 in one thread and send them to the user terminal for display through another thread while the original thread continues to crawl the 10 bill records from July 15 to July 24. The other thread then sends those records to the user terminal for display, and so on, until crawling is complete.
  • In this embodiment, the website data to be crawled is crawled in segments while the data already crawled is sent to the user terminal for display, which balances user experience and crawling efficiency.
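  • A minimal sketch of segmented crawling and output with two threads follows. It assumes the records are already available in newest-first order and uses a queue to hand segments from the crawling thread to the displaying side; the function names and the None sentinel are illustrative choices, not part of the source.

```python
import queue
import threading


def batches(records, preset_length):
    """Split records (ordered newest first) into segments of at most preset_length."""
    for i in range(0, len(records), preset_length):
        yield records[i:i + preset_length]


def crawl_segments(records, preset_length, out_queue):
    """Crawling thread: hand over one segment at a time, then a None sentinel."""
    for segment in batches(records, preset_length):
        out_queue.put(segment)
    out_queue.put(None)


def run(records, preset_length=10):
    """Crawl in one thread while the caller's thread 'displays' each segment."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=crawl_segments,
                              args=(records, preset_length, out_queue))
    worker.start()
    displayed = []
    for segment in iter(out_queue.get, None):
        displayed.append(segment)  # stand-in for sending to the user terminal
    worker.join()
    return displayed


# 35 bill records with a preset length of 10, as in the example.
segments = run(list(range(35)), preset_length=10)
```

  • The queue preserves segment order, so the user terminal would receive segments of 10, 10, 10, and 5 records while later segments are still being crawled.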
  • FIG. 5 is a flowchart of step S210 in the embodiment shown in FIG. 2.
  • Step S210, that is, the step of comparing the format of the crawled website data with the format of the locally stored website data, may include:
  • S502: Match the fields of the crawled website data against the fields of the locally stored website data.
  • The fields of the website data to be crawled are the content items involved in the crawled data. For example, a piece of billing data may involve a name, a payee, a payment time, a payment amount, and the like. The fields of the crawled website data are matched against the fields of the locally stored website data; for example, the fields of the website data to be crawled are the name, the payee, the payment time, the payment amount, and the reason, and the fields of the locally stored website data are the name, the payee, and the payment.
  • When the fields match, the locally stored website data is usable, so it is sent directly to the user terminal for display and the website data does not need to be crawled again.
  • When the fields do not match, the locally stored website data is dirty data, so the crawler terminal needs to crawl the data again and send the crawled website data to the user terminal for display.
  • In this embodiment, whether the format of the crawled website data is the same as that of the locally stored website data is determined by checking whether the fields of the crawled website data match the fields of the locally stored website data, so the judgment logic is simple.
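  • The field-matching check can be sketched in a few lines. The source does not define "match" precisely, so treating it as equality of the sets of field names is an assumption, and the record contents below are hypothetical.

```python
def same_format(crawled_record: dict, local_record: dict) -> bool:
    """Treat two records as having the same format when their field names coincide."""
    return set(crawled_record) == set(local_record)


# Hypothetical bill records: the website has since added a "reason" field.
crawled = {"name": "water", "payee": "utility", "payment_time": "2017-08-30",
           "payment_amount": 20.0, "reason": "monthly"}
stored = {"name": "water", "payee": "utility", "payment_time": "2017-07-30",
          "payment_amount": 18.0}

local_is_dirty = not same_format(crawled, stored)
```

  • Here local_is_dirty is True, so under the described method the locally stored data would be crawled again rather than reused.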
  • FIG. 6 is a block diagram of a website data crawling apparatus in an embodiment. The apparatus includes:
  • The obtaining module 100 is configured to obtain a data identifier and a generation date of the website data to be crawled, and to obtain the generation date of the locally stored website data corresponding to the data identifier.
  • The crawling module 200 is configured to, when the generation date of the website data to be crawled differs from that of the locally stored website data, crawl the website data to be crawled whose generation date is before that of the locally stored website data.
  • The first output module 300 is configured to output the crawled website data whose generation date is before that of the locally stored website data.
  • The comparison module 400 is configured to compare the format of the crawled website data with the format of the locally stored website data.
  • The second output module 500 is configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • The crawling module 200 may also be configured to, when the format of the crawled website data differs from the format of the locally stored website data, continue to crawl the website data to be crawled whose generation date is the same as that of the locally stored website data.
  • The first output module 300 is further configured to output the crawled website data whose generation date is the same as that of the locally stored website data.
  • The crawling module 200 may also be configured to, when there is website data to be crawled whose generation date is after that of the locally stored website data, continue to crawl that website data.
  • The second output module 500 is further configured to output the crawled website data whose generation date is after that of the locally stored website data.
  • The crawling module 200 may also be configured to, when the length of the website data to be crawled whose generation date is the same as that of the locally stored website data is greater than a preset length, crawl that website data in segments and output it segment by segment.
  • The comparison module 400 is further configured to match the fields of the crawled website data against the fields of the locally stored website data. When the fields match, the format of the crawled website data is the same as the format of the locally stored website data; when the fields do not match, the two formats are different.
  • the various modules in the above website data crawling device may be implemented in whole or in part by software, hardware, and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
  • A computer device is provided, which may be a crawler terminal; its internal structure may be as shown in FIG. 7.
  • the computer device includes a processor, memory, network interface, display screen, and input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for operation of an operating system and computer programs in a non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to implement a website data crawling method.
  • The display screen of the computer device may be a liquid crystal display or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the computer device, or an external keyboard, trackpad, or mouse.
  • FIG. 7 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
  • A computer device comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to perform the following steps: obtaining a data identifier and a generation date of the website data to be crawled; obtaining the generation date of the locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled differs from that of the locally stored website data, crawling the website data to be crawled whose generation date is before that of the locally stored website data; outputting the crawled website data whose generation date is before that of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and, when the two formats are the same, outputting the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • When the processor executes the computer-readable instructions, the following steps may be further implemented: when the format of the crawled website data differs from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as that of the locally stored website data; and outputting the crawled website data whose generation date is the same as that of the locally stored website data.
  • When the processor executes the computer-readable instructions, the following step may be further implemented: when there is website data to be crawled whose generation date is after that of the locally stored website data, continuing to crawl that website data.
  • When the processor executes the computer-readable instructions, the following steps may be further implemented: when the length of the website data to be crawled whose generation date is the same as that of the locally stored website data is greater than a preset length, crawling that website data in segments; and outputting it segment by segment.
  • When the processor executes the computer-readable instructions, the following steps may be further implemented: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields match, determining that the format of the crawled website data is the same as the format of the locally stored website data; and when the fields do not match, determining that the two formats are different.
  • One or more non-volatile storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining a data identifier and a generation date of the website data to be crawled; obtaining the generation date of the locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled differs from that of the locally stored website data, crawling the website data to be crawled whose generation date is before that of the locally stored website data; outputting the crawled website data whose generation date is before that of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and, when the two formats are the same, outputting the locally stored website data whose generation date is the same as that of the website data to be crawled.
  • When the computer-readable instructions are executed by the processor, the following steps may be further implemented: when the format of the crawled website data differs from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as that of the locally stored website data; and outputting the crawled website data whose generation date is the same as that of the locally stored website data.
  • When the computer-readable instructions are executed by the processor, the following step may be further implemented: when there is website data to be crawled whose generation date is after that of the locally stored website data, continuing to crawl that website data.
  • When the computer-readable instructions are executed by the processor, the following steps may be further implemented: when the length of the website data to be crawled whose generation date is the same as that of the locally stored website data is greater than a preset length, crawling that website data in segments; and outputting it segment by segment.
  • When the computer-readable instructions are executed by the processor, the following steps may be further implemented: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields match, determining that the format of the crawled website data is the same as the format of the locally stored website data; and when the fields do not match, determining that the two formats are different.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a website data crawling method, comprising: acquiring a data identifier and a generation date of website data to be crawled; acquiring a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling and outputting the website data to be crawled with a generation date earlier than the generation date of the locally stored website data; and when the format of the crawled website data to be crawled is identical to the format of the locally stored website data, outputting the locally stored website data with a generation date identical to the generation date of the website data to be crawled.

Description

Website data crawling method, apparatus, computer device and readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201710620026X, entitled "Website data crawling method, apparatus, computer device and readable storage medium", filed with the Chinese Patent Office on July 26, 2017, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to a website data crawling method, apparatus, computer device and readable storage medium.
Background
Crawling technology obtains and analyzes web page information through a URL link address, extracts all of the URL link addresses it contains, then obtains web page information through the extracted URL link addresses, and repeats this in a loop.
However, the inventors realized that conventional crawling technology crawls all of the data in a single pass and must return the results immediately. The volume of crawled data is large and the crawl takes a long time, so the crawled data is output and displayed slowly.
Summary
According to various embodiments disclosed in this application, a website data crawling method, apparatus, computer device and readable storage medium are provided.
A website data crawling method includes:
acquiring a data identifier and a generation date of website data to be crawled;
acquiring a generation date of locally stored website data corresponding to the data identifier;
when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;
outputting the crawled website data whose generation date is before the generation date of the locally stored website data;
comparing the format of the crawled website data with the format of the locally stored website data; and
when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
A website data crawling apparatus includes:
an acquisition module, configured to acquire a data identifier and a generation date of website data to be crawled, and to acquire a generation date of locally stored website data corresponding to the data identifier;
a crawling module, configured to, when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data;
a first output module, configured to output the crawled website data whose generation date is before the generation date of the locally stored website data;
a comparison module, configured to compare the format of the crawled website data with the format of the locally stored website data; and
a second output module, configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processors, cause the one or more processors to perform the following steps:
acquiring a data identifier and a generation date of website data to be crawled;
acquiring a generation date of locally stored website data corresponding to the data identifier;
when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;
outputting the crawled website data whose generation date is before the generation date of the locally stored website data;
comparing the format of the crawled website data with the format of the locally stored website data; and
when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring a data identifier and a generation date of website data to be crawled;
acquiring a generation date of locally stored website data corresponding to the data identifier;
when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;
outputting the crawled website data whose generation date is before the generation date of the locally stored website data;
comparing the format of the crawled website data with the format of the locally stored website data; and
when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings and the claims.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required by the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of the application environment of a website data crawling method according to one or more embodiments.
FIG. 2 is a flowchart of a website data crawling method according to one or more embodiments.
FIG. 3 is a sequence diagram of a website data crawling method according to one or more embodiments.
FIG. 4 is a flowchart of the segmented crawling step according to one or more embodiments.
FIG. 5 is a flowchart of step S210 in the embodiment shown in FIG. 2.
FIG. 6 is a block diagram of a website data crawling apparatus according to one or more embodiments.
FIG. 7 is a block diagram of a crawler terminal according to one or more embodiments.
Detailed description
To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it.
Referring to FIG. 1, which is a diagram of the application environment of a website data crawling method in one embodiment, the environment includes the server of a target website on the Internet and a crawler terminal. The crawler terminal may include a URL crawler, an INFO crawler, a Format crawler and a database; the database may hold application data, the search engine's index (identifiers of target websites), and so on. On the first crawl, an operator first selects the target websites to be crawled and enters them into the site source table sitelist; the URL crawler then reads sitelist, stores it in a map, and formulates regular-expression parsing rules for the sites in the table. Second, the URL crawler crawls the corresponding URL list according to those rules. Third, the INFO crawler reads each URL and its XPath rule from the URL list in the database (XPath, the XML Path Language, is a language for locating parts of an XML document), crawls every web page the URLs point to, extracts the valuable resources according to the XPath rules, and stores them in the raw data table originalresource. Finally, the Format crawler extracts data from originalresource, further normalizes and aggregates it, and stores the result in the normalized content table.
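The INFO crawler's extraction step can be illustrated with a minimal sketch. It assumes a well-formed page and uses the limited XPath subset of Python's standard `xml.etree.ElementTree`; the page content, the `extract` helper and the rule string are hypothetical and do not come from the application. Real HTML is often not well-formed XML and would typically need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed bill page used only for illustration.
PAGE = """<html><body><table>
<tr><td class="date">2017-08-02</td><td class="amount">12.50</td></tr>
<tr><td class="date">2017-08-03</td><td class="amount">7.00</td></tr>
</table></body></html>"""

def extract(page: str, rule: str) -> list:
    """Return the text of every element matched by an XPath-style rule."""
    root = ET.fromstring(page)
    return [el.text for el in root.findall(rule)]

# Rule of the kind that might be stored next to a URL in the URL list.
amounts = extract(PAGE, ".//td[@class='amount']")
```

The extracted values would then be written to the raw data table for the Format crawler to normalize.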
Referring to FIG. 2, one embodiment provides a website data crawling method. This embodiment is illustrated by applying the method to the crawler terminal in the application environment of FIG. 1. Website data crawling readable instructions run on the crawler terminal, and the method is carried out through those instructions. The method specifically includes the following steps:
S202: Acquire the data identifier and generation date of the website data to be crawled.
Specifically, the website data to be crawled is data displayed on a web page. It may be bill data, shopping record data, test data and so on; no limitation is imposed here.
The data identifier of the website data to be crawled is an identifier that uniquely determines that data. It may be determined from the URL of the website to which the data belongs, the user name, and so on. For example, when the website data to be crawled is bill data, the identifier may be generated from the website URL, the user name and the bill identifier; when it is a shopping record, the identifier may be generated from the website URL, the seller name and the buyer account.
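One possible way to generate such an identifier is sketched below. The application does not say how the components are combined; hashing the joined components with SHA-256, the separator character and the name `make_data_id` are all assumptions made for illustration.

```python
import hashlib

def make_data_id(*parts: str) -> str:
    """Build a deterministic identifier from components such as URL, user name and bill ID."""
    # Join with a control character so that ("ab", "c") and ("a", "bc") do not collide.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical components for a bill and for a shopping record.
bill_id = make_data_id("https://bank.example/bills", "alice", "bill-2017-08")
shop_id = make_data_id("https://shop.example/orders", "SellerCo", "buyer-42")
```

The same components always yield the same identifier, so it can serve as the lookup key for the locally stored data in step S204.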
The generation date of the website data to be crawled is the date to which that data relates. It may be a specific day, month or year, or a date range, for example from June 1 to September 1. When the website data to be crawled is bill data, its generation date is the billing date. When it is shopping record data, the generation date is the date on which the order was placed; when several shopping records are involved, there may be several generation dates.
S204: Acquire the generation date of the locally stored website data corresponding to the data identifier.
Specifically, in the previous crawl the crawler terminal stored the crawled website data locally. For example, if the previous crawl fetched the bill data for July 1 to August 1 and the current crawl needs the bill data for June 1 to September 1, the crawler terminal does not need to crawl the July 1 to August 1 bill data again, because it is already stored locally.
S206: When the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data.
Specifically, the generation dates differ when the two date ranges differ. In the example above, the generation date of the website data to be crawled is June 1 to September 1, while that of the locally stored website data is July 1 to August 1. Because the bill data for August 2 to September 1 is not stored locally, it can be crawled first; this is the website data to be crawled whose generation date is before the generation date of the locally stored website data.
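The date-range comparison in this example can be sketched as follows. The sketch assumes the locally stored data covers a single contiguous, inclusive range; the function name `split_request` and the year 2017 are illustrative assumptions. The segment the text crawls first (August 2 to September 1 in its example) is returned as `newer`.

```python
from datetime import date, timedelta

def split_request(req_start, req_end, loc_start, loc_end):
    """Split a requested range into (newer, overlap, older) relative to the local range.

    Each element is an inclusive (start, end) tuple, or None when that part is empty.
    """
    one = timedelta(days=1)
    newer = (max(req_start, loc_end + one), req_end) if req_end > loc_end else None
    older = (req_start, min(req_end, loc_start - one)) if req_start < loc_start else None
    ov_start, ov_end = max(req_start, loc_start), min(req_end, loc_end)
    overlap = (ov_start, ov_end) if ov_start <= ov_end else None
    return newer, overlap, older

newer, overlap, older = split_request(
    date(2017, 6, 1), date(2017, 9, 1),   # requested: June 1 to September 1
    date(2017, 7, 1), date(2017, 8, 1),   # stored locally: July 1 to August 1
)
```

Here `newer` is August 2 to September 1, `overlap` is July 1 to August 1, and `older` is June 1 to June 30, matching the three parts discussed in the text.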
S208: Output the crawled website data whose generation date is before the generation date of the locally stored website data.
Specifically, on the one hand, the crawler terminal can use a first thread to crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data, and present the crawled data to the user in real time, which keeps the display fast and improves the user experience. On the other hand, it can use a second thread to compare the format of the newly crawled website data with the format of the locally stored website data. For example, because the volume of website data whose generation date is before the generation date of the locally stored website data is large, the crawler terminal can crawl it in stages. It may first crawl the data for August 25 to September 1; once that data has been crawled, the second thread is triggered to check whether its format is the same as that of the locally stored data for July 1 to August 1, while the first thread continues to crawl the data for August 2 to August 25.
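A minimal sketch of this two-thread arrangement, using Python's standard `threading` module, is given below. The function name, the `fetch` callback and the idea of treating the format as a set of field names are assumptions for illustration; the application does not prescribe an implementation.

```python
import threading

def crawl_with_format_check(batches, local_fields, fetch):
    """Crawl batches in order on one thread; once the first batch has arrived,
    a second thread compares its field names with the locally stored ones."""
    shown = []            # records presented to the user, in crawl order
    first_batch = []      # copy of the first crawled batch for the checker
    outcome = {}          # written by the checker thread
    first_ready = threading.Event()

    def crawler():
        try:
            for i, batch in enumerate(batches):
                records = fetch(batch)        # hypothetical page fetch
                shown.extend(records)
                if i == 0:
                    first_batch.extend(records)
                    first_ready.set()         # checker may start; crawling continues
        finally:
            first_ready.set()                 # never leave the checker waiting

    def checker():
        first_ready.wait()
        outcome["same_format"] = all(set(r) == set(local_fields) for r in first_batch)

    threads = [threading.Thread(target=crawler), threading.Thread(target=checker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shown, outcome["same_format"]
```

The event lets the format check begin as soon as the first stage is available, without waiting for the remaining stages to finish.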
S210: Compare the format of the crawled website data with the format of the locally stored website data.
Specifically, the format of the website data to be crawled is its display format; for example, it may be displayed as a table with five fields. Comparing the format of the website data to be crawled with that of the locally stored website data determines whether the locally stored data is dirty: only when the format of the data on the target website matches the format of the locally stored data is the locally stored data considered valid, in which case it can be output directly for the user to view.
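As a small illustration of the format check, the sketch below treats two records as having the same format when their field names match, in line with the field-matching steps listed earlier. The record layout and the field names are hypothetical.

```python
def same_format(crawled_record: dict, local_record: dict) -> bool:
    """Two records share a format when they have exactly the same field names."""
    return set(crawled_record) == set(local_record)

# Hypothetical five-field bill records.
local = {"date": "2017-07-01", "payee": "A", "amount": 10, "currency": "CNY", "status": "paid"}
crawled_same = {"date": "2017-08-02", "payee": "B", "amount": 20, "currency": "CNY", "status": "due"}
crawled_new_field = dict(crawled_same, note="late fee")   # the site added a field

local_is_valid = same_format(crawled_same, local)
local_is_dirty = not same_format(crawled_new_field, local)
```

In the second case the locally stored records lack the new field, so they would be marked dirty and crawled again.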
S212: When the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
Specifically, when the formats are the same, the website to which the data belongs has not changed and neither has its data format, so the locally stored website data can be output directly. This reduces the amount of data the crawler terminal has to crawl and so speeds up the output and display of the crawled data.
With the website data crawling method, apparatus, computer device and readable storage medium described above, the locally stored website data is first obtained according to the data identifier before any data is crawled. When the generation dates of the locally stored website data and of the website data to be crawled differ, the portion of the data whose generation date comes first is crawled and output for display. When the format of the crawled data is the same as the format of the locally stored website data, the website data to be crawled that has the same format as the locally stored website data no longer needs to be crawled; the locally stored website data is output directly instead. This reduces the amount of data crawled and so speeds up the output and display of the crawled data.
In one embodiment, the website data crawling method may further include: when the format of the crawled website data differs from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as that of the locally stored website data; and outputting that crawled website data.
In this embodiment, the format of the website data already crawled is first compared with the format of the locally stored website data, and only when the two formats differ does crawling continue with the website data to be crawled whose generation date is the same as that of the locally stored website data. This ensures that the user can view the displayed website data in real time, and allows the crawl to proceed in segments as needed, which improves crawling efficiency.
In one embodiment, the website data crawling method may further include: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl that website data; and outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
In this embodiment, when the website data to be crawled includes both data whose generation date is after the generation date of the locally stored website data and data whose generation date is before it, the data whose generation date is before is crawled first and the data whose generation date is after is crawled next. The website data to be crawled is thus crawled in segments, which both ensures that the user can view the displayed data in real time and improves crawling efficiency.
Referring to FIG. 3, which is a sequence diagram of the website data crawling method in one embodiment, the method includes:
First, the user terminal sends a crawl request to the crawler terminal, for example a request to crawl the bill data for June 1 to September 1. The crawler terminal first queries the bill data already stored in its local database. If the stored bill data covers July 1 to August 1, the crawler terminal first crawls the bill data for August 2 to September 1 from the billing web page and returns the crawled bill data to the user terminal through the first thread.
Then, through the second thread, the crawler terminal compares the format of the crawled bill data with the format of the locally stored bill data. If the formats differ, it marks the bill data stored in the local database as dirty, continues by crawling the bill data for July 1 to August 1, and sends the crawled bill data to the user terminal. If the formats are the same, it sends the bill data stored in the local database directly to the user terminal; the bill data for July 1 to August 1 does not need to be crawled again.
Finally, the crawler terminal determines whether the crawl is complete, that is, whether any bill data remains uncrawled, such as the bill data for June 1 to June 30 in this embodiment. If so, it continues by crawling the bill data for June 1 to June 30 and returns the crawled bill data to the user terminal.
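The sequence above can be condensed into a single-threaded sketch. It assumes the request fully covers the locally stored range, that records are dictionaries, and that the format is the set of field names; `serve_request` and the `crawl` callback are hypothetical names, and the two threads of the text are omitted for brevity.

```python
from datetime import date, timedelta

def serve_request(req, local, local_records, crawl):
    """req and local are inclusive (start, end) date tuples; crawl(start, end) returns records.

    Returns (segments in output order, whether the local data was found dirty).
    """
    one = timedelta(days=1)
    out = []
    # 1. Crawl and output the part that is not stored locally and is handled first.
    first = crawl(local[1] + one, req[1]) if req[1] > local[1] else []
    out.append(first)
    # 2. Compare the format of what was just crawled with the local records.
    dirty = bool(first and local_records) and set(first[0]) != set(local_records[0])
    # 3. Re-crawl the overlapping range only when the local copy is dirty.
    out.append(crawl(local[0], local[1]) if dirty else local_records)
    # 4. Crawl whatever remains on the other side of the local range.
    if req[0] < local[0]:
        out.append(crawl(req[0], local[0] - one))
    return out, dirty
```

With the dates from the example, a clean local copy leads to two crawl calls (August 2 to September 1, then June 1 to June 30), while a dirty copy adds a third call for July 1 to August 1.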
In the above embodiment, the website data to be crawled is divided into three parts: data whose generation date is before the generation date of the locally stored website data, data whose generation date is the same as that of the locally stored website data, and data whose generation date is after it. The crawler terminal first crawls the data whose generation date is before the generation date of the locally stored website data, that is, the bill data for August 2 to September 1. It then checks whether the format of the crawled data differs from that of the locally stored website data, which determines whether the locally stored website data can be used directly; in other words, comparing the two formats determines whether the locally stored website data is dirty. When the format of the data on the target website changes, the locally stored website data no longer matches the format of the website data to be crawled; in particular, if a field has been added to the website data to be crawled, the locally stored website data lacks some information. The format of the locally stored website data therefore has to be checked before that data is used directly.
When the two formats are the same, the locally stored website data is sent directly to the user terminal for display. When there is website data to be crawled whose generation date is before the generation date of the locally stored website data, the crawler terminal continues by crawling that data and sends the crawled website data to the user terminal. This reduces the amount of data crawled and so speeds up the output and display of the crawled data.
在其中一个实施例中,请参阅图4,图4为一实施例中分段爬取步骤的流程图,网络数据爬取方法还包括一分段爬取步骤,该分段爬取步骤可以用于爬取继续爬取产生日期在本地存储的网站数据的产生日期之前的待爬取网站数据,产生日期与本地存储的网站数据的产生日期相同的待爬取网站数据以及产生日期在本地存储的网站数据的产生日期之后的待爬取网站数据中,本实施例以产生日期与本地存储的网站数据的产生日期相同的待爬取网站数据为例进行说明,该分段爬取的步骤可以包括:In one embodiment, please refer to FIG. 4. FIG. 4 is a flowchart of a step-by-step crawling step in an embodiment. The network data crawling method further includes a segment crawling step, and the segment crawling step can be used. Crawling continues to crawl the website data to be crawled before the date of generation of the locally stored website data, and the date to be crawled is the same as the date of the locally stored website data, and the date of generation is locally stored. In the data to be crawled after the date of the generation of the website data, the embodiment is described by taking the data of the website to be crawled having the same date as the date of the website data stored locally as an example. The step of the step crawling may include :
S402: When the website data to be crawled that has the same generation date as the locally stored website data is longer than a preset length, crawl that website data segment by segment in sequence.
Specifically, the preset length refers to a length of the website data to be crawled, where one record counts as one unit of length. For example, if a bill stores 10 records, its data length is 10. The preset length is set according to the amount of data the crawler terminal can read at one time, or the amount of data the web interface of the user terminal can display at one time. For example, the preset length may be set to 10, 15, or 12 records; no limitation is imposed here.
The example above is used again here. Suppose the billing data that has the same generation date as the locally stored website data covers July 1 to August 1 and contains 35 records. The crawler terminal crawls according to the order of the generation dates, starting with the data that comes first in that order: it first crawls the 10 billing records from July 25 to August 1, then the 10 billing records from July 15 to July 24, then the 10 billing records from July 5 to July 14, and finally the 5 billing records from July 1 to July 4.
S404: Output, segment by segment, the crawled website data whose generation date is the same as that of the locally stored website data.
Specifically, whenever the crawler terminal has crawled billing data, it outputs that data. For example, once the crawler terminal has crawled the 10 billing records from July 25 to August 1, it sends them to the user terminal for display; it then crawls the 10 billing records from July 15 to July 24 and sends them to the user terminal for display, and so on until crawling is complete. Alternatively, the crawler terminal may crawl the 10 billing records from July 25 to August 1 in one thread and send them to the user terminal for display through another thread, while the original thread goes on to crawl the 10 billing records from July 15 to July 24. Once the original thread has crawled those records, the other thread sends them to the user terminal for display, and so on until crawling is complete.
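Steps S402 and S404 can be illustrated with a short sketch. It assumes the records are already available as an ordered sequence and that one record counts as one unit of length; the names `segments`, `crawl_in_segments`, and `output` are hypothetical and chosen only for this example.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def segments(records: Sequence[T], preset_length: int) -> Iterator[List[T]]:
    """Split records into consecutive segments of at most preset_length."""
    if preset_length <= 0:
        raise ValueError("preset_length must be positive")
    for start in range(0, len(records), preset_length):
        yield list(records[start:start + preset_length])


def crawl_in_segments(
    records: Sequence[T],
    preset_length: int,
    output: Callable[[List[T]], None],
) -> None:
    """Output each segment as soon as it is available (S404)."""
    for segment in segments(records, preset_length):
        output(segment)
```

With 35 records and a preset length of 10, this produces segments of 10, 10, 10, and 5 records, matching the billing example.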
In the above embodiment, segmented crawling lets the crawler terminal crawl the remaining website data while sending the website data already crawled to the user terminal for display, balancing user experience and crawling efficiency.
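The two-thread variant, in which one thread crawls while another sends, could look like the following sketch. The use of a FIFO queue between the threads is an assumption of this example and is not stated in the application; `pipeline`, `send`, and `_DONE` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel telling the sender thread to stop


def pipeline(
    crawled_segments: Iterable[List[dict]],
    send: Callable[[List[dict]], None],
) -> None:
    """Crawl in the calling thread and send in a second thread."""
    buffer: queue.Queue = queue.Queue()

    def sender() -> None:
        # Sends segments in the order they were crawled.
        while True:
            item = buffer.get()
            if item is _DONE:
                break
            send(item)

    worker = threading.Thread(target=sender)
    worker.start()
    try:
        # Iterating stands in for crawling the next segment; the sender
        # can display earlier segments while this loop keeps going.
        for segment in crawled_segments:
            buffer.put(segment)
    finally:
        buffer.put(_DONE)
        worker.join()
```

Because a single sender thread reads from a FIFO queue, segments reach the user terminal in the order in which they were crawled.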
In one embodiment, referring to FIG. 5, which is a flowchart of step S210 in the embodiment shown in FIG. 2, step S210, that is, the step of comparing the format of the crawled website data with the format of the locally stored website data, may include:
S502: Match the fields of the crawled website data against the fields of the locally stored website data.
Specifically, the fields of the website data to be crawled are the items of content that the data covers. For example, a billing record may involve fields such as name, payee, payment time, and payment amount. These fields are matched against the fields of the locally stored website data. For example, when the fields of the website data to be crawled are name, payee, payment time, payment amount, and reason, while the fields of the locally stored website data are name, payee, payment time, and payment amount, the fields of the crawled website data are considered not to match the fields of the locally stored website data. In other words, unless the two sets of fields are exactly the same, the fields are considered not to match.
S504: When the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data.
S506: When the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
Specifically, when the fields of the crawled website data match the fields of the locally stored website data, that is, when the two sets of fields are exactly the same, the locally stored website data is usable. The locally stored website data is therefore sent directly to the user terminal for display, and that website data does not need to be crawled again. When the fields of the crawled website data do not match the fields of the locally stored website data, that is, when the two sets of fields are not exactly the same, the locally stored website data is dirty data. The crawler terminal therefore needs to crawl the website data to be crawled and send the crawled website data to the user terminal for display.
In the above embodiment, whether the crawled website data and the locally stored website data have the same format is determined by checking whether the fields of the crawled website data match the fields of the locally stored website data, which keeps the decision logic simple.
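The field-matching rule of steps S502 to S506 can be sketched as a comparison of field-name sets. This assumes each record is represented as a mapping from field name to value; the function name `fields_match` and the English field names are hypothetical stand-ins for the fields listed in the billing example.

```python
from typing import Mapping


def fields_match(crawled: Mapping[str, object], local: Mapping[str, object]) -> bool:
    """Return True only when both records have exactly the same field names."""
    return set(crawled.keys()) == set(local.keys())


local_record = {
    "name": "...",
    "payee": "...",
    "payment_time": "...",
    "payment_amount": "...",
}

# The target website has added a "reason" field, so the local copy is stale.
crawled_record = dict(local_record, reason="...")

local_is_usable = fields_match(crawled_record, local_record)
```

Here `local_is_usable` is `False`, so under the rule above the local copy would be treated as dirty data and the records would be crawled again.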
It should be understood that although the steps in the flowcharts of FIGS. 2-5 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order the arrows indicate. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times; nor are they necessarily performed one after another, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to FIG. 6, which is a block diagram of a website data crawling apparatus in an embodiment, the website data crawling apparatus includes:
An acquisition module 100, configured to acquire a data identifier and a generation date of website data to be crawled, and to acquire a generation date of locally stored website data corresponding to the data identifier;
A crawling module 200, configured to, when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data;
A first output module 300, configured to output the crawled website data whose generation date is before the generation date of the locally stored website data;
A comparison module 400, configured to compare the format of the crawled website data with the format of the locally stored website data; and
A second output module 500, configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
In one embodiment, the crawling module 200 may be further configured to, when the format of the crawled website data is different from the format of the locally stored website data, continue to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
the first output module 300 is further configured to output the crawled website data whose generation date is the same as the generation date of the locally stored website data.
In one embodiment, the crawling module 200 may be further configured to, when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continue to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data; and
the second output module 500 is further configured to output the crawled website data whose generation date is after the generation date of the locally stored website data.
In one embodiment, the crawling module 200 may be further configured to, when the website data to be crawled that has the same generation date as the locally stored website data is longer than a preset length, crawl that website data segment by segment in sequence, and to output, segment by segment, the crawled website data whose generation date is the same as that of the locally stored website data.
In one embodiment, the comparison module 400 may be further configured to match the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
For specific limitations of the website data crawling apparatus, refer to the limitations of the website data crawling method above; they are not repeated here. Each module in the above website data crawling apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a crawler terminal, and its internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface is used to communicate with external terminals over a network connection. The computer program, when executed by the processor, implements a website data crawling method. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen; a button, trackball, or touchpad provided on the housing of the computer device; or an external keyboard, touchpad, or mouse.
Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring a data identifier and a generation date of website data to be crawled; acquiring a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data; outputting the crawled website data whose generation date is before the generation date of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
In one embodiment, the processor, when executing the computer-readable instructions, may further implement the following steps: when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
In one embodiment, the processor, when executing the computer-readable instructions, may further implement the following steps: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data; and outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
In one embodiment, the processor, when executing the computer-readable instructions, may further implement the following steps: when the website data to be crawled that has the same generation date as the locally stored website data is longer than a preset length, crawling that website data segment by segment in sequence; and outputting, segment by segment, the crawled website data whose generation date is the same as that of the locally stored website data.
In one embodiment, the processor, when executing the computer-readable instructions, may further implement the following steps: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, determining that the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, determining that the format of the crawled website data is different from the format of the locally stored website data.
One or more non-volatile storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a data identifier and a generation date of website data to be crawled; acquiring a generation date of locally stored website data corresponding to the data identifier; when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data; outputting the crawled website data whose generation date is before the generation date of the locally stored website data; comparing the format of the crawled website data with the format of the locally stored website data; and when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
In one embodiment, the computer-readable instructions, when executed by the processor, may further implement the following steps: when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
In one embodiment, the computer-readable instructions, when executed by the processor, may further implement the following steps: when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data; and outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
In one embodiment, the computer-readable instructions, when executed by the processor, may further implement the following steps: when the website data to be crawled that has the same generation date as the locally stored website data is longer than a preset length, crawling that website data segment by segment in sequence; and outputting, segment by segment, the crawled website data whose generation date is the same as that of the locally stored website data.
In one embodiment, the computer-readable instructions, when executed by the processor, may further implement the following steps: matching the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, determining that the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, determining that the format of the crawled website data is different from the format of the locally stored website data.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A website data crawling method, comprising:
    acquiring a data identifier and a generation date of website data to be crawled;
    acquiring a generation date of locally stored website data corresponding to the data identifier;
    when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is before the generation date of the locally stored website data;
    outputting the crawled website data whose generation date is before the generation date of the locally stored website data;
    comparing a format of the crawled website data with a format of the locally stored website data; and
    when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
  2. The method according to claim 1, wherein the method further comprises:
    when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  3. The method according to claim 2, wherein the method further comprises:
    when there is website data to be crawled whose generation date is after the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is after the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is after the generation date of the locally stored website data.
  4. The method according to claim 2, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data comprises:
    when the generation date of the website data to be crawled that is the same as the generation date of the locally stored website data is greater than a preset length, crawling, segment by segment in sequence, the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting, segment by segment, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  5. The method according to claim 1, wherein the comparing the format of the crawled website data with the format of the locally stored website data comprises:
    matching fields of the crawled website data against fields of the locally stored website data;
    when the fields of the crawled website data match the fields of the locally stored website data, determining that the format of the crawled website data is the same as the format of the locally stored website data; and
    when the fields of the crawled website data do not match the fields of the locally stored website data, determining that the format of the crawled website data is different from the format of the locally stored website data.
  6. A website data crawling apparatus, comprising:
    an acquisition module, configured to acquire a data identifier and a generation date of website data to be crawled, and to acquire a generation date of locally stored website data corresponding to the data identifier;
    a crawling module, configured to, when the generation date of the website data to be crawled is different from the generation date of the locally stored website data, crawl the website data to be crawled whose generation date is before the generation date of the locally stored website data;
    a first output module, configured to output the crawled website data whose generation date is before the generation date of the locally stored website data;
    a comparison module, configured to compare a format of the crawled website data with a format of the locally stored website data; and
    a second output module, configured to, when the format of the crawled website data is the same as the format of the locally stored website data, output the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
  7. The apparatus according to claim 6, wherein the crawling module is further configured to, when the format of the crawled website data is different from the format of the locally stored website data, continue crawling the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    the first output module is further configured to output the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  8. The apparatus according to claim 7, wherein the crawling module is further configured to, when there is website data to be crawled whose generation date is later than the generation date of the locally stored website data, continue crawling the website data to be crawled whose generation date is later than the generation date of the locally stored website data; and
    the second output module is further configured to output the crawled website data whose generation date is later than the generation date of the locally stored website data.
  9. The apparatus according to claim 7, wherein the crawling module is further configured to, when the generation date of the website data to be crawled that is the same as the generation date of the locally stored website data is greater than a preset length, crawl, segment by segment in sequence, the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and to output, in segments, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  10. The apparatus according to claim 6, wherein the comparison module is further configured to match the fields of the crawled website data against the fields of the locally stored website data; when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
  11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    acquiring a data identifier and a generation date of website data to be crawled;
    acquiring the generation date of locally stored website data corresponding to the data identifier;
    when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is earlier than the generation date of the locally stored website data;
    outputting the crawled website data whose generation date is earlier than the generation date of the locally stored website data;
    comparing the format of the crawled website data with the format of the locally stored website data; and
    when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
  12. The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  13. The computer device according to claim 12, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    when there is website data to be crawled whose generation date is later than the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is later than the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is later than the generation date of the locally stored website data.
  14. The computer device according to claim 12, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data, performed when the processor executes the computer-readable instructions, comprises:
    when the generation date of the website data to be crawled that is the same as the generation date of the locally stored website data is greater than a preset length, crawling, segment by segment in sequence, the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting, in segments, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  15. The computer device according to claim 11, wherein the comparing of the format of the crawled website data with the format of the locally stored website data, performed when the processor executes the computer-readable instructions, comprises:
    matching the fields of the crawled website data against the fields of the locally stored website data;
    when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and
    when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
  16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a data identifier and a generation date of website data to be crawled;
    acquiring the generation date of locally stored website data corresponding to the data identifier;
    when the generation date of the website data to be crawled differs from the generation date of the locally stored website data, crawling the website data to be crawled whose generation date is earlier than the generation date of the locally stored website data;
    outputting the crawled website data whose generation date is earlier than the generation date of the locally stored website data;
    comparing the format of the crawled website data with the format of the locally stored website data; and
    when the format of the crawled website data is the same as the format of the locally stored website data, outputting the locally stored website data whose generation date is the same as the generation date of the website data to be crawled.
  17. The storage medium according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    when the format of the crawled website data is different from the format of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  18. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    when there is website data to be crawled whose generation date is later than the generation date of the locally stored website data, continuing to crawl the website data to be crawled whose generation date is later than the generation date of the locally stored website data; and
    outputting the crawled website data whose generation date is later than the generation date of the locally stored website data.
  19. The storage medium according to claim 17, wherein the continuing to crawl the website data to be crawled whose generation date is the same as the generation date of the locally stored website data, performed when the computer-readable instructions are executed by the processor, comprises:
    when the generation date of the website data to be crawled that is the same as the generation date of the locally stored website data is greater than a preset length, crawling, segment by segment in sequence, the website data to be crawled whose generation date is the same as the generation date of the locally stored website data; and
    outputting, in segments, the crawled website data whose generation date is the same as the generation date of the locally stored website data.
  20. The storage medium according to claim 16, wherein the comparing of the format of the crawled website data with the format of the locally stored website data, performed when the computer-readable instructions are executed by the processor, comprises:
    matching the fields of the crawled website data against the fields of the locally stored website data;
    when the fields of the crawled website data match the fields of the locally stored website data, the format of the crawled website data is the same as the format of the locally stored website data; and
    when the fields of the crawled website data do not match the fields of the locally stored website data, the format of the crawled website data is different from the format of the locally stored website data.
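
As a reading aid only, the steps recited in the claims above can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the record shape (a dict with a "date" key), the names `formats_match`, `in_segments`, `incremental_crawl` and `max_len`, and the reading of "preset length" as a number of records are all hypothetical. Field matching is approximated by comparing field names, and the case in which the dates do not differ is left out because the claims in this section do not describe it.

```python
from datetime import date
from typing import Iterator

# Hypothetical record shape: field name -> value, always with a "date" key.
Record = dict


def formats_match(crawled: Record, stored: Record) -> bool:
    """Treat two records as having the same format when their field names match."""
    return set(crawled) == set(stored)


def in_segments(records: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def incremental_crawl(remote: list[Record], local: list[Record],
                      local_date: date, max_len: int = 2) -> list[Record]:
    """Return records in the order the claimed steps would output them."""
    output: list[Record] = []
    if all(r["date"] == local_date for r in remote):
        return output  # dates do not differ; this case is outside the sketch

    # Crawl and output the records generated before the local generation date.
    older = [r for r in remote if r["date"] < local_date]
    output.extend(older)

    # Same format: output the locally stored records instead of re-crawling.
    if older and local and formats_match(older[0], local[0]):
        output.extend(local)
        return output

    # Different format: continue with the records of the local generation date,
    # segment by segment when the batch is longer than max_len (assumed meaning).
    same_day = [r for r in remote if r["date"] == local_date]
    if len(same_day) > max_len:
        for segment in in_segments(same_day, max_len):
            output.extend(segment)  # each segment would be output separately
    else:
        output.extend(same_day)

    # Then continue with the records generated after the local generation date.
    output.extend(r for r in remote if r["date"] > local_date)
    return output
```

In this sketch, matching field names lets the locally stored copy be reused, while a mismatch falls back to crawling the same-date and later records again, which mirrors the dependency of claims 12 to 14 on claim 11.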
PCT/CN2018/080126 2017-07-26 2018-03-23 Website data crawling method and apparatus, computer device and readable storage medium WO2019019673A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710620026.XA CN107729344B (en) 2017-07-26 2017-07-26 Website data crawling method and device, computer equipment and readable storage medium
CN201710620026.X 2017-07-26

Publications (1)

Publication Number Publication Date
WO2019019673A1 (en) 2019-01-31

Family

ID=61201694

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/080126 WO2019019673A1 (en) 2017-07-26 2018-03-23 Website data crawling method and apparatus, computer device and readable storage medium

Country Status (2)

Country Link
CN (1) CN107729344B (en)
WO (1) WO2019019673A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729344B (en) * 2017-07-26 2020-08-28 深圳壹账通智能科技有限公司 Website data crawling method and device, computer equipment and readable storage medium
CN109670100B (en) * 2018-12-21 2020-06-26 第四范式(北京)技术有限公司 Page data capturing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090307211A1 (en) * 2008-06-05 2009-12-10 International Business Machines Corporation Incremental crawling of multiple content providers using aggregation
CN102195802A (en) * 2010-03-18 2011-09-21 中兴通讯股份有限公司 Terminal software transmission method, server and terminal
CN106126716A (en) * 2016-06-30 2016-11-16 北京奇艺世纪科技有限公司 A kind of data crawling method and device
CN106980687A (en) * 2017-03-31 2017-07-25 北京奇艺世纪科技有限公司 A kind of resource downloading system, method and reptile download system
CN107729344A (en) * 2017-07-26 2018-02-23 上海壹账通金融科技有限公司 Website data crawling method, device, computer equipment and readable storage medium storing program for executing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105592118B (en) * 2014-10-23 2018-11-13 阿里巴巴集团控股有限公司 Synchronous user applies method, system and the server-side of data
CN104516956B (en) * 2014-12-16 2017-12-01 中国科学院声学研究所 A kind of site information increment crawling method
CN106649357A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data processing method and apparatus used for crawler program

Also Published As

Publication number Publication date
CN107729344B (en) 2020-08-28
CN107729344A (en) 2018-02-23

Similar Documents

Publication Publication Date Title
US20210097112A1 (en) Webpage data processing method and device, computer device and computer storage medium
WO2019214080A1 (en) Approval information processing method and apparatus, computer device, and storage medium
WO2020151333A1 (en) Page loading method, apparatus, computer device and storage medium
WO2020207034A1 (en) Method and device for generating interface test case, and storage medium and server
CN107463545A (en) A kind of generation method, electronic equipment and the storage medium of online treaty documents
TWI587158B (en) Paging display control method and device
US8560519B2 (en) Indexing and searching employing virtual documents
US20120317486A1 (en) Embedded web viewer for presentation applications
TWI683225B (en) Script generation method and device
WO2019091018A1 (en) Knowledge graph establishment method and device, computer device and storage medium
US10073826B2 (en) Providing action associated with event detected within communication
WO2019200741A1 (en) Project evaluation information processing method and apparatus, computer device, and storage medium
WO2016078530A1 (en) Method and device for verifying identity information
US10397306B2 (en) System and method for translating versioned data service requests and responses
CN110221871B (en) Webpage acquisition method and device, computer equipment and storage medium
CN105094753A (en) Method, device, and system for drawing wireframe
US10635725B2 (en) Providing app store search results
WO2019019673A1 (en) Website data crawling method and apparatus, computer device and readable storage medium
CN104657359A (en) Webpage content and style recording method by using website
TW201426337A (en) System and method for creating object files
CN113901362A (en) Webpage display method, device, equipment, storage medium and program product
US20180300424A1 (en) Systems and methods for providing structured markup content retrievable by a service that provides rich search results
US20200380071A1 (en) Autoform Filling Using Text from Optical Character Recognition and Metadata for Document Types
CN109766480B (en) Data query method and device
US20150261733A1 (en) Asset collection service through capture of content

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.05.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18838400

Country of ref document: EP

Kind code of ref document: A1