CN111444411A

CN111444411A - Network data increment acquisition method, device, equipment and storage medium

Info

Publication number: CN111444411A
Application number: CN202010242238.0A
Authority: CN
Inventors: 张振海; 廖海波
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-24

Abstract

The invention discloses a network data incremental acquisition method, device, equipment and storage medium. The method decides whether to perform an acquisition action by judging whether the content of the page to be acquired has been updated, and, according to the way the data is updated, classifies page identifiers into new page identifiers and historical page identifiers. If the page identifier is a historical page identifier, the page data to be collected is downloaded, and the locality-sensitive hash value corresponding to the historical page data is calculated with the Simhash algorithm. The locality-sensitive hash value from the last acquisition of the page to be acquired is loaded from the cache data, and the similarity between the last-acquired locality-sensitive hash value and the currently calculated locality-sensitive hash value is computed based on a preset distance measurement algorithm. When the similarity is greater than the preset threshold, the cached locality-sensitive hash value is updated, and the historical page data is further analyzed and stored in the data acquisition library. This reduces the resource consumption of incremental data acquisition, improves the efficiency of incremental acquisition of network data, and achieves the goal of incremental acquisition to the greatest extent.

Description

Network data increment acquisition method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of financial technology (Fintech), in particular to a network data increment acquisition method, a device, equipment and a computer readable storage medium.

Background

With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Incremental acquisition of network data is no exception; however, because of the security and real-time requirements of the financial industry, higher demands are placed on incremental acquisition technology. At present, incremental data acquisition mainly takes three forms: periodically acquiring updated website data with deduplication based on the page link URL; periodically acquiring updated website data with deduplication based on page content; and directly acquiring the website data in full. The first approach cannot identify updates on pages whose content has changed while the URL has not, so collected data is easily missed. The second approach is too sensitive to website updates and requires a large amount of computation. The third approach has to acquire all page data, so data acquisition efficiency is low.

Disclosure of Invention

The invention mainly aims to provide a network data increment acquisition method, a device, equipment and a computer readable storage medium, and aims to solve the technical problems of low acquisition efficiency and low accuracy of the existing incremental data acquisition methods.

In order to achieve the above object, the present invention provides a network data increment acquisition method, which comprises the following steps:

acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier;

if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;

acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm;

and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library.

Optionally, the step of acquiring the historical page data to be collected if the page identifier of the page to be collected is the historical page identifier, and calculating a locality-sensitive hash value corresponding to the historical page data according to the specific locality-sensitive hash algorithm Simhash, specifically includes:

if the page identifier of the page to be acquired is the historical page identifier, acquiring the historical page data to be acquired, trimming code irrelevant to the web page content, and retaining only the data of the web page body as the historical page data to be acquired;

segmenting the historical page data into words, extracting the keywords obtained after segmentation and the preset weight corresponding to each keyword, and converting the historical page data into a vector formed by a group of weighted feature values;

and calculating, based on the specific locality-sensitive hash algorithm Simhash, a weighted hash value corresponding to the weighted feature value vector as the locality-sensitive hash value corresponding to the historical page data.
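The three sub-steps above (segmentation, weighting, weighted hashing) can be illustrated with a minimal Simhash sketch. It is not the patented implementation: the whitespace-style tokenizer, the use of term frequency as the "preset weight", the MD5-based per-keyword hash, and the 64-bit width are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import hashlib
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase word split. Chinese page text would
    # need a real word segmenter instead.
    return re.findall(r"\w+", text.lower())


def feature_hash(token: str, bits: int = 64) -> int:
    # Stable per-keyword hash; MD5 is an arbitrary choice for this sketch.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(text: str, bits: int = 64) -> int:
    # Assumption: the weight of a keyword is its term frequency.
    weights = Counter(tokenize(text))
    totals = [0] * bits
    for token, weight in weights.items():
        h = feature_hash(token, bits)
        for i in range(bits):
            # Add the weight where the keyword hash has a 1 bit,
            # subtract it where the hash has a 0 bit.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because every keyword contributes to every bit in proportion to its weight, two texts that share most of their keywords tend to produce fingerprints that differ in only a few bit positions, which is the property the method relies on.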

Optionally, the step of updating the cached locality-sensitive hash value, analyzing the historical page data, and storing the historical page data in the data collection library when the similarity is greater than the preset threshold specifically includes:

when the similarity is greater than a preset threshold, parsing the downloaded historical page data, and calculating, based on a hash algorithm, a hash value of the key content of the parsed historical page data, to serve as the key content data fingerprint corresponding to the historical page data;

judging whether the key content data fingerprint corresponding to the historical page data exists in the cached historical data fingerprints;

and if the key content data fingerprint corresponding to the historical page data does not exist in the historical data fingerprints, storing the parsed historical page data in the data acquisition library.
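As a rough sketch of the key content fingerprint check described above: the choice of SHA-256, the canonical JSON encoding, the field names, and the in-memory set and list standing in for the cached fingerprints and the data acquisition library are all assumptions made for illustration.

```python
import hashlib
import json


def content_fingerprint(record: dict, key_fields: tuple[str, ...]) -> str:
    # Only the fields treated as "key content" contribute to the fingerprint.
    key_content = {name: record.get(name) for name in key_fields}
    canonical = json.dumps(key_content, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def store_if_new(record: dict, key_fields: tuple[str, ...],
                 seen: set[str], store: list[dict]) -> bool:
    # `seen` stands in for the cached historical data fingerprints and
    # `store` for the data acquisition library (both hypothetical).
    fingerprint = content_fingerprint(record, key_fields)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    store.append(record)
    return True
```

Fields that change on every fetch, such as a hypothetical `fetched_at` timestamp, would be left out of `key_fields` so that they do not defeat deduplication.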

Optionally, before the steps of obtaining a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further includes:

acquiring an acquisition page in the target website, and generating a data acquisition task set based on the acquisition page, wherein the data acquisition task set comprises at least one page to be acquired.

Optionally, after the steps of obtaining the locally sensitive hash value of the page to be acquired last time in the cache data, and calculating the similarity between the locally sensitive hash value of the page to be acquired last time and the locally sensitive hash value corresponding to the historical page data this time based on a preset distance measurement algorithm, the method further includes:

judging whether the similarity is greater than a preset threshold value or not;

and when the similarity is smaller than a preset threshold value, judging that the page data of the page to be collected is historical collected data, stopping analysis, and acquiring the next collection task in the data collection task set for data collection.

Optionally, after the step of updating the cached locality-sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library when the similarity is greater than a preset threshold, the method further includes:

updating, in the cache data, the locality-sensitive hash value from the last acquisition of the page to be acquired to the locality-sensitive hash value corresponding to the historical page data of the current acquisition.

Optionally, after the steps of obtaining a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further includes:

if the page identification of the page to be acquired is a new page identification, generating a new page data fingerprint of the new page, and judging whether the new page data fingerprint exists in preset historical acquired data fingerprints;

and if the new page data fingerprint does not exist in the historical acquisition data fingerprint, writing the new page data fingerprint into the historical data fingerprint, downloading and analyzing the new page data and storing the new page data into the data acquisition library.

In addition, in order to achieve the above object, the present invention further provides a network data increment acquisition device, including:

the page identification judging module is used for acquiring a page to be acquired in a target website, generating a page identification of the page to be acquired and judging whether the page identification of the page to be acquired is a new page identification or a historical page identification;

the page hash value calculation module is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;

the page similarity calculation module is used for acquiring the locality sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the locality sensitive hash value acquired last time on the historical page to be acquired and the locality sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm;

and the page data analysis module is used for updating the cached local sensitive hash value when the similarity is greater than a preset threshold value, analyzing the historical page data and storing the historical page data in a data acquisition library.

In addition, to achieve the above object, the present invention further provides network data incremental acquisition equipment, which includes: a memory, a processor, and a network data incremental acquisition program stored on the memory and executable on the processor, wherein the network data incremental acquisition program, when executed by the processor, implements the steps of the network data incremental acquisition method described above.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium, on which a network data incremental acquisition program is stored, and the network data incremental acquisition program, when executed by a processor, implements the steps of the network data incremental acquisition method as described above.

The invention provides a network data increment acquisition method, which comprises the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier; if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash; acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm; and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library. 
In this way, when the page to be acquired is determined to be a historical page, the method calculates the locality-sensitive hash value for the current acquisition of the page based on the specific locality-sensitive hash algorithm. It then calculates the similarity between the page data of the two acquisitions from the locality-sensitive hash value corresponding to the historical page data and the locality-sensitive hash value from the last acquisition of the page stored in the cache data, and thereby determines whether the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of computation and the resource consumption of incremental data acquisition, improves the accuracy and efficiency of incremental data acquisition, and solves the technical problems of low acquisition efficiency and low accuracy in existing incremental data acquisition methods.

Drawings

FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a network data incremental acquisition method according to a first embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

The network data increment acquisition equipment of the embodiment of the invention can be a PC (personal computer) or server equipment, and a Java virtual machine runs on the network data increment acquisition equipment.

As shown in fig. 1, the network data incremental acquisition device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a network data incremental acquisition program therein.

In the device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the network data incremental collecting program stored in the memory 1005 and perform the following operations in the network data incremental collecting method.

Based on the hardware structure, the embodiment of the network data increment acquisition method is provided.

Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a network data increment acquisition method according to the present invention, where the network data increment acquisition method includes:

step S10, acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier;

At present, there are three ways to implement incremental acquisition for a target website that is collected periodically:

The first method deduplicates based on the URL identifying a newly added web page. That is, before data acquisition, it is judged whether the URL of the currently accessed page is one whose data has already been acquired. If the URL of the currently accessed page is among the historically acquired page URLs and the forced update period has not been exceeded, acquisition of the page data is stopped; if the forced update period has been exceeded, the website data is considered to have been updated and data acquisition is performed.

The second method deduplicates based on web page content. That is, the data stream returned by the target website server, or the parsed content, is compared to determine whether the web page content has already been collected. The determination is generally made by calculating and comparing hash values of the web page content, and if the content has not been collected, it is written to storage. However, the hash values calculated by this method differ greatly even when the page data has changed only slightly. The method is too sensitive to website updates, and for large text content the algorithm is time-consuming and has low accuracy.

The third method is full acquisition. That is, all newly added page data is acquired and written to a storage medium, and whether the acquired data already exists in the medium is judged only when the data is written. This method collects a large amount of invalid data from pages that have not actually been updated, so acquisition efficiency is low, a lot of acquisition resources are wasted, a lot of redundant data is generated, and the target website is accessed too frequently.

To solve the above problems, in the present invention, when the page to be acquired is determined to be a historical page, the locality-sensitive hash value corresponding to the updated data of the page is calculated based on a specific locality-sensitive hash algorithm. Then, based on the locality-sensitive hash value corresponding to the historical page data and the locality-sensitive hash value from the last acquisition of the page (before the update) stored in the cache data, the similarity between the page data before and after the update is calculated, thereby determining whether the page data of the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of computation and the resource consumption of incremental data acquisition, and improves the accuracy and efficiency of incremental data acquisition.

Further, before the step S10, the method further includes:

In this embodiment, the new pages appearing on the target website in each time interval are obtained at a preset time interval, and the page URL corresponding to each new page is added to a preset list to generate a data acquisition task set, which comprises at least one page to be acquired. Each page URL in the data acquisition task set is then obtained in turn, and the data acquisition operation is carried out on each page to be acquired in sequence until all pages in the data acquisition task set have been processed.
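The task set handling described in this paragraph might look roughly like the following. The function names are hypothetical, the `collect` callable is a placeholder for the per-page acquisition operation, and dropping repeated URLs while preserving order is an assumption rather than something the description states.

```python
from collections import deque
from typing import Callable, Iterable


def build_task_set(new_page_urls: Iterable[str]) -> deque[str]:
    # Keep first-seen order and drop repeated URLs (assumption).
    return deque(dict.fromkeys(new_page_urls))


def run_tasks(tasks: deque[str], collect: Callable[[str], object]) -> list[tuple[str, object]]:
    # Process pages one by one until the task set is empty.
    results = []
    while tasks:
        url = tasks.popleft()
        results.append((url, collect(url)))
    return results
```

A scheduler would call `build_task_set` once per preset time interval and then drain the resulting queue with `run_tasks`.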

Further, after the step S10, the method further includes:

In this embodiment, if the page URL of the page to be acquired is a new page URL, the page to be acquired is identified as a newly generated page. A hash algorithm is called to calculate the hash value corresponding to the URL of the page to be acquired, and the hash value is set as the page data identifier corresponding to the page to be acquired, such as a data ID or a data fingerprint, that is, a uniform way of identifying different data contents. The historical data identifiers corresponding to the target website in the data acquisition library are then obtained and added to a Redis set, and the page data identifier is compared against the Redis set.
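A minimal sketch of the URL fingerprint check for new pages follows. To stay self-contained it uses a plain Python `set` as a stand-in for the Redis set named in the description; SHA-1 as the hash algorithm and both function names are assumptions.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    # The hash of the page URL serves as the page data identifier.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def is_new_fingerprint(fingerprint: str, seen: set[str]) -> bool:
    # `seen` stands in for the Redis set of historical data identifiers.
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```

Only when `is_new_fingerprint` returns `True` would the new page be downloaded, parsed, and stored.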

Step S20, if the page identifier of the page to be collected is the historical page identifier, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;

In this embodiment, if the page identifier of the page to be acquired is the historical page identifier, that is, the page to be acquired is an old page, it is further judged whether the page is an updated old page. The historical page data of the page to be acquired is acquired, and the locality-sensitive hash value corresponding to the historical page data is then calculated based on the specific locality-sensitive hash algorithm Simhash. Locality-sensitive hashing (Locality-Sensitive Hashing, LSH) is used to solve the nearest-neighbor search problem for massive data in high-dimensional spaces.

Step S30, obtaining the local sensitive hash value of the last time of the page to be collected in the cache data, and calculating the similarity between the local sensitive hash value of the last time of the historical page to be collected and the local sensitive hash value corresponding to the historical page data to be collected at this time based on a preset distance measurement algorithm;

In this embodiment, the locality-sensitive hash value from the last acquisition of the page to be acquired, stored in the cache data in advance, is obtained. Based on a preset distance measurement algorithm, such as the Hamming distance, the common Euclidean distance, the Minkowski distance, or the cosine distance, the distance between the locality-sensitive hash value corresponding to the historical page data and the locality-sensitive hash value from the last acquisition of the page is calculated, so as to compare the similarity of the page data before and after the update. That is, similarity is determined from a distance between points: similar points lie close together.
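For the Hamming distance option, the comparison of two integer fingerprints can be sketched as below. Two points are assumptions: the threshold value of 3 bits is hypothetical, and the comparison is expressed directly as a distance, with a larger distance taken to mean that the page has changed.

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def page_changed(previous: int, current: int, threshold: int = 3) -> bool:
    # Hypothetical rule: treat the page as updated when the fingerprints
    # differ in more than `threshold` bits.
    return hamming_distance(previous, current) > threshold
```

For example, `hamming_distance(0b1011, 0b0010)` is 2, because the XOR of the two values is `0b1001`.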

Further, after the step S30, the method further includes:

judging whether the similarity is greater than a preset threshold value or not;

In this embodiment, the similarity is compared with a preset threshold, and if the similarity exceeds the preset threshold, the similarity between the historical page data and the page data acquired last in the page to be acquired is higher, that is, the data difference between the page data before the update of the page to be acquired and the updated page data is smaller. And judging the page data of the page to be acquired as acquired data.

And step S40, when the similarity is larger than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library.

In this embodiment, if the similarity is smaller than the preset threshold, the similarity between the historical page data and the page data last acquired from the page to be acquired is lower, that is, the difference between the page data before and after the update is larger. The historical page data is then collected and stored in the data acquisition library. This embodiment provides a method that identifies increments by stepwise classification, so that deduplication is achieved to the greatest extent while resource consumption and redundancy are reduced, and the updated data of both new pages and old pages of the target site is collected. The locality-sensitive hash algorithm Simhash is introduced to identify whether an old page of the target site has been updated, which solves the problems that traditional hashing is too sensitive to changes in website pages and, for large text, is time-consuming and has low accuracy.

The embodiment provides a network data increment acquisition method, which includes acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier; if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash; acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm; and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library. 
In this way, when the page to be acquired is determined to be a historical page, the method calculates the locality-sensitive hash value for the current acquisition of the page based on the specific locality-sensitive hash algorithm. It then calculates the similarity between the page data of the two acquisitions from the locality-sensitive hash value corresponding to the historical page data and the locality-sensitive hash value from the last acquisition of the page stored in the cache data, and thereby determines whether the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of computation and the resource consumption of incremental data acquisition, improves the accuracy and efficiency of incremental data acquisition, and solves the technical problems of low acquisition efficiency and low accuracy in existing incremental data acquisition methods.

Further, based on the first embodiment of the network data incremental acquisition method of the present invention, a second embodiment of the network data incremental acquisition method of the present invention is provided.

In this embodiment, the step S20 specifically includes:

When the similarity is greater than a preset threshold, updating the cached locality sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library specifically comprises the following steps:

In this embodiment, if the target page of an acquisition task is an updated old page, the page data is first downloaded and the content irrelevant to the page HTML body is trimmed from the page to be acquired, for example data related to the page style, so that only the data marked as the body part is retained, generating the historical page data. The locality-sensitive hash value (Simhash value) corresponding to the historical page data is then calculated according to the specific locality-sensitive hash algorithm Simhash, and the Hamming distance algorithm is used to calculate the similarity of the new and old pages from the locality-sensitive hash value corresponding to the historical page data and the locality-sensitive hash value from the last acquisition of the page to be acquired (the locality-sensitive hash value from the last acquisition, before the page was updated, is cached in advance). The calculation process of the specific locality-sensitive hash algorithm Simhash is as follows: the historical page data is segmented into words; the key features (feature_n) are extracted after segmentation, together with their corresponding weights, giving a set of weighted feature values; a hash value is calculated for each feature and weighted by the corresponding weight; and the weighted hash values are combined to obtain the locality-sensitive hash value corresponding to the historical page data.
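The trimming step at the start of this embodiment (discarding style-related content and keeping only the body part) could be approximated with the Python standard library as follows. The class and function names are hypothetical, and treating `script` and `style` elements as the content to discard is an assumption.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    # Elements whose content is assumed irrelevant to the page body text.
    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside <body> and outside skipped elements.
        if self._in_body and self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The text returned by `extract_body_text` is what would then be segmented and hashed, so that purely cosmetic changes to styles or scripts do not alter the fingerprint.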

In this embodiment, whether to perform an acquisition action is decided by judging whether the content of the page to be acquired has been updated, and page identifiers are classified into new page identifiers and historical page identifiers according to the way the data is updated. If the page identifier is a historical page identifier, the page data to be collected is downloaded, and the locality-sensitive hash value corresponding to the historical page data is calculated with the Simhash algorithm. The locality-sensitive hash value from the last acquisition of the page to be acquired is loaded from the cache data, and the similarity between the last-acquired locality-sensitive hash value and the currently calculated locality-sensitive hash value is computed based on a preset distance measurement algorithm. When the similarity is greater than the preset threshold, the cached locality-sensitive hash value is updated, and the historical page data is further analyzed and stored in the data acquisition library. This reduces the resource consumption of incremental data acquisition, improves the efficiency of incremental acquisition of network data, and achieves the goal of incremental acquisition to the greatest extent.

Further, after the step S40, the method further includes:

In this embodiment, if the locality-sensitive hash values are similar, the next acquisition task is performed. Otherwise, the page data is parsed and the Simhash value cached for the corresponding page to be acquired is updated, that is, the locality-sensitive hash value from the last acquisition of the page is replaced with the locality-sensitive hash value corresponding to the historical page data. The key content of the parsed data is then extracted, and a data fingerprint is generated from it with a hash algorithm. The data fingerprint is written into Redis to judge whether the content has already been collected: if so, the next acquisition task is performed; otherwise, the parsed data is written into the database, completing one acquisition task. After all tasks in the acquisition task set are finished, the incremental acquisition is complete and the system waits for the next scheduling.
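The key-content fingerprint check can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's code: an in-memory set stands in for the Redis store so that the sketch runs without a server, SHA-1 is an assumed choice of hash algorithm, a Python list stands in for the database, and the names `FingerprintStore`, `content_fingerprint` and `store_if_new` are hypothetical.

```python
import hashlib


class FingerprintStore:
    """In-memory stand-in for the Redis set of collected fingerprints."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, fingerprint):
        """Record the fingerprint; return True only if it was not seen before."""
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


def content_fingerprint(key_fields):
    """Hash the key content fields into a single data fingerprint."""
    # A separator that is unlikely to occur in field text keeps
    # ("ab", "c") and ("a", "bc") from colliding.
    joined = "\x1f".join(key_fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def store_if_new(record, key_names, seen, database):
    """Write the parsed record to the database unless its key content is known."""
    fingerprint = content_fingerprint([str(record[k]) for k in key_names])
    if not seen.add_if_new(fingerprint):
        return False  # already collected: move on to the next task
    database.append(record)
    return True
```

With a real Redis deployment, the set membership test and insertion would typically be a single set-add operation whose return value indicates whether the member was new, which keeps the check atomic across crawler workers.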

In this embodiment, the incremental crawler system based on the locality-sensitive hash algorithm defines a uniform deduplication identifier, namely the data fingerprint, and optimizes the incremental acquisition flow. On the basis of URL deduplication, it uses the locality-sensitive hash algorithm to determine whether the page content of the target website has been updated, and further determines whether the key content of the page already exists when the page is written to the storage medium. Deduplication is thus performed step by step, which reduces resource consumption, improves acquisition efficiency, and achieves the goal of incremental acquisition to the maximum extent.
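The first two stages of this step-by-step deduplication (URL fingerprint, then page-level signature comparison) might be combined as in the sketch below. Several points are assumptions rather than statements of the embodiment: MD5 as the URL hash, a plain dictionary as the signature cache, a default distance threshold of 3, and caching a signature for new pages on first acquisition. The names `url_fingerprint`, `classify_page` and `needs_parsing` are hypothetical.

```python
import hashlib


def url_fingerprint(url):
    """Hash the URL into the page identifier (MD5 is an assumed choice)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def classify_page(url, signature_cache):
    """Return 'historical' if a signature is cached for this URL, else 'new'."""
    return "historical" if url_fingerprint(url) in signature_cache else "new"


def needs_parsing(url, new_signature, signature_cache, max_distance=3):
    """Decide whether the downloaded page should be parsed.

    A historical page whose new signature lies within max_distance bits
    of the cached one is treated as unchanged and skipped. Otherwise the
    cache is updated and the page goes on to parsing.
    """
    key = url_fingerprint(url)
    old_signature = signature_cache.get(key)
    if old_signature is not None:
        distance = bin(old_signature ^ new_signature).count("1")
        if distance <= max_distance:
            return False
    signature_cache[key] = new_signature
    return True
```

Pages that pass this check would then go through the key-content fingerprint check before being written to storage, which is the third stage described above.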

The invention also provides a network data increment acquisition device, which comprises:

Further, the network data increment acquisition device further comprises:

the acquisition task generating module is used for generating an acquisition task set corresponding to the incremental acquisition page of the target website and generating a data fingerprint corresponding to the acquisition page according to a hash algorithm and the URL corresponding to the page to be acquired;

the page local sensitive hash value calculation module is used for downloading the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;

the page similarity calculation module is used for acquiring the local sensitive hash value of the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value of the page to be acquired in the last time and the local sensitive hash value corresponding to the historical page data acquisition based on a preset distance measurement algorithm;

the page data downloading module is used for downloading the target acquisition page data, namely the HTML code;

the page data analysis module is used for parsing the downloaded page data and extracting and formatting field values when the similarity is greater than a preset threshold value or the page identifier is a new page identifier, and the data fingerprint does not exist among the cached historical data fingerprints;

and the data storage module is used for storing the analyzed formatted data of the page to be acquired.

Further, the page hash value calculation module specifically includes:

the key data extraction unit is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, cutting a webpage irrelevant code, and reserving partial data of a webpage body as the historical page data to be acquired;

the data word segmentation weighting unit is used for segmenting the historical page data, extracting keywords after word segmentation and preset weights corresponding to the keywords after word segmentation, and converting the historical page data into a vector formed by a group of weighted characteristic values;

and the hash value calculation unit is used for calculating the weighted hash value corresponding to the weighted characteristic value vector based on a specific locality sensitive hash algorithm Simhash, and the weighted hash value is used as the locality sensitive hash value corresponding to the historical page data.
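The key data extraction and word segmentation weighting units can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: script and style elements are taken as the "webpage irrelevant code", tokens are split with a regular expression rather than a proper word segmenter (Chinese text would need one), and term frequency stands in for the preset keyword weights. The names `BodyTextExtractor` and `weighted_features` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside the body and outside skipped tags.
        if self._in_body and not self._skip_depth:
            self.chunks.append(data)


def weighted_features(html):
    """Return a mapping from token to weight for the page body text."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)
```

The resulting token-to-weight mapping is the kind of weighted feature vector that the hash value calculation unit would take as input.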

Further, the page data acquisition module specifically includes:

the data identification calculation unit is used for analyzing the downloaded historical page data when the similarity is larger than a preset threshold value, and calculating a hash value corresponding to the key content of the historical page analysis data based on a hash algorithm and the key content of the historical page analysis data to serve as a key content data fingerprint corresponding to the historical page data;

the data identification judging unit is used for judging whether key content data fingerprints corresponding to the historical page data exist in the cached historical data fingerprints;

page data acquisition unit for

Further, the network data increment acquisition device further comprises:

and the task set generating module is used for acquiring the acquisition page in the target website and generating a data acquisition task set based on the acquisition page, wherein the data acquisition task set comprises at least one page to be acquired.

Further, the page similarity calculation module is further configured to:

judging whether the similarity is greater than a preset threshold value or not;

Further, the page similarity calculation module is further configured to:

Further, the page identifier determining module is further configured to:

The method executed by each program module can refer to each embodiment of the network data increment acquisition method of the invention, and is not described herein again.

The invention also provides a computer readable storage medium.

The computer readable storage medium of the present invention stores thereon a network data incremental acquisition program, which when executed by a processor implements the steps of the network data incremental acquisition method as described above.

The method implemented when the network data increment acquisition program running on the processor is executed may refer to each embodiment of the network data increment acquisition method of the present invention, and details are not described here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A network data increment acquisition method is characterized by comprising the following steps:

2. The method for incrementally acquiring network data according to claim 1, wherein the step of acquiring the historical page data to be acquired and calculating the locality-sensitive hash value corresponding to the historical page data according to a specific locality-sensitive hash algorithm Simhash if the page identifier of the page to be acquired is the historical page identifier specifically comprises:

3. The method for incrementally acquiring network data according to claim 2, wherein the step of updating the cached locality-sensitive hash value, parsing the historical page data, and storing the historical page data in the data acquisition library when the similarity is greater than the preset threshold specifically comprises:

4. The method for incrementally acquiring network data according to claim 1, wherein before the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further comprises:

5. The method according to claim 4, wherein after the steps of obtaining the locally sensitive hash value of the page to be acquired in the cache data, which was acquired last time, and calculating the similarity between the locally sensitive hash value of the page to be acquired last time and the locally sensitive hash value corresponding to the historical page data acquired this time based on a preset distance measurement algorithm, the method further comprises:

judging whether the similarity is greater than a preset threshold value or not;

6. The method for incrementally acquiring network data as recited in claim 1, wherein after updating the cached locality-sensitive hash value and parsing the historical page data and storing the historical page data in a data acquisition repository when the similarity is greater than a preset threshold in the step, the method further comprises:

7. The method for incrementally acquiring network data according to any one of claims 1 to 6, wherein after the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further comprises:

8. A network data increment acquisition device is characterized by comprising:

9. A network data incremental acquisition device, wherein the network data incremental acquisition device comprises: a memory, a processor and a network data incremental acquisition program stored on the memory and executable on the processor, the network data incremental acquisition program when executed by the processor implementing the steps of the network data incremental acquisition method as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium, on which a network data incremental acquisition program is stored, which when executed by a processor implements the steps of the network data incremental acquisition method according to any one of claims 1 to 7.