CN111444411A - Network data increment acquisition method, device, equipment and storage medium - Google Patents
- Publication number
- CN111444411A, CN202010242238.0A
- Authority
- CN
- China
- Prior art keywords
- page
- data
- historical
- acquired
- hash value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a network data increment acquisition method, device, equipment and storage medium. The method decides whether to collect a page by judging whether the content of the page to be acquired has been updated. Specifically, according to the data updating mode, page identifiers are divided into new page identifiers and historical page identifiers. If the page identifier is a historical page identifier, the page data to be collected is downloaded as historical page data, and a locality-sensitive hash value corresponding to the historical page data is calculated according to the Simhash algorithm. The locality-sensitive hash value obtained the last time the page was acquired is loaded from the cache data, and the similarity between that value and the locality-sensitive hash value obtained this time is calculated based on a preset distance measurement algorithm. When the similarity is greater than the preset threshold value, the cached locality-sensitive hash value is updated, and the historical page data is further analyzed and stored in the data acquisition database. This reduces the resource consumption of incremental data acquisition, improves the efficiency of incremental acquisition of network data, and achieves the goal of incremental acquisition to the greatest extent.
Description
Technical Field
The invention relates to the technical field of financial technology (Fintech), in particular to a network data increment acquisition method, a device, equipment and a computer readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied to the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Incremental acquisition of network data is no exception, but the security and real-time requirements of the financial industry place higher demands on it. At present, incremental data acquisition mainly takes three forms: periodically acquiring updated website data with deduplication based on the page link URL; periodically acquiring updated website data with deduplication based on page content; and directly acquiring the website data in full. However, the first mode cannot identify updates on a website whose page content has changed while the URL has not, so acquired data is easily missed. The second mode is too sensitive to website updates and requires a large amount of calculation. The third mode must acquire all page data, so data acquisition efficiency is low.
Disclosure of Invention
The main purpose of the invention is to provide a network data increment acquisition method, device, equipment and computer readable storage medium, aiming to solve the technical problems of low acquisition efficiency and low accuracy in existing incremental data acquisition methods.
In order to achieve the above object, the present invention provides a network data increment acquisition method, which comprises the following steps:
acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier;
if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;
acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm;
and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library.
Optionally, the step of acquiring the historical page data to be collected if the page identifier of the page to be collected is the historical page identifier, and calculating a locality-sensitive hash value corresponding to the historical page data according to the specific locality-sensitive hash algorithm Simhash, specifically includes:
if the page identifier of the page to be acquired is the historical page identifier, acquiring historical page data to be acquired, cutting a webpage irrelevant code, and reserving partial data of a webpage body as the historical page data to be acquired;
segmenting the historical page data, extracting keywords after segmentation and preset weights corresponding to the keywords after segmentation, and converting the historical page data into a vector formed by a group of weighted characteristic values;
and calculating a weighted hash value corresponding to the weighted characteristic value vector based on a specific locality sensitive hash algorithm Simhash as a locality sensitive hash value corresponding to the historical page data.
Optionally, the step of updating the cached locality-sensitive hash value, analyzing the historical page data, and storing the historical page data in the data collection library when the similarity is greater than the preset threshold specifically includes:
when the similarity is larger than a preset threshold value, analyzing the downloaded historical page data, and calculating a hash value corresponding to the key content of the historical page analysis data based on a hash algorithm and the key content of the historical page analysis data to serve as a key content data fingerprint corresponding to the historical page data;
judging whether key content data fingerprints corresponding to the historical page data exist in the cached historical data fingerprints;
and if the key content data fingerprints corresponding to the historical page data do not exist in the historical data fingerprints, storing the historical page analysis data to the data acquisition base.
Optionally, before the steps of obtaining a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further includes:
acquiring an acquisition page in the target website, and generating a data acquisition task set based on the acquisition page, wherein the data acquisition task set comprises at least one page to be acquired.
Optionally, after the steps of obtaining the locally sensitive hash value of the page to be acquired last time in the cache data, and calculating the similarity between the locally sensitive hash value of the page to be acquired last time and the locally sensitive hash value corresponding to the historical page data this time based on a preset distance measurement algorithm, the method further includes:
judging whether the similarity is greater than a preset threshold value or not;
and when the similarity is smaller than a preset threshold value, judging that the page data of the page to be collected is historical collected data, stopping analysis, and acquiring the next collection task in the data collection task set for data collection.
Optionally, when the similarity is greater than a preset threshold, the method further includes, after updating the cached locality-sensitive hash value, analyzing the historical page data and storing the historical page data in a data collection library:
and, in the cache data, replacing the locality-sensitive hash value from the previous acquisition of the page to be acquired with the locality-sensitive hash value corresponding to the historical page data acquired this time.
Optionally, after the steps of obtaining a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further includes:
if the page identification of the page to be acquired is a new page identification, generating a new page data fingerprint of the new page, and judging whether the new page data fingerprint exists in preset historical acquired data fingerprints;
and if the new page data fingerprint does not exist in the historical acquisition data fingerprint, writing the new page data fingerprint into the historical data fingerprint, downloading and analyzing the new page data and storing the new page data into the data acquisition library.
In addition, in order to achieve the above object, the present invention further provides a network data increment acquisition device, including:
the page identification judging module is used for acquiring a page to be acquired in a target website, generating a page identification of the page to be acquired and judging whether the page identification of the page to be acquired is a new page identification or a historical page identification;
the page hash value calculation module is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;
the page similarity calculation module is used for acquiring the locality sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the locality sensitive hash value acquired last time on the historical page to be acquired and the locality sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm;
and the page data analysis module is used for updating the cached local sensitive hash value when the similarity is greater than a preset threshold value, analyzing the historical page data and storing the historical page data in a data acquisition library.
In addition, to achieve the above object, the present invention further provides a network data incremental acquisition device, where the network data incremental acquisition device includes: the system comprises a memory, a processor and a network data increment acquisition program which is stored on the memory and can run on the processor, wherein the network data increment acquisition program realizes the steps of the network data increment acquisition method when being executed by the processor.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, on which a network data incremental acquisition program is stored, and the network data incremental acquisition program, when executed by a processor, implements the steps of the network data incremental acquisition method as described above.
The invention provides a network data increment acquisition method, which comprises the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier; if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash; acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm; and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library. 
In this way, when the page to be acquired is determined to be a historical page, the invention calculates the locality-sensitive hash value corresponding to the current acquisition of the page based on a specific locality-sensitive hash algorithm. It then calculates the similarity between the page data of the two acquisitions from this value and the locality-sensitive hash value cached from the previous acquisition of the page, thereby determining whether the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of data calculation and the resource consumption of incremental data acquisition, improves the accuracy and efficiency of incremental data acquisition, and solves the technical problems of low acquisition efficiency and low accuracy in existing incremental data acquisition methods.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a network data incremental acquisition method according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The network data increment acquisition equipment of the embodiment of the invention can be a PC (personal computer) or server equipment, and a Java virtual machine runs on the network data increment acquisition equipment.
As shown in fig. 1, the network data incremental acquisition device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a network data incremental acquisition program therein.
In the device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the network data incremental collecting program stored in the memory 1005 and perform the following operations in the network data incremental collecting method.
Based on the hardware structure, the embodiment of the network data increment acquisition method is provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a network data increment acquisition method according to the present invention, where the network data increment acquisition method includes:
step S10, acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier;
At present, incremental acquisition from a periodically collected target website is realized by three methods:
The first method is deduplication based on the page identifier (URL) of newly added pages. Before data acquisition, it is judged whether the URL of the currently accessed page belongs to a page whose data has already been acquired. If the URL is among the historically acquired page URLs and the forced update period has not been exceeded, acquisition of the page data is stopped; if the forced update period has been exceeded, the website data is considered updated and data acquisition is performed.
The second method is deduplication based on web page content. After comparing the data streams returned by the target website server, or after parsing the content, it is determined whether the web page content has already been collected. This determination is generally made by calculating and comparing hash values of the web page content; if the content has not been collected, it is written to storage. However, the hash values calculated by this method differ greatly even when the page data has changed very little. The method is therefore too sensitive to website updates, and for large text content the algorithm is time-consuming and has low accuracy.
The third method is full acquisition: all newly added page data is acquired and written to a storage medium, and whether the acquired data already exists in the medium is judged only when the data is written. This method acquires a large amount of invalid data because many of the acquired pages have not actually been updated. Acquisition efficiency is low, many acquisition resources are wasted, much redundant data is generated, and the target website is accessed too frequently.
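The sensitivity attributed to the second method can be observed with any conventional digest. The following is a minimal sketch, assuming MD5 as the content hash (the source does not name the hash used by existing methods); the sample strings and the helper names `md5_bits` and `differing_bits` are hypothetical.

```python
import hashlib


def md5_bits(text: str) -> int:
    """Return the 128-bit MD5 digest of the text as an integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which two digests differ."""
    return bin(a ^ b).count("1")


# Two hypothetical page bodies that differ by a single character.
old_page = "Quarterly report: revenue rose 5 percent."
new_page = "Quarterly report: revenue rose 5 percent!"

diff = differing_bits(md5_bits(old_page), md5_bits(new_page))
print(f"{diff} of 128 bits differ after a one-character edit")
```

Because a conventional digest scatters even a tiny edit across the whole value, the number of differing bits says nothing about how much the page changed, which is the weakness that a locality-sensitive hash is meant to avoid.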
To solve the above problems, in the present invention, when the page to be acquired is determined to be a historical page, a locality-sensitive hash value corresponding to the updated data of the page is calculated based on a specific locality-sensitive hash algorithm. Then, based on this value and on the locality-sensitive hash value cached from the previous acquisition of the page (before the update), the similarity between the page data before and after the update is calculated, thereby determining whether the page data of the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of data calculation and the resource consumption of incremental data acquisition, and improves both the accuracy and the efficiency of incremental data acquisition.
Further, before the step S10, the method further includes:
acquiring an acquisition page in the target website, and generating a data acquisition task set based on the acquisition page, wherein the data acquisition task set comprises at least one page to be acquired.
In this embodiment, the pages newly added to the target website within each time interval are obtained according to a preset time interval, and the page URL corresponding to each new page is added to a preset list to generate a data acquisition task set containing at least one page to be acquired. Each page URL in the data acquisition task set is then obtained in turn, and the data acquisition operation is performed on each page to be acquired in sequence until all pages in the task set have been processed.
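The task-set handling described above might be sketched as follows. This is an illustration only: the function names, the use of a `deque` as the preset list, and the example URLs are assumptions rather than details given in the source.

```python
from collections import deque


def build_task_set(new_page_urls):
    """Put the page URLs found in one time interval into an ordered task queue."""
    return deque(new_page_urls)


def run_tasks(tasks, collect):
    """Take each page URL in turn and run the collection operation on it
    until the task set is empty. Returns the URLs in processing order."""
    processed = []
    while tasks:
        url = tasks.popleft()
        collect(url)  # hypothetical per-page collection callback
        processed.append(url)
    return processed


# Hypothetical usage with placeholder URLs and a callback that only records calls.
visited = []
tasks = build_task_set(["https://example.com/a", "https://example.com/b"])
order = run_tasks(tasks, visited.append)
```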
Further, after the step S10, the method further includes:
if the page identification of the page to be acquired is a new page identification, generating a new page data fingerprint of the new page, and judging whether the new page data fingerprint exists in preset historical acquired data fingerprints;
and if the new page data fingerprint does not exist in the historical acquisition data fingerprint, writing the new page data fingerprint into the historical data fingerprint, downloading and analyzing the new page data and storing the new page data into the data acquisition library.
In this embodiment, if the page URL of the page to be acquired is a new page URL, the page to be acquired is identified as a newly generated page. A hash algorithm is called to calculate the hash value corresponding to the URL of the page to be acquired, and this hash value is set as the page data identifier of the page, such as a data ID or data fingerprint, which serves as a uniform way of identifying different data contents. The historical data identifiers corresponding to the target website are then obtained from the data acquisition library and added to a Redis set, and the page data identifier is compared against the Redis set.
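The new-page check can be illustrated with a short sketch. It assumes MD5 as the hash algorithm (the source only says "a hash algorithm") and uses an in-memory Python `set` as a stand-in for the Redis set; the function names and URL are hypothetical.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Hash the page URL to obtain a fixed-length page data identifier."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def register_if_new(url: str, history: set) -> bool:
    """Return True and record the fingerprint if it is not yet in the
    historical fingerprint set; return False if it was seen before."""
    fingerprint = url_fingerprint(url)
    if fingerprint in history:
        return False
    history.add(fingerprint)
    return True


# The set stands in for the Redis set of historical identifiers.
history = set()
first = register_if_new("https://example.com/news/1", history)
second = register_if_new("https://example.com/news/1", history)
```

In a deployment backed by Redis, the membership test and the insertion would be issued against the Redis set instead of the local `set`.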
Step S20, if the page identifier of the page to be collected is the historical page identifier, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash;
In this embodiment, if the page identifier of the page to be acquired is a historical page identifier, that is, the page to be acquired is an old page, it is further judged whether the page is an old page that has been updated. The historical page data of the page to be acquired is obtained, and the locality-sensitive hash value corresponding to the historical page data is then calculated based on the specific locality-sensitive hash algorithm Simhash. Locality-sensitive hashing (Locality-Sensitive Hashing, LSH) is used to solve the nearest-neighbor search problem for massive data in high-dimensional space.
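A Simhash value can be computed along the following lines. This is a minimal sketch of the commonly described Simhash procedure, not the applicant's implementation: the 64-bit width, the use of truncated MD5 as the per-feature hash, and the sample keyword weights are all assumptions.

```python
import hashlib


def feature_hash(feature: str, bits: int = 64) -> int:
    """Hash one keyword to a fixed-width integer (truncated MD5, an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_features: dict, bits: int = 64) -> int:
    """Combine weighted keyword hashes into one locality-sensitive fingerprint."""
    totals = [0] * bits
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A 1 bit votes +weight for position i, a 0 bit votes -weight.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # keep the bit only where the weighted vote is positive
            fingerprint |= 1 << i
    return fingerprint


# Hypothetical keyword weights for two versions of the same page.
before = simhash({"interest": 3, "rate": 3, "bank": 2, "policy": 1})
after = simhash({"interest": 3, "rate": 3, "bank": 2, "notice": 1})
```

Because every keyword only contributes votes to each bit position, replacing one low-weight keyword tends to flip few bits, whereas a conventional digest of the whole text would change completely.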
Step S30, obtaining the local sensitive hash value of the last time of the page to be collected in the cache data, and calculating the similarity between the local sensitive hash value of the last time of the historical page to be collected and the local sensitive hash value corresponding to the historical page data to be collected at this time based on a preset distance measurement algorithm;
In this embodiment, the locality-sensitive hash value from the previous acquisition of the page to be acquired, stored in the cache data in advance, is obtained. Based on a preset distance measurement algorithm, such as the Hamming distance, the Euclidean distance, the Minkowski distance, or the cosine distance, the distance between the locality-sensitive hash value corresponding to the historical page data and the value from the previous acquisition is calculated, so as to compare the similarity of the page data before and after the update. That is, similarity is judged from a distance between points: points that are close together are considered similar.
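Of the distance measures listed, the Hamming distance is the simplest to apply to two integer fingerprints. A minimal sketch, with a hypothetical function name:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# 0b1011 and 0b1001 differ only in the second-lowest bit.
print(hamming_distance(0b1011, 0b1001))  # prints 1
```

Identical fingerprints give a distance of 0, and the distance grows with the number of differing bits, so a small value indicates that the two page versions are close.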
Further, after the step S30, the method further includes:
judging whether the similarity is greater than a preset threshold value or not;
and when the similarity is smaller than a preset threshold value, judging that the page data of the page to be collected is historical collected data, stopping analysis, and acquiring the next collection task in the data collection task set for data collection.
In this embodiment, the similarity value (the distance calculated in step S30) is compared with the preset threshold. If it is smaller than the preset threshold, the historical page data is close to the page data last acquired from the page to be acquired, that is, the difference between the page data before and after the update is small. The page data of the page to be acquired is then judged to be already-acquired data.
And step S40, when the similarity is larger than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library.
In this embodiment, if the similarity value (the distance calculated in step S30) is greater than the preset threshold, the historical page data differs substantially from the page data last acquired from the page to be acquired, that is, the difference between the page data before and after the update is large. The historical page data is then collected and stored in the data collection library. This embodiment provides a method that identifies increments step by step through classification, achieving deduplication to the greatest extent while reducing resource consumption and redundancy, and completing the acquisition of updated data from both new and old pages of the target site. The locality-sensitive hash algorithm Simhash is introduced to identify whether an old page of the target site has been updated, which solves the problems that a traditional hash is too sensitive to changes in a website page and that, for large texts, the algorithm is time-consuming and has low accuracy.
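The decision made in steps S30 and S40 might be sketched as follows, assuming the value compared with the threshold is a Hamming distance between the cached and the current fingerprints. The function names, the dictionary used as the cache, the threshold of 3, and the handling of a missing cache entry or of a distance exactly equal to the threshold are all assumptions that the source does not specify.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def handle_history_page(url, new_hash, hash_cache, threshold, store):
    """Return True if the page is treated as updated and passed to `store`."""
    old_hash = hash_cache.get(url)
    # Assumption: a distance at or below the threshold counts as unchanged.
    if old_hash is not None and hamming_distance(old_hash, new_hash) <= threshold:
        return False  # treated as already collected; move to the next task
    # Assumption: a page with no cached fingerprint is treated as updated.
    hash_cache[url] = new_hash  # refresh the cached fingerprint
    store(url)                  # hypothetical parse-and-store callback
    return True


# Hypothetical usage: one minor change, then one substantial change.
cache = {"https://example.com/p": 0b00000}
stored = []
minor = handle_history_page("https://example.com/p", 0b00001, cache, 3, stored.append)
major = handle_history_page("https://example.com/p", 0b11111, cache, 3, stored.append)
```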
The embodiment provides a network data increment acquisition method, which includes acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier; if the page identification of the page to be collected is the historical page identification, acquiring the historical page data to be collected, and calculating a local sensitive hash value corresponding to the historical page data according to a specific local sensitive hash algorithm Simhash; acquiring a local sensitive hash value acquired last time on the page to be acquired in the cache data, and calculating the similarity between the local sensitive hash value acquired last time on the historical page to be acquired and a local sensitive hash value corresponding to the historical page data to be acquired this time based on a preset distance measurement algorithm; and when the similarity is greater than a preset threshold value, updating the cached local sensitive hash value, analyzing the historical page data and storing the historical page data in a data acquisition library. 
In this way, when the page to be acquired is determined to be a historical page, the invention calculates the locality-sensitive hash value corresponding to the current acquisition of the page based on a specific locality-sensitive hash algorithm. It then calculates the similarity between the page data of the two acquisitions from this value and the locality-sensitive hash value cached from the previous acquisition of the page, thereby determining whether the historical page has been updated. This solves the problem that existing hash algorithms are too sensitive to changes in page data, reduces the amount of data calculation and the resource consumption of incremental data acquisition, improves the accuracy and efficiency of incremental data acquisition, and solves the technical problems of low acquisition efficiency and low accuracy in existing incremental data acquisition methods.
Further, based on the first embodiment of the network data incremental acquisition method of the present invention, a second embodiment of the network data incremental acquisition method of the present invention is provided.
In this embodiment, the step S20 specifically includes:
if the page identifier of the page to be acquired is the historical page identifier, acquiring the historical page data to be acquired, clipping code that is irrelevant to the web page content, and retaining only the data of the web page body as the historical page data to be acquired;
segmenting the historical page data into words, extracting the keywords obtained by segmentation together with the preset weight corresponding to each keyword, and converting the historical page data into a vector formed by a group of weighted feature values;
and calculating, based on the Simhash locality-sensitive hash algorithm, a weighted hash value corresponding to the weighted feature value vector, and using it as the locality-sensitive hash value corresponding to the historical page data.
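The weighted hashing step above can be sketched with the commonly described form of Simhash. This is a minimal illustration, not the patent's implementation: the 64-bit width, the use of MD5 as the per-keyword hash, and the names `feature_hash` and `simhash` are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Tuple

HASH_BITS = 64  # assumed fingerprint width; the source does not fix one


def feature_hash(feature: str, bits: int = HASH_BITS) -> int:
    """Hash one keyword to a `bits`-wide integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_features: Iterable[Tuple[str, float]],
            bits: int = HASH_BITS) -> int:
    """Combine (keyword, weight) pairs into one Simhash fingerprint."""
    totals = [0.0] * bits
    for feature, weight in weighted_features:
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive column sum becomes a 1 bit
            fingerprint |= 1 << i
    return fingerprint
```

Because each bit is decided by a weighted vote over all keywords, a small edit to the page flips only a few bits, which is the property the method relies on when comparing two acquisitions of the same page.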
The step of updating the cached locality-sensitive hash value, parsing the historical page data, and storing it in the data acquisition library when the similarity is greater than the preset threshold specifically includes:
when the similarity is greater than the preset threshold, parsing the downloaded historical page data, and applying a hash algorithm to the key content of the parsed historical page data to obtain a hash value that serves as the key content data fingerprint corresponding to the historical page data;
determining whether the key content data fingerprint corresponding to the historical page data exists among the cached historical data fingerprints;
and if the key content data fingerprint corresponding to the historical page data does not exist among the historical data fingerprints, storing the parsed historical page data in the data acquisition library.
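The fingerprint check in these steps might look like the following sketch. It makes several assumptions not stated in the source: the key content is taken to be the hypothetical fields `title` and `body`, SHA-1 stands in for the unspecified hash algorithm, and an in-memory `set` and `list` stand in for the fingerprint cache and the data acquisition library.

```python
import hashlib
from typing import Dict, List, Sequence, Set


def content_fingerprint(parsed: Dict[str, str],
                        key_fields: Sequence[str] = ("title", "body")) -> str:
    """Hash the key content of a parsed page into a data fingerprint."""
    # Join the fields with a separator that is unlikely to occur in page text.
    material = "\x1f".join(parsed.get(field, "") for field in key_fields)
    return hashlib.sha1(material.encode("utf-8")).hexdigest()


def store_if_new(parsed: Dict[str, str],
                 seen_fingerprints: Set[str],
                 repository: List[Dict[str, str]]) -> bool:
    """Store the parsed page only if its key content was not seen before."""
    fingerprint = content_fingerprint(parsed)
    if fingerprint in seen_fingerprints:
        return False  # already collected, nothing to store
    seen_fingerprints.add(fingerprint)
    repository.append(parsed)
    return True
```

Hashing only the key fields means that changes outside them, such as a timestamp in a footer, do not cause the same record to be stored twice.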
In this embodiment, if the target page of an acquisition task is an updated old page, the page data is first downloaded, and content unrelated to the page body is clipped from the page HTML, for example data relating to page styles, so that only the data marked as the body part is retained; this yields the historical page data. A Simhash value, that is, the locality-sensitive hash value corresponding to the historical page data, is then calculated using the Simhash locality-sensitive hash algorithm. A Hamming distance algorithm is used to calculate the similarity between this value and the locality-sensitive hash value from the last acquisition of the page, that is, the similarity between the new and old versions of the page. The locality-sensitive hash value from the last acquisition is cached in advance, before the page is updated. The Simhash calculation proceeds as follows: the historical page data is segmented into words; keyword features (feature_1 to feature_n) and their corresponding weights are extracted from the segmented data, giving a set of weighted feature values; and a weighted hash value is calculated from that set and used as the locality-sensitive hash value corresponding to the historical page data.
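The Hamming distance comparison between the cached and current hash values can be sketched as follows. The function names and the default threshold of 3 bits are assumptions for illustration; the source only says the threshold is preset.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def page_changed(previous_hash: int, current_hash: int, threshold: int = 3) -> bool:
    """Treat the page as updated when the distance exceeds the threshold."""
    return hamming_distance(previous_hash, current_hash) > threshold
```

A distance of zero means the two fingerprints are identical, and larger values indicate that more of the page content has changed.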
In this embodiment, acquisition is driven by determining whether the content of the page to be acquired has been updated, and pages are divided into new page identifiers and historical page identifiers according to the data update mode. For a historical page identifier, the page data to be acquired is downloaded and its locality-sensitive hash value is calculated using the Simhash algorithm. The locality-sensitive hash value from the last acquisition of the page is loaded from the cache data, and the similarity between the previous value and the current value is calculated using a preset distance measurement algorithm. When the similarity is greater than the preset threshold, the cached locality-sensitive hash value is updated, and the historical page data is parsed and stored in the data acquisition library. This reduces the resource consumption of incremental data acquisition, improves the efficiency of incremental acquisition of network data, and achieves the goal of incremental acquisition to the greatest possible extent.
Further, after the step S40, the method further includes:
updating, in the cache data, the locality-sensitive hash value from the last acquisition of the page to be acquired to the locality-sensitive hash value corresponding to the historical page data from the current acquisition.
In this embodiment, if the two locality-sensitive hash values are similar, the next acquisition task is performed. Otherwise, the page data is parsed and the Simhash value stored for the page in the cache data is replaced, that is, the locality-sensitive hash value from the last acquisition is replaced with the one corresponding to the current historical page data. The key content of the parsed data is then extracted and hashed to generate a data fingerprint, and the fingerprint is written to Redis to determine whether the content has already been collected. If it has, the next acquisition task is performed; otherwise, the parsed data is written to the database, completing one acquisition task. When every task in the acquisition task set has finished, the incremental acquisition is complete and the system waits for the next scheduling.
In this embodiment, the incremental crawler system based on the locality-sensitive hash algorithm uses a uniform deduplication identifier, namely the data fingerprint, and optimizes the incremental acquisition flow. On top of URL deduplication, it uses the locality-sensitive hash algorithm to determine whether the page content of the target website has been updated, and then, when writing to the storage medium, determines whether the key content of the page already exists. Performing deduplication step by step in this way reduces resource consumption, improves acquisition efficiency, and achieves the goal of incremental acquisition to the greatest possible extent.
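The per-task decision described in this paragraph can be sketched as below. The sketch assumes the similarity is a Hamming distance, so a value above the threshold means the page changed, which matches the rule that similar hash values skip to the next task. The dictionary cache, the `parse_and_store` callback, the threshold of 3, and the treatment of a missing cache entry as "changed" are all illustrative assumptions.

```python
from typing import Callable, Dict


def handle_historical_page(url: str,
                           current_hash: int,
                           hash_cache: Dict[str, int],
                           parse_and_store: Callable[[str], None],
                           threshold: int = 3) -> bool:
    """Process one historical page; return True if it was parsed and stored."""
    previous_hash = hash_cache.get(url)
    if previous_hash is not None:
        distance = bin(previous_hash ^ current_hash).count("1")
        if distance <= threshold:
            return False  # hashes are similar: skip to the next task
    # Page changed (or no cached hash exists, an assumption of this sketch):
    # replace the cached hash with the current one, then parse and store.
    hash_cache[url] = current_hash
    parse_and_store(url)
    return True
```

A scheduler would call this once per task in the acquisition task set and wait for the next scheduling round when the set is exhausted.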
The invention also provides a network data increment acquisition device, which comprises:
the page identification judging module is used for acquiring a page to be acquired from a target website, generating a page identifier for the page to be acquired, and determining whether that page identifier is a new page identifier or a historical page identifier;
the page hash value calculation module is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating the locality-sensitive hash value corresponding to the historical page data using the Simhash locality-sensitive hash algorithm;
the page similarity calculation module is used for reading from the cache data the locality-sensitive hash value from the last acquisition of the page to be acquired, and calculating, based on a preset distance measurement algorithm, the similarity between that value and the locality-sensitive hash value corresponding to the historical page data from the current acquisition;
and the page data analysis module is used for updating the cached locality-sensitive hash value when the similarity is greater than a preset threshold, parsing the historical page data, and storing it in the data acquisition library.
Further, the network data increment acquisition device further comprises:
the acquisition task generating module is used for generating an acquisition task set corresponding to the incremental acquisition pages of the target website, and generating a data fingerprint for each acquisition page from a hash algorithm and the URL of the page to be acquired;
the page identification judging module is used for acquiring a page to be acquired from a target website, generating a page identifier for the page to be acquired, and determining whether that page identifier is a new page identifier or a historical page identifier;
the page locality-sensitive hash value calculation module is used for downloading the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating the locality-sensitive hash value corresponding to the historical page data using the Simhash locality-sensitive hash algorithm;
the page similarity calculation module is used for reading from the cache data the locality-sensitive hash value from the last acquisition of the page to be acquired, and calculating, based on a preset distance measurement algorithm, the similarity between that value and the locality-sensitive hash value corresponding to the historical page data from the current acquisition;
the page data downloading module is used for downloading the data of the target acquisition page, namely its HTML code;
the page data analysis module is used for parsing the downloaded page data and extracting and formatting field values when the similarity is greater than a preset threshold, or when the page identifier is a new page identifier and its data fingerprint is not among the cached historical data fingerprints;
and the data storage module is used for storing the parsed and formatted data of the page to be acquired.
Further, the page hash value calculation module specifically includes:
the key data extraction unit is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, clipping code that is irrelevant to the web page content, and retaining only the data of the web page body as the historical page data to be acquired;
the data word segmentation weighting unit is used for segmenting the historical page data into words, extracting the keywords obtained by segmentation together with the preset weight corresponding to each keyword, and converting the historical page data into a vector formed by a group of weighted feature values;
and the hash value calculation unit is used for calculating, based on the Simhash locality-sensitive hash algorithm, the weighted hash value corresponding to the weighted feature value vector, which is used as the locality-sensitive hash value corresponding to the historical page data.
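The work of the first two units, body extraction and weighted keyword extraction, can be sketched with the Python standard library. This is only an illustration under stated assumptions: `BodyTextExtractor`, `body_text`, and `weighted_features` are hypothetical names, a regular-expression tokenizer stands in for a real word segmenter, and term frequency stands in for the preset keyword weights mentioned in the text.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import List, Tuple


class BodyTextExtractor(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def body_text(html: str) -> str:
    """Return the body text of an HTML page with irrelevant code clipped."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def weighted_features(text: str) -> List[Tuple[str, float]]:
    """Turn text into (keyword, weight) pairs; weight = term frequency here."""
    tokens = re.findall(r"\w+", text.lower())
    return [(token, float(count)) for token, count in Counter(tokens).items()]
```

For Chinese pages a dedicated word segmenter would replace the regular expression, since whitespace does not separate words there.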
Further, the page data acquisition module specifically includes:
the data identification calculation unit is used for parsing the downloaded historical page data when the similarity is greater than a preset threshold, and applying a hash algorithm to the key content of the parsed historical page data to obtain a hash value that serves as the key content data fingerprint corresponding to the historical page data;
the data identification judging unit is used for determining whether the key content data fingerprint corresponding to the historical page data exists among the cached historical data fingerprints;
and the page data acquisition unit is used for storing the parsed historical page data in the data acquisition library if the key content data fingerprint corresponding to the historical page data does not exist among the historical data fingerprints.
Further, the network data increment acquisition device further comprises:
the task set generating module is used for acquiring the acquisition pages in the target website and generating a data acquisition task set based on those pages, wherein the data acquisition task set comprises at least one page to be acquired.
Further, the page similarity calculation module is further configured to:
determining whether the similarity is greater than a preset threshold;
and when the similarity is less than the preset threshold, determining that the page data of the page to be acquired has already been collected, stopping parsing, and taking the next acquisition task from the data acquisition task set for data acquisition.
Further, the page similarity calculation module is further configured to:
updating, in the cache data, the locality-sensitive hash value from the last acquisition of the page to be acquired to the locality-sensitive hash value corresponding to the historical page data from the current acquisition.
Further, the page identifier determining module is further configured to:
if the page identifier of the page to be acquired is a new page identifier, generating a new page data fingerprint for the new page, and determining whether the new page data fingerprint exists among the preset historical acquisition data fingerprints;
and if the new page data fingerprint does not exist among the historical acquisition data fingerprints, writing the new page data fingerprint into the historical data fingerprints, downloading and parsing the new page data, and storing it in the data acquisition library.
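The new-page branch above can be sketched as follows. The text says elsewhere that the fingerprint is generated from a hash algorithm and the page URL; the choice of MD5, the in-memory `set` standing in for the historical fingerprint store, and the `download_and_store` callback are assumptions for illustration.

```python
import hashlib
from typing import Callable, Set


def url_fingerprint(url: str) -> str:
    """Derive a data fingerprint for a page from its URL."""
    return hashlib.md5(url.strip().encode("utf-8")).hexdigest()


def collect_new_page(url: str,
                     history: Set[str],
                     download_and_store: Callable[[str], None]) -> bool:
    """Collect a new page unless its fingerprint is already in the history."""
    fingerprint = url_fingerprint(url)
    if fingerprint in history:
        return False  # already collected earlier
    history.add(fingerprint)  # record the fingerprint before collecting
    download_and_store(url)
    return True
```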
The method executed by each program module can refer to each embodiment of the network data increment acquisition method of the invention, and is not described herein again.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention stores thereon a network data incremental acquisition program, which when executed by a processor implements the steps of the network data incremental acquisition method as described above.
The method implemented when the network data increment acquisition program running on the processor is executed may refer to each embodiment of the network data increment acquisition method of the present invention, and details are not described here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A network data increment acquisition method is characterized by comprising the following steps:
acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and judging whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier;
if the page identifier of the page to be acquired is the historical page identifier, acquiring the historical page data to be acquired, and calculating the locality-sensitive hash value corresponding to the historical page data using the Simhash locality-sensitive hash algorithm;
reading from the cache data the locality-sensitive hash value from the last acquisition of the page to be acquired, and calculating, based on a preset distance measurement algorithm, the similarity between that value and the locality-sensitive hash value corresponding to the historical page data from the current acquisition;
and when the similarity is greater than a preset threshold, updating the cached locality-sensitive hash value, parsing the historical page data, and storing it in a data acquisition library.
2. The method for incrementally acquiring network data according to claim 1, wherein the step of acquiring the historical page data to be acquired and calculating the locality-sensitive hash value corresponding to the historical page data according to a specific locality-sensitive hash algorithm Simhash if the page identifier of the page to be acquired is the historical page identifier specifically comprises:
if the page identifier of the page to be acquired is the historical page identifier, acquiring the historical page data to be acquired, clipping code that is irrelevant to the web page content, and retaining only the data of the web page body as the historical page data to be acquired;
segmenting the historical page data into words, extracting the keywords obtained by segmentation together with the preset weight corresponding to each keyword, and converting the historical page data into a vector formed by a group of weighted feature values;
and calculating, based on the Simhash locality-sensitive hash algorithm, a weighted hash value corresponding to the weighted feature value vector, and using it as the locality-sensitive hash value corresponding to the historical page data.
3. The method for incrementally acquiring network data according to claim 2, wherein the step of updating the cached locality-sensitive hash value, parsing the historical page data, and storing the historical page data in the data acquisition library when the similarity is greater than the preset threshold specifically comprises:
when the similarity is greater than the preset threshold, parsing the downloaded historical page data, and applying a hash algorithm to the key content of the parsed historical page data to obtain a hash value that serves as the key content data fingerprint corresponding to the historical page data;
determining whether the key content data fingerprint corresponding to the historical page data exists among the cached historical data fingerprints;
and if the key content data fingerprint corresponding to the historical page data does not exist among the historical data fingerprints, storing the parsed historical page data in the data acquisition library.
4. The method for incrementally acquiring network data according to claim 1, wherein before the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further comprises:
acquiring an acquisition page in the target website, and generating a data acquisition task set based on the acquisition page, wherein the data acquisition task set comprises at least one page to be acquired.
5. The method according to claim 4, wherein after the steps of obtaining the locally sensitive hash value of the page to be acquired in the cache data, which was acquired last time, and calculating the similarity between the locally sensitive hash value of the page to be acquired last time and the locally sensitive hash value corresponding to the historical page data acquired this time based on a preset distance measurement algorithm, the method further comprises:
determining whether the similarity is greater than a preset threshold;
and when the similarity is less than the preset threshold, determining that the page data of the page to be acquired has already been collected, stopping parsing, and taking the next acquisition task from the data acquisition task set for data acquisition.
6. The method for incrementally acquiring network data as recited in claim 1, wherein after the step of updating the cached locality-sensitive hash value, parsing the historical page data, and storing the historical page data in a data acquisition library when the similarity is greater than a preset threshold, the method further comprises:
updating, in the cache data, the locality-sensitive hash value from the last acquisition of the page to be acquired to the locality-sensitive hash value corresponding to the historical page data from the current acquisition.
7. The method for incrementally acquiring network data according to any one of claims 1 to 6, wherein after the steps of acquiring a page to be acquired in a target website, generating a page identifier of the page to be acquired, and determining whether the page identifier of the page to be acquired is a new page identifier or a historical page identifier, the method further comprises:
if the page identifier of the page to be acquired is a new page identifier, generating a new page data fingerprint for the new page, and determining whether the new page data fingerprint exists among the preset historical acquisition data fingerprints;
and if the new page data fingerprint does not exist among the historical acquisition data fingerprints, writing the new page data fingerprint into the historical data fingerprints, downloading and parsing the new page data, and storing it in the data acquisition library.
8. A network data increment acquisition device is characterized by comprising:
the page identification judging module is used for acquiring a page to be acquired in a target website, generating a page identification of the page to be acquired and judging whether the page identification of the page to be acquired is a new page identification or a historical page identification;
the page hash value calculation module is used for acquiring the historical page data to be acquired if the page identifier of the page to be acquired is the historical page identifier, and calculating the locality-sensitive hash value corresponding to the historical page data using the Simhash locality-sensitive hash algorithm;
the page similarity calculation module is used for reading from the cache data the locality-sensitive hash value from the last acquisition of the page to be acquired, and calculating, based on a preset distance measurement algorithm, the similarity between that value and the locality-sensitive hash value corresponding to the historical page data from the current acquisition;
and the page data analysis module is used for updating the cached locality-sensitive hash value when the similarity is greater than a preset threshold, parsing the historical page data, and storing it in a data acquisition library.
9. A network data incremental acquisition device, wherein the network data incremental acquisition device comprises: a memory, a processor and a network data incremental acquisition program stored on the memory and executable on the processor, the network data incremental acquisition program when executed by the processor implementing the steps of the network data incremental acquisition method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, on which a network data incremental acquisition program is stored, which when executed by a processor implements the steps of the network data incremental acquisition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010242238.0A CN111444411A (en) | 2020-03-30 | 2020-03-30 | Network data increment acquisition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111444411A true CN111444411A (en) | 2020-07-24 |
Family
ID=71649595
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010242238.0A Pending CN111444411A (en) | 2020-03-30 | 2020-03-30 | Network data increment acquisition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444411A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178736A (en) * | 2007-12-11 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page collecting method and web page collecting server |
CN104391917A (en) * | 2014-11-19 | 2015-03-04 | 四川长虹电器股份有限公司 | Method for incrementally capturing webpage contents |
WO2017152550A1 (en) * | 2016-03-09 | 2017-09-14 | 乐视控股(北京)有限公司 | Webpage capture method and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112214270A (en) * | 2020-09-18 | 2021-01-12 | 北京鸿腾智能科技有限公司 | Page redrawing method, device, equipment and storage medium |
CN112214270B (en) * | 2020-09-18 | 2024-09-17 | 三六零数字安全科技集团有限公司 | Page redrawing method, device, equipment and storage medium |
CN112631922A (en) * | 2020-12-28 | 2021-04-09 | 广州品唯软件有限公司 | Flow playback data selection method, system and storage medium |
CN118332217A (en) * | 2024-06-12 | 2024-07-12 | 上海蜜度科技股份有限公司 | Data acquisition method, system, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||