CN109918554A

CN109918554A - Web data crawling method, device, system and computer readable storage medium

Info

Publication number: CN109918554A
Application number: CN201910113261.7A
Authority: CN
Inventors: 吴启; 王雪春
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-02-13
Filing date: 2019-02-13
Publication date: 2019-06-21
Also published as: WO2020164276A1

Abstract

The present invention provides a web data crawling method and relates to the field of data crawling. The method is applied to a web data crawling system that includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web data sent by a crawler server, performing feature extraction on the first web data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold; if not, storing the first web data in the preset storage database. The present invention also provides a web data crawling device, system and computer-readable storage medium. The present invention can reduce the repetition rate of web data crawled by web crawlers.

Description

Web data crawling method, device, system and computer readable storage medium

Technical field

The present invention relates to the technical field of data crawling, and in particular to a web data crawling method, device, system and computer-readable storage medium.

Background

With the rapid development of network technology, obtaining data over the Internet has become an important way for people to acquire information resources, and web crawlers have become the mainstream means of obtaining web data. However, because Internet content is freely reposted and published across many websites, the data collected by web crawlers is mostly mixed with redundant and duplicate data, which affects subsequent data analysis. A method is therefore needed for removing duplicate data from the web data collected by web crawlers, so as to reduce the repetition rate of the crawled web data.

Summary of the invention

The main purpose of the present invention is to provide a web data crawling method, device, system and computer-readable storage medium, aiming to reduce the repetition rate of web data crawled by web crawlers.

To achieve the above object, the present invention provides a web data crawling method. The web data crawling method is applied to a web data crawling system, which includes a control server and multiple crawler servers connected to the control server. The web data crawling method includes:

When the control server receives first web data sent by the crawler server, performing feature extraction on the first web data to obtain a first text feature vector;

Inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;

Calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;

If none of the similarity values is greater than the preset threshold, storing the first web data in the preset storage database.

Optionally, before the step in which the control server, upon receiving the first web data sent by the crawler server, performs feature extraction on the first web data to obtain the first text feature vector, the method further includes:

Obtaining training samples, the training samples including multiple labelled web data combinations;

Performing feature extraction on the training samples to obtain second text feature vectors;

Inputting the second text feature vectors into an initial neural network model to obtain the corresponding second web page fingerprints;

Calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity values of the training samples;

Updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and iteratively training on the training samples to obtain the trained fingerprint generation model.

Optionally, the step of calculating the loss value according to the second web page fingerprints and the true similarity values of the training samples includes:

Calculating, according to the second web page fingerprints, the similarity value between the web data in each training sample, recorded as the predicted similarity value;

Calculating the loss value by a preset loss function according to the predicted similarity value and the true similarity value of the training sample;

Wherein the preset loss function is defined over the predicted similarity value and c, the true similarity value of the training sample.

Optionally, the step in which the control server, upon receiving the first web data sent by the crawler server, performs feature extraction on the first web data to obtain the first text feature vector includes:

When the control server receives the first web data sent by the crawler server, performing word segmentation on the first web data to obtain a first word set;

Filtering the stop words out of the first word set according to a preset stop-word list to obtain a second word set;

Calculating the weight of each word in the second word set according to a preset rule, and obtaining the first text feature vector from the words in the second word set and their weights.

Optionally, the step of calculating the weight of each word in the second word set according to the preset rule, and obtaining the first text feature vector from the words in the second word set and their weights, includes:

Calculating the term frequency and the inverse document frequency of each word in the second word set;

Calculating the term frequency-inverse document frequency (TF-IDF) of each word from its term frequency and inverse document frequency, and using the TF-IDF as the weight;

Selecting a preset number of weights according to their magnitude, and generating the first text feature vector from the selected weights and the words corresponding to them.

Optionally, the web data crawling method further includes:

When receiving a to-be-crawled uniform resource locator (URL) sent by the crawler server, calculating the hash value of the to-be-crawled URL;

Detecting whether the hash value exists in a preset Redis (key-value store) database;

If the hash value does not exist in the preset Redis database, adding the to-be-crawled URL to a preset to-be-crawled queue, so that the crawler server obtains the to-be-crawled URL from the preset to-be-crawled queue and crawls new web data according to it;

If the hash value exists in the preset Redis database, deleting the to-be-crawled URL.

Optionally, the web data crawling method further includes:

Counting, within a preset period, the amount of data sent by the crawler servers and the amount of data stored in the preset storage database, and generating a corresponding statistical report from the amount sent and the amount stored;

Sending the statistical report to a preset operations terminal so that staff can analyze data duplication.

In addition, to achieve the above object, the present invention also provides a web data crawling device, which includes:

A first feature extraction module, configured to perform, when the control server receives first web data sent by the crawler server, feature extraction on the first web data to obtain a first text feature vector;

A first fingerprint generation module, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;

A similarity value judgment module, configured to calculate similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and to judge whether any of the similarity values is greater than a preset threshold;

A web data storage module, configured to store the first web data in the preset storage database if none of the similarity values is greater than the preset threshold.

In addition, to achieve the above object, the present invention also provides a web data crawling system. The web data crawling system includes a control server and multiple crawler servers connected to the control server, and further includes a memory, a processor, and a web data crawling program stored on the memory and executable by the processor, wherein the web data crawling program, when executed by the processor, implements the steps of the web data crawling method described above.

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium on which a web data crawling program is stored, wherein the web data crawling program, when executed by a processor, implements the steps of the web data crawling method described above.

The present invention provides a web data crawling method, device, system and computer-readable storage medium. The method is applied to a web data crawling system built on distributed crawler technology, which includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web data sent by a crawler server, performing feature extraction on the first web data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the calculated similarity values is greater than a preset threshold; if not, storing the first web data in the preset storage database. By building a fingerprint generation model, the present invention generates a fingerprint for each piece of crawled web data and compares it with the fingerprints of the web data already crawled and saved in the storage database to judge whether the data is a duplicate; only non-duplicate data is saved, so storing duplicate web data is avoided. The present invention can therefore solve the prior-art problem of high redundancy in data crawled by web crawlers. In addition, combining the method with distributed crawler technology enables fast crawling of web data, so that data collection is completed in a relatively short time and its efficiency is improved.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the terminal in the hardware operating environment involved in the embodiments of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the web data crawling method of the present invention;

Fig. 3 is a schematic flowchart of a second embodiment of the web data crawling method of the present invention;

Fig. 4 is a functional block diagram of a first embodiment of the web data crawling device of the present invention.

The realization of the object, the functional features and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.

Detailed description of the embodiments

It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.

Please refer to Fig. 1, which is a schematic structural diagram of the terminal in the hardware operating environment involved in the embodiments of the present invention.

The terminal in the embodiments of the present invention is the control server, which may be a terminal device such as a PC (personal computer), a laptop or a server.

As shown in Fig. 1, the terminal may include a processor 1001, such as a CPU (Central Processing Unit), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless-Fidelity, Wi-Fi, interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory; optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001. Those skilled in the art will understand that the hardware structure shown in Fig. 1 does not limit the present invention, and that the terminal may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

With continued reference to Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module and a web data crawling program. In Fig. 1, the network communication module may be used to connect to a server and exchange data with it, and the processor 1001 may be used to call the web data crawling program stored in the memory 1005 and execute the web data crawling method provided by the embodiments of the present invention.

Based on the above hardware structure, the embodiments of the web data crawling method of the present invention are proposed.

The present invention provides a web data crawling method.

Please refer to Fig. 2, which is a schematic flowchart of the first embodiment of the web data crawling method of the present invention.

In this embodiment, the web data crawling method is applied to a web data crawling system, which includes a control server and multiple crawler servers connected to the control server. The web data crawling method includes:

Step S10, when the control server receives first web data sent by the crawler server, performing feature extraction on the first web data to obtain a first text feature vector;

In this embodiment, the web data crawling method is applied to a web data crawling system built on the distributed crawler technology Scrapy-Redis. Scrapy-Redis is a Redis-based distributed component for Scrapy; it uses Redis to store and schedule the URL tasks to be crawled, and stores the crawled web data for subsequent processing. The web data crawling system includes a control server (Master) and multiple crawler servers (Slaves), each of which is communicatively connected to the control server. The control server is mainly responsible for generating fingerprints of web data, deduplicating by fingerprint, distributing URL (Uniform Resource Locator) tasks, deduplicating URLs and storing web data. The crawler servers are mainly responsible for executing the crawler program to crawl web data, and for submitting the crawled web data and the new URL tasks found during crawling to the Redis database on the control server.

In this embodiment, the web data crawling method is implemented by the control server. During data crawling, a crawler server obtains a to-be-crawled URL task from the Redis-based to-be-crawled queue on the control server, crawls the corresponding web data (recorded as the first web data) according to that URL task, and then sends the crawled first web data to the control server. When the control server receives the first web data sent by the crawler server, it performs feature extraction on the first web data to obtain the first text feature vector. Specifically, step S10 includes:

Step a1, when the control server receives the first web data sent by the crawler server, performing word segmentation on the first web data to obtain a first word set;

When the control server receives the first web data sent by the crawler server, it performs word segmentation on the first web data to obtain the first word set. Word segmentation may be performed by a word segmentation tool, such as the Chinese lexical analysis system ICTCLAS, the Tsinghua University Chinese lexical analysis toolkit THULAC, or the Language Technology Platform LTP. Word segmentation mainly cuts each Chinese text in the data into individual words according to the characteristics of the Chinese language, and performs part-of-speech tagging.

Step a2, filtering the stop words out of the first word set according to a preset stop-word list to obtain a second word set;

Then, part-of-speech tagging is performed on the words in the first word set, and the stop words in the first word set are filtered out according to the preset stop-word list to obtain the second word set. Stop words mainly fall into two classes. The first class consists of words that are used too frequently, such as "I" and "just", which appear in almost every document. The second class consists of words that occur very frequently in text but carry no practical meaning; such words only play a role when placed in a complete sentence, and include modal particles, adverbs, prepositions, conjunctions and the like. Filtering out stop words avoids the influence of meaningless words, which helps improve the accuracy of the generated text feature vector and, in turn, the accuracy of web page duplicate detection.
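The stop-word filtering of step a2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the English tokens and the function name `filter_stop_words` are hypothetical, and the input is assumed to have already been split into words by a segmentation tool.

```python
# Hypothetical stop-word list; the patent only refers to a "preset stop-word list".
STOP_WORDS = {"i", "just", "the", "of"}

def filter_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not in the stop-word list, preserving order."""
    return [w for w in words if w not in stop_words]

# First word set (output of word segmentation) -> second word set.
first_word_set = ["i", "just", "crawl", "the", "pages", "of", "news", "sites"]
second_word_set = filter_stop_words(first_word_set)
```

Keeping the stop-word list in a set makes each membership check constant-time, which matters when every crawled page passes through this filter.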

Step a3, calculating the weight of each word in the second word set according to a preset rule, and obtaining the first text feature vector from the words in the second word set and their weights.

After the second word set is obtained, the weight of each word in the second word set is calculated according to the preset rule, and the first text feature vector is obtained from the words in the second word set and their weights. Specifically, step a3 includes:

Step a31, calculating the term frequency and the inverse document frequency of each word in the second word set;

In this embodiment, the term frequency and the inverse document frequency of each word in the second word set are calculated first. The term frequency TF indicates how frequently a word occurs in the first web data; the inverse document frequency IDF reflects how the word is distributed across all web pages and is a measure of the general importance of a word. TF and IDF are calculated as follows:

TF_i = n_i / n

IDF_i = log(N / N_i)

Wherein n_i is the number of times word i occurs in the first web data, n is the total number of words in the first web data, N is the total number of web pages in the web page collection, and N_i is the number of web pages in the collection that contain word i.
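A minimal sketch of the term-frequency and inverse-document-frequency computation, assuming the usual definitions TF = n_i / n and IDF = log(N / N_i) built from the quantities defined above. The function names and the toy page collection are hypothetical, and the IDF helper assumes the word occurs in at least one page of the collection.

```python
import math
from collections import Counter

def term_frequency(words):
    """TF_i = n_i / n for every distinct word in one page."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

def inverse_document_frequency(word, pages):
    """IDF_i = log(N / N_i); assumes the word occurs in at least one page."""
    containing = sum(1 for page in pages if word in page)
    return math.log(len(pages) / containing)

# Toy collection of four already-segmented pages (hypothetical data).
pages = [
    ["crawler", "web", "crawler", "data"],
    ["web", "page"],
    ["data", "store"],
    ["news"],
]
tf = term_frequency(pages[0])
idf_crawler = inverse_document_frequency("crawler", pages)
idf_web = inverse_document_frequency("web", pages)
```

In this toy collection "crawler" appears in one page out of four and "web" in two, so "crawler" receives the larger IDF, reflecting that rarer words are more distinctive.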

Step a32, calculating the term frequency-inverse document frequency (TF-IDF) of each word from its term frequency and inverse document frequency, and using the TF-IDF as the weight;

Then, the term frequency-inverse document frequency of each word is calculated from its term frequency and inverse document frequency, and is used as the weight. The term frequency-inverse document frequency TF-IDF is the product of the term frequency and the inverse document frequency, i.e. TF-IDF = TF × IDF.

Step a33, selecting a preset number of weights according to their magnitude, and generating the first text feature vector from the selected weights and the words corresponding to them.

After the weight of each word is obtained, a preset number (for example k) of weights is selected by magnitude, that is, the k largest weights are selected and denoted w₁, w₂, ..., w_k. The first text feature vector is then generated from the selected weights and the words corresponding to them. The words corresponding to the selected weights are denoted s₁, s₂, ..., s_k, and the first text feature vector corresponding to the first web data is V = ((s₁, w₁), (s₂, w₂), ..., (s_k, w_k)). Optionally, the preset number is 10 to 20, that is, k is a natural number between 10 and 20.
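Selecting the k largest weights and pairing them with their words can be sketched as below. The function name `build_feature_vector` and the sample weights are hypothetical; the weights are assumed to be TF-IDF values computed beforehand.

```python
def build_feature_vector(weights, k):
    """Return the k (word, weight) pairs with the largest weights, in descending order."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical TF-IDF weights for one page.
tfidf = {"crawler": 0.42, "web": 0.10, "fingerprint": 0.35, "data": 0.05}
vector = build_feature_vector(tfidf, k=2)
```

The result has the shape V = ((s₁, w₁), ..., (s_k, w_k)) described in the text, here with k = 2 for brevity rather than the suggested 10 to 20.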

Of course, in a specific embodiment, the weights smaller than a preset threshold may instead be filtered out by magnitude, and the first text feature vector generated from the remaining weights and their corresponding words.
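The threshold-based variant can be sketched in the same way. The function name, the threshold value and the sample weights are hypothetical; treating a weight equal to the threshold as retained is an assumption, since the text only says that weights smaller than the threshold are filtered out.

```python
def filter_by_weight(weights, min_weight):
    """Keep (word, weight) pairs whose weight is not below min_weight, largest first."""
    kept = {word: w for word, w in weights.items() if w >= min_weight}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)

# Hypothetical TF-IDF weights for one page.
tfidf = {"crawler": 0.5, "web": 0.1, "fingerprint": 0.3}
vector = filter_by_weight(tfidf, min_weight=0.3)
```

Unlike the top-k selection, this variant yields vectors of varying length, which a downstream model would have to pad or truncate.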

Step S20, inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;

After the first text feature vector is obtained, it is input into the pre-trained fingerprint generation model to obtain the first web page fingerprint. For the specific training process of the fingerprint generation model, reference may be made to the second embodiment below, which is not repeated here.

Step S30, calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;

With the trained fingerprint generation model, a corresponding web page fingerprint can be generated for each piece of crawled web data. Web data with the same web page fingerprint can be regarded as identical, and web data with similar content has similar web page fingerprints. Whether the first web data duplicates web data that has already been crawled and stored can therefore be judged from the similarity values between the web page fingerprint of the first web data (i.e. the first web page fingerprint) and the stored web page fingerprints in the preset storage database. Specifically, since a web page fingerprint is a hash string, the Hamming distance between web page fingerprints can be calculated and used as the similarity value, and it is then checked whether any of the calculated similarity values is greater than the preset threshold. In information coding, the Hamming distance, also called the code distance, is the number of positions at which two valid codewords differ. For the specific method of calculating the Hamming distance, reference may be made to the prior art, which is not repeated here. The preset threshold may be set according to the actual situation and is not specifically limited here.
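A minimal sketch of the Hamming-distance comparison, assuming fingerprints are equal-length bit strings. Because a smaller Hamming distance means more similar fingerprints, the sketch converts the distance into a similarity score in [0, 1]; this normalization is an assumption made here so that "similarity greater than a threshold" indicates a likely duplicate, and it is not stated in the source.

```python
def hamming_distance(fp_a, fp_b):
    """Number of positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(1 for a, b in zip(fp_a, fp_b) if a != b)

def similarity(fp_a, fp_b):
    """Assumed normalization: 1.0 for identical fingerprints, 0.0 if every position differs."""
    return 1.0 - hamming_distance(fp_a, fp_b) / len(fp_a)

# Two hypothetical 8-bit fingerprints that differ in two positions.
distance = hamming_distance("10110010", "10010011")
score = similarity("10110010", "10010011")
```

Real fingerprints would typically be longer (for example 64 bits), in which case storing them as integers and counting the set bits of their XOR is a common alternative to comparing strings.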

Step S40, if none of the similarity values is greater than the preset threshold, storing the first web data in the preset storage database.

If none of the similarity values between the first web page fingerprint and the stored web page fingerprints in the preset storage database is greater than the preset threshold, the first web data does not duplicate any web data already crawled. In this case the first web data is stored in the preset storage database for subsequent use.

If any of the similarity values between the first web page fingerprint and the stored web page fingerprints in the preset storage database is greater than the preset threshold, the first web data overlaps heavily with one or more pieces of web data already crawled. The first web data is most likely new web data obtained by reposting with minor additions and modifications, and web data similar to it has already been crawled and stored. In this case the first web data is deleted, that is, it is refused storage in the preset storage database.
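The store-or-discard decision of steps S30 and S40 can be sketched as follows. This is an illustration under assumptions: the database is modelled as an in-memory list of (fingerprint, page) pairs, fingerprints are equal-length bit strings, similarity is the fraction of matching positions, and the names `store_if_new` and `threshold` are hypothetical.

```python
def similarity(fp_a, fp_b):
    """Fraction of positions at which two equal-length fingerprints agree (assumed measure)."""
    same = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return same / len(fp_a)

def store_if_new(page, fingerprint, database, threshold):
    """Append (fingerprint, page) to database unless a stored fingerprint is too similar."""
    for stored_fp, _ in database:
        if similarity(fingerprint, stored_fp) > threshold:
            return False  # treated as a repost or near-duplicate and discarded
    database.append((fingerprint, page))
    return True

# In-memory stand-in for the preset storage database.
database = [("10110010", "page A")]
kept_near = store_if_new("page A (reposted)", "10110011", database, threshold=0.8)
kept_new = store_if_new("page B", "01001101", database, threshold=0.8)
```

The linear scan over all stored fingerprints is the simplest reading of the text; a large database would likely need an index over fingerprints to keep this check fast.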

The present invention provides a web data crawling method applied to a web data crawling system built on distributed crawler technology, which includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web data sent by a crawler server, performing feature extraction on the first web data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the calculated similarity values is greater than a preset threshold; if not, storing the first web data in the preset storage database. By building a fingerprint generation model, the present invention generates a fingerprint for each piece of crawled web data and compares it with the fingerprints of the web data already crawled and saved in the storage database to judge whether the data is a duplicate; only non-duplicate data is saved, so storing duplicate web data is avoided. The present invention can therefore solve the prior-art problem of high redundancy in data crawled by web crawlers. In addition, combining the method with distributed crawler technology enables fast crawling of web data, so that data collection is completed in a relatively short time and its efficiency is improved.

Further, please refer to Fig. 3, which is a schematic flowchart of the second embodiment of the web data crawling method of the present invention.

Based on the first embodiment shown in Fig. 2, before step S10, the web data crawling method further includes:

Step S50, obtaining training samples, the training samples including multiple labelled web data combinations;

This embodiment provides the specific training process of the fingerprint generation model. First, training samples are obtained, the training samples including multiple labelled web data combinations. Optionally, each web data combination includes two pieces of web data, and labelling consists of annotating each web data combination with its true similarity value. The web data may come from various websites, or may be edited and created in real time as needed.

Step S60, performing feature extraction on the training samples to obtain second text feature vectors;

Then, feature extraction is performed on the training samples to obtain the second text feature vectors. Specifically, since each training sample is a combination of two pieces of web data, feature extraction on a training sample proceeds as follows: word segmentation and part-of-speech tagging are first performed on each piece of web data in the training sample; stop words are then filtered out according to the preset stop-word list to obtain the corresponding word set; the term frequency and inverse document frequency of each word in the word set are calculated, and the term frequency-inverse document frequency is calculated from them and used as the weight; finally, the second text feature vector is obtained from the words in the word set and their weights. Accordingly, each training sample corresponds to two text feature vectors. The specific generation process of the second text feature vector is essentially the same as that of the first text feature vector in the first embodiment above and is not repeated here.

Step S70, inputting the second text feature vectors into an initial neural network model to obtain the corresponding second web page fingerprints;

After the second text feature vectors are obtained, they are input into the initial neural network model to obtain the corresponding second web page fingerprints. Specifically, the two second text feature vectors corresponding to a training sample are input into the initial neural network model in turn to obtain the second web page fingerprints. Accordingly, each training sample also corresponds to two second web page fingerprints.

Step S80, calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity values of the training samples;

The loss value is calculated by the preset algorithm according to the second web page fingerprints and the true similarity values of the training samples. Specifically, step S80 includes:

Step b1, calculating, according to the second web page fingerprints, the similarity value between the web data in each training sample, recorded as the predicted similarity value;

In this embodiment, the similarity value between the web data in a training sample is first calculated according to the second web page fingerprints and recorded as the predicted similarity value. Specifically, the similarity value can be obtained by calculating the Hamming distance.

Step b2, calculating the loss value by a preset loss function according to the predicted similarity value and the true similarity value of the training sample;

After the predicted similarity value is calculated, the loss value is calculated by the preset loss function according to the predicted similarity value and the true similarity value of the training sample. The preset loss function is defined over the predicted similarity value and c, the true similarity value of the training sample.

Step S90, updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and iteratively training on the training samples to obtain the trained fingerprint generation model.

Finally, according to the penalty values being calculated by gradient descent algorithm to the parameter of the initial neural network model into Row updates, and is iterated training to each training sample, i.e., is updated according to penalty values each in initial neural network model The gradient of layer node, and then update the weighting parameter of each node, continually enter the second text feature corresponding to training sample to Amount is iterated until network convergence, until the penalty values stablize drop to a smaller range (as lower than a preset threshold or Reach minimum value), at this point, trained neural network model can be obtained, i.e. the good fingerprint of stretched wire generates model.By under gradient Drop algorithm can solve the optimization problem of extensive sample data, and specific gradient descent algorithm can refer to the prior art, herein not It repeats.
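The iterate-until-the-loss-is-small procedure of step S90 can be sketched as follows. This is only an illustration of the gradient descent loop and its stopping rule: the model is a single sigmoid unit standing in for the neural network, the loss is a mean squared error chosen for the example (the preset loss function is not reproduced here), and the per-pair feature vectors, learning rate, and threshold are all assumed values.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, lr=0.5, loss_threshold=0.01, max_epochs=5000):
    """Gradient descent on (features, true_similarity) pairs.

    Stand-in model: predicted similarity = sigmoid(w . features + b).
    Stand-in loss: mean squared error (an assumption for this sketch).
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    history = []  # loss value recorded at the start of each epoch
    for _ in range(max_epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        loss = 0.0
        for features, c_true in samples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            c_pred = sigmoid(z)
            loss += (c_pred - c_true) ** 2
            # Derivative of the squared error with respect to z.
            dz = 2.0 * (c_pred - c_true) * c_pred * (1.0 - c_pred)
            for i, x in enumerate(features):
                grad_w[i] += dz * x
            grad_b += dz
        loss /= len(samples)
        history.append(loss)
        if loss < loss_threshold:  # loss has fallen below the preset threshold
            break
        # Parameter update: step against the averaged gradient.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias, history


# Hypothetical training pairs: one similar (c = 1) and one dissimilar (c = 0).
samples = [([1.0, 0.9], 1.0), ([0.1, 0.0], 0.0)]
weights, bias, history = train(samples)
```

The `history` list makes it possible to check that the loss value decreases over the iterations, which is the convergence condition described above.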

Further, based on the above embodiments, a third embodiment of the web page data crawling method of the present invention is proposed.

In this embodiment, after step S40, the web page data crawling method may further include:

Step A: when receiving a uniform resource locator (URL) to be crawled sent by the crawler server, calculating the hash value of the URL to be crawled;

Step B: detecting whether the hash value exists in a preset Redis (key-value store) database;

In this embodiment, while crawling data, a crawler server can extract new URLs (Uniform Resource Locators) from the web pages it has fetched and send each new URL (denoted as a URL to be crawled) to the control server. On receiving a URL to be crawled from the crawler server, the control server calculates its hash value and then detects whether that hash value exists in the preset Redis (key-value store) database. For the method of calculating the hash value, reference may be made to the prior art; details are not repeated here.

If the hash value does not exist in the preset Redis database, step C is executed: adding the URL to be crawled to a preset to-be-crawled queue, so that the crawler server obtains the URL to be crawled from the preset to-be-crawled queue and crawls new web page data according to the URL to be crawled;

If the hash value exists in the preset Redis database, step D is executed: deleting the URL to be crawled.

If the hash value does not exist in the preset Redis database, the URL to be crawled has not been crawled before. In this case, the URL to be crawled is added to the preset to-be-crawled queue, so that the crawler server can later obtain it from the queue and crawl new web page data according to it.

If the hash value exists in the preset Redis database, the URL to be crawled has already been crawled. In this case, the URL to be crawled is deleted and is not added to the preset to-be-crawled queue.

In this embodiment, deduplicating repeated URLs avoids fetching the same web page more than once, which further reduces the repetition rate of the crawled web page data.
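Steps A to D can be sketched as follows. To keep the example self-contained, an in-memory set stands in for the Redis database and a `deque` stands in for the to-be-crawled queue; the use of SHA-256 as the hash function, the class and method names, and the choice to record the hash at the moment the URL is enqueued are all assumptions made for illustration.

```python
import hashlib
from collections import deque


class UrlDeduplicator:
    """Toy control-server component for URL deduplication."""

    def __init__(self):
        self.seen_hashes = set()  # stand-in for the preset Redis database
        self.pending = deque()    # stand-in for the to-be-crawled queue

    @staticmethod
    def url_hash(url: str) -> str:
        # Step A: compute the hash value of the URL (SHA-256 is an assumption).
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def submit(self, url: str) -> bool:
        digest = self.url_hash(url)
        # Step B: check whether the hash value is already recorded.
        if digest in self.seen_hashes:
            return False  # Step D: discard the URL
        # Step C: record the hash and enqueue the URL for crawling.
        self.seen_hashes.add(digest)
        self.pending.append(url)
        return True


dedup = UrlDeduplicator()
first = dedup.submit("https://example.com/news/1")
second = dedup.submit("https://example.com/news/1")  # same URL again
```

Here the first submission is enqueued and the second is rejected, so the queue holds the URL only once.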

In this embodiment, after step S40 of the first or second embodiment, the web page data crawling method may further include:

Step E: counting, within a preset period, the amount of data sent by the crawler servers and the amount of data stored in the preset storage database, and generating a corresponding statistical report according to the amount of data sent and the amount of data stored;

Step F: sending the statistical report to a preset operation terminal, so that staff can analyze data duplication.

In this embodiment, to help staff understand how much of the web page data is duplicated, the control server can count the amount of data sent by the crawler servers and the amount of data stored in the preset storage database within a preset period, and generate a corresponding statistical report from the two amounts. The preset period can be one day, one week, and so on; it can be set according to the actual situation and is not limited here. The statistics interval is determined by the preset period; for example, when the preset period is one week, statistics are collected once a week. The statistical report can show the amount of data sent, the amount of data stored, the data storage rate (data storage rate = amount of data stored / amount of data sent), and so on.

After the statistical report is generated, it is sent to the preset operation terminal so that staff can analyze data duplication.

Further, because the crawler servers may crawl multiple different websites, the web page data received by the control server also comes from different websites. When the statistical report is generated, the data can therefore be classified and analyzed with the website (data source) as the dimension, to obtain the data storage rate of each website. A website with a low storage rate can be regarded as one with relatively serious data duplication, and staff may stop crawling data from that website to reduce server resource consumption.
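The per-website analysis can be sketched as follows, using the definition data storage rate = amount of data stored / amount of data sent. The input format (one site label per page sent or stored), the function names, and the 0.5 cut-off for a "low" storage rate are hypothetical choices for illustration.

```python
from collections import Counter


def storage_rates(sent_sites, stored_sites):
    """Return {site: stored / sent} for every site that sent data."""
    sent = Counter(sent_sites)
    stored = Counter(stored_sites)
    return {site: stored[site] / sent[site] for site in sent}


def low_rate_sites(rates, min_rate=0.5):
    """List sites whose storage rate is below min_rate (assumed cut-off)."""
    return sorted(site for site, rate in rates.items() if rate < min_rate)


# One label per page sent by the crawler servers during the period.
sent = ["a.example"] * 4 + ["b.example"] * 2
# One label per page that was actually stored after deduplication.
stored = ["a.example"] + ["b.example"] * 2

rates = storage_rates(sent, stored)
flagged = low_rate_sites(rates)
```

In this example `a.example` has a storage rate of 0.25 and would be flagged as a heavily duplicated source, while `b.example` has a rate of 1.0.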

In addition, the above statistical reports can be further aggregated at preset intervals to obtain a summary statistical report. For example, when statistics are collected once a week, the weekly statistical reports can be aggregated every three months to obtain a quarterly statistical report. The quarterly statistical report can show, in graphical form, the trends in the amount of data sent, the amount of data stored, and the data storage rate. The summary statistical report is then sent to the preset operation terminal so that staff can understand the data duplication situation at a macro level.

The present invention also provides a web page data crawling device.

Referring to Fig. 4, Fig. 4 is a functional block diagram of a first embodiment of the web page data crawling device of the present invention.

In this embodiment, the web page data crawling device includes:

a first feature extraction module 10, configured to, when the control server receives first web page data sent by the crawler server, perform feature extraction on the first web page data to obtain a first text feature vector;

a first fingerprint generation module 20, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;

a similarity value judgment module 30, configured to calculate similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and to judge whether any of the similarity values is greater than a preset threshold;

a web page data storage module 40, configured to store the first web page data in the preset storage database if none of the similarity values is greater than the preset threshold.

Each virtual function module of the above web page data crawling device is stored in the memory 1005 of the web page data crawling system shown in Fig. 1 and is used to implement all functions of the web page data crawling program. When executed by the processor 1001, the modules generate a fingerprint for the crawled web page data, compare it with the fingerprints of the crawled web page data already saved in the preset storage database to judge whether the data is a duplicate, and thereby reject duplicate web page data.
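The cooperation of modules 30 and 40 can be sketched as follows. The sketch assumes that fingerprints are fixed-length bit strings held as integers, that similarity is a normalized Hamming similarity, and that the preset storage database is a dictionary mapping fingerprints to page data; the threshold value and all names are hypothetical.

```python
def similarity(fp_a: int, fp_b: int, n_bits: int = 64) -> float:
    """Normalized Hamming similarity in [0, 1] (an assumed similarity measure)."""
    return 1.0 - bin(fp_a ^ fp_b).count("1") / n_bits


def store_if_new(fingerprint, page, database, threshold=0.9, n_bits=64):
    """Store the page unless a stored fingerprint is too similar.

    Returns True if the page was stored, False if it was rejected as a duplicate.
    """
    for stored_fp in database:
        if similarity(fingerprint, stored_fp, n_bits) > threshold:
            return False  # a similarity value exceeds the preset threshold
    database[fingerprint] = page
    return True


database = {}
kept_a = store_if_new(0b11110000, "page A", database, threshold=0.8, n_bits=8)
kept_copy = store_if_new(0b11110001, "near copy of A", database, threshold=0.8, n_bits=8)
kept_b = store_if_new(0b00001111, "page B", database, threshold=0.8, n_bits=8)
```

The near copy differs from page A in one bit out of eight (similarity 0.875), so it is rejected, while page B is stored. A linear scan over all stored fingerprints is used only for clarity; a real deployment would likely need an index.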

Further, the web page data crawling device further includes:

a training sample acquisition module, configured to acquire training samples, where the training samples include multiple labeled combinations of web page data;

a second feature extraction module, configured to perform feature extraction on the training samples to obtain second text feature vectors;

a second fingerprint generation module, configured to input the second text feature vectors into an initial neural network model to obtain corresponding second web page fingerprints;

a loss value calculation module, configured to calculate a loss value by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample;

a fingerprint generation model training module, configured to update the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and to perform iterative training on the training samples to obtain the trained fingerprint generation model.

Further, the loss value calculation module includes:

a predicted similarity value calculation unit, configured to calculate, according to the second web page fingerprints, the similarity value between the web page data in the training sample, denoted as the predicted similarity value;

a loss value calculation unit, configured to calculate the loss value by a preset loss function according to the predicted similarity value and the true similarity value of the training sample;

The preset loss function is as follows:

Further, the first feature extraction module 10 includes:

a word segmentation unit, configured to, when the control server receives the first web page data sent by the crawler server, perform word segmentation on the first web page data to obtain a first word set;

a stop word filtering unit, configured to filter the stop words out of the first word set according to a preset stop word list to obtain a second word set;

a first feature acquisition unit, configured to calculate the weight of each word in the second word set according to a preset rule, and to obtain the first text feature vector according to the words in the second word set and their weights.

Further, the first feature acquisition unit includes:

a first calculation subunit, configured to calculate the term frequency and the inverse document frequency of each word in the second word set;

a second calculation subunit, configured to calculate the term frequency-inverse document frequency (TF-IDF) of each word according to its term frequency and inverse document frequency, and to use the TF-IDF as the weight;

a first feature acquisition subunit, configured to select a preset number of weights according to their magnitude, and to generate the first text feature vector according to the selected weights and the words corresponding to the selected weights.
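The feature extraction performed by these units can be sketched as follows. The sketch assumes that word segmentation has already produced a list of tokens, that the reference corpus is a list of token sets, and that the inverse document frequency is computed as log(N / (1 + document frequency)), which is one common variant; the function name and parameters are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_k(tokens, corpus, stop_words, k):
    """Return the k (word, weight) pairs with the largest TF-IDF weights."""
    # Stop word filtering: the first word set becomes the second word set.
    filtered = [t for t in tokens if t not in stop_words]
    counts = Counter(filtered)
    total = len(filtered)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / total  # term frequency within this page
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / (1 + doc_freq))  # assumed IDF variant
        weights[term] = tf * idf
    # Keep the preset number of largest weights with their words.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


tokens = ["the", "crawler", "crawler", "fetches", "pages"]
stop_words = {"the"}
corpus = [
    {"crawler", "pages"},
    {"pages", "index"},
    {"news", "pages"},
    {"sports"},
    {"weather"},
]
features = tfidf_top_k(tokens, corpus, stop_words, k=2)
```

In this example "pages" appears in most reference documents and so receives a low weight, leaving "crawler" and "fetches" as the selected features.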

Further, the web page data crawling device further includes:

a hash value calculation module, configured to calculate the hash value of a uniform resource locator (URL) to be crawled when the URL to be crawled sent by the crawler server is received;

a hash value detection module, configured to detect whether the hash value exists in a preset Redis (key-value store) database;

a URL adding module, configured to, if the hash value does not exist in the preset Redis database, add the URL to be crawled to a preset to-be-crawled queue, so that the crawler server obtains the URL to be crawled from the preset to-be-crawled queue and crawls new web page data according to the URL to be crawled;

a URL deletion module, configured to delete the URL to be crawled if the hash value exists in the preset Redis database.

Further, the web page data crawling device further includes:

a statistical report generation module, configured to count, within a preset period, the amount of data sent by the crawler servers and the amount of data stored in the preset storage database, and to generate a corresponding statistical report according to the amount of data sent and the amount of data stored;

a statistical report sending module, configured to send the statistical report to a preset operation terminal so that staff can analyze data duplication.

The functions of the modules in the above web page data crawling device correspond to the steps in the above embodiments of the web page data crawling method; their functions and implementation processes are not repeated here.

The present invention also provides a web page data crawling system. The web page data crawling system includes a control server and a plurality of crawler servers connected to the control server, and further includes a memory, a processor, and a web page data crawling program stored in the memory and executable on the processor. When executed by the processor, the web page data crawling program implements the steps of the web page data crawling method described in any of the above embodiments.

The specific embodiments of the web page data crawling system of the present invention are substantially the same as the above embodiments of the web page data crawling method and are not repeated here.

The present invention also provides a computer readable storage medium on which a web page data crawling program is stored. When executed by a processor, the web page data crawling program implements the steps of the web page data crawling method described in any of the above embodiments.

The specific embodiments of the computer readable storage medium of the present invention are substantially the same as the above embodiments of the web page data crawling method and are not repeated here.

It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A web page data crawling method, characterized in that the web page data crawling method is applied to a web page data crawling system, the web page data crawling system includes a control server and a plurality of crawler servers connected to the control server, and the web page data crawling method includes the following steps:

when receiving first web page data sent by the crawler server, the control server performs feature extraction on the first web page data to obtain a first text feature vector;

calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;

if none of the similarity values is greater than the preset threshold, storing the first web page data in the preset storage database.

2. The web page data crawling method according to claim 1, characterized in that, before the step in which the control server, when receiving the first web page data sent by the crawler server, preprocesses the first web page data to obtain the first text feature vector, the method further includes:

calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample;

updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and performing iterative training on the training samples to obtain the trained fingerprint generation model.

3. The web page data crawling method according to claim 2, characterized in that the step of calculating the loss value by a preset loss function according to the second web page fingerprints and the true similarity value of the training sample includes:

calculating, according to the second web page fingerprints, the similarity value between the web page data in the training sample, denoted as the predicted similarity value;

calculating the loss value by the preset loss function according to the predicted similarity value and the true similarity value of the training sample;

The preset loss function is as follows:

4. The web page data crawling method according to claim 1, characterized in that the step in which the control server, when receiving the first web page data sent by the crawler server, performs feature extraction on the first web page data to obtain the first text feature vector includes:

when receiving the first web page data sent by the crawler server, the control server performs word segmentation on the first web page data to obtain a first word set;

calculating the weight of each word in the second word set according to a preset rule, and obtaining the first text feature vector according to the words in the second word set and their weights.

5. The web page data crawling method according to claim 4, characterized in that the step of calculating the weight of each word in the second word set according to a preset rule, and obtaining the first text feature vector according to the words in the second word set and their weights, includes:

calculating the term frequency-inverse document frequency (TF-IDF) of each word according to its term frequency and inverse document frequency, and using the TF-IDF as the weight;

selecting a preset number of weights according to their magnitude, and generating the first text feature vector according to the selected weights and the words corresponding to the selected weights.

6. The web page data crawling method according to claim 1, characterized in that the web page data crawling method further includes:

when receiving a uniform resource locator (URL) to be crawled sent by the crawler server, calculating the hash value of the URL to be crawled;

if the hash value does not exist in the preset Redis database, adding the URL to be crawled to a preset to-be-crawled queue, so that the crawler server obtains the URL to be crawled from the preset to-be-crawled queue and crawls new web page data according to the URL to be crawled;

7. The web page data crawling method according to any one of claims 1 to 6, characterized in that the web page data crawling method further includes:

counting, within a preset period, the amount of data sent by the crawler servers and the amount of data stored in the preset storage database, and generating a corresponding statistical report according to the amount of data sent and the amount of data stored;

8. A web page data crawling device, characterized in that the web page data crawling device includes:

a first feature extraction module, configured to, when the control server receives first web page data sent by the crawler server, perform feature extraction on the first web page data to obtain a first text feature vector;

a first fingerprint generation module, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;

a similarity value judgment module, configured to calculate similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and to judge whether any of the similarity values is greater than a preset threshold;

a web page data storage module, configured to store the first web page data in the preset storage database if none of the similarity values is greater than the preset threshold.

9. A web page data crawling system, characterized in that the web page data crawling system includes a control server and a plurality of crawler servers connected to the control server, and further includes a memory, a processor, and a web page data crawling program stored in the memory and executable by the processor, wherein the web page data crawling program, when executed by the processor, implements the steps of the web page data crawling method according to any one of claims 1 to 7.

10. A computer readable storage medium, characterized in that a web page data crawling program is stored on the computer readable storage medium, wherein the web page data crawling program, when executed by a processor, implements the steps of the web page data crawling method according to any one of claims 1 to 7.