CN109918554A - Web data crawling method, device, system and computer readable storage medium - Google Patents
- Publication number
- CN109918554A (application number CN201910113261.7A)
- Authority
- CN
- China
- Prior art keywords
- web data
- web
- data
- similarity value
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Abstract
The present invention provides a web page data crawling method and relates to the field of data crawling. The method is applied to a web page data crawling system that includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web page data sent by a crawler server, performing feature extraction on the first web page data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold; and, if none is, storing the first web page data in the preset storage database. The present invention also provides a web page data crawling device, a system, and a computer-readable storage medium. The present invention can reduce the duplication rate of web page data crawled by web crawlers.
Description
Technical field
The present invention relates to the technical field of data crawling, and in particular to a web page data crawling method, device, system and computer-readable storage medium.
Background
With the rapid development of network technology, obtaining data over the internet has become an important channel through which people acquire information resources, and web crawlers have become the mainstream means of obtaining web page data. However, because internet content is freely reprinted and published across multiple websites, the data crawled by web crawlers is usually mixed with a large amount of redundant and duplicate data, which adversely affects subsequent data analysis. A method is therefore needed that removes duplicate data from the web page data crawled by web crawlers, so as to reduce the duplication rate of the crawled web page data.
Summary of the invention
The main purpose of the present invention is to provide a web page data crawling method, device, system and computer-readable storage medium, with the aim of reducing the duplication rate of web page data crawled by web crawlers.
To achieve the above object, the present invention provides a web page data crawling method. The method is applied to a web page data crawling system that includes a control server and multiple crawler servers connected to the control server, and the method includes:
when the control server receives first web page data sent by a crawler server, performing feature extraction on the first web page data to obtain a first text feature vector;
inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;
if none of the similarity values is greater than the preset threshold, storing the first web page data in the preset storage database.
Optionally, before the step in which the control server, on receiving the first web page data sent by the crawler server, performs feature extraction on the first web page data to obtain the first text feature vector, the method further includes:
obtaining training samples, the training samples including multiple labeled web page data combinations;
performing feature extraction on the training samples to obtain second text feature vectors;
inputting the second text feature vectors into an initial neural network model to obtain corresponding second web page fingerprints;
calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity values of the training samples;
updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and training iteratively on the training samples to obtain the trained fingerprint generation model.
Optionally, the step of calculating the loss value by a preset loss function according to the second web page fingerprints and the true similarity values of the training samples includes:
calculating, according to the second web page fingerprints, the similarity value between the web page data items in a training sample, denoted the predicted similarity value;
calculating the loss value by the preset loss function according to the predicted similarity value and the true similarity value of the training sample;
wherein the preset loss function is defined in terms of the predicted similarity value and the true similarity value c of the training sample.
Optionally, the step in which the control server, on receiving the first web page data sent by the crawler server, performs feature extraction on the first web page data to obtain the first text feature vector includes:
when the control server receives the first web page data sent by the crawler server, performing word segmentation on the first web page data to obtain a first word segmentation set;
filtering the stop words out of the first word segmentation set according to a preset stop-word list to obtain a second word segmentation set;
calculating the weight of each word in the second word segmentation set according to preset rules, and obtaining the first text feature vector from the words in the second word segmentation set and their weights.
Optionally, the step of calculating the weight of each word in the second word segmentation set according to preset rules, and obtaining the first text feature vector from the words in the second word segmentation set and their weights, includes:
calculating the term frequency and the inverse document frequency of each word in the second word segmentation set;
calculating the term frequency-inverse document frequency (TF-IDF) of each word from its term frequency and inverse document frequency, and using the TF-IDF as the weight;
selecting a preset number of weights according to the size of the weights, and generating the first text feature vector from the selected weights and the words corresponding to them.
Optionally, the web page data crawling method further includes:
when a to-be-crawled uniform resource locator (URL) sent by the crawler server is received, calculating the hash value of the to-be-crawled URL;
detecting whether the hash value exists in a preset Redis key-value storage database;
if the hash value does not exist in the preset Redis database, adding the to-be-crawled URL to a preset to-be-crawled queue, so that the crawler server obtains the to-be-crawled URL from the preset to-be-crawled queue and crawls new web page data according to it;
if the hash value exists in the preset Redis database, deleting the to-be-crawled URL.
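The URL deduplication steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: a Python `set` and a `deque` stand in for the Redis key-value store and the to-be-crawled queue, SHA-1 is an assumed choice of hash function (the source does not name one), and recording the hash at the moment the URL is queued is also an assumption. The class and method names are hypothetical.

```python
import hashlib
from collections import deque


class UrlDeduplicator:
    """Tracks hashes of known URLs; in-memory stand-ins replace Redis."""

    def __init__(self):
        self.seen_hashes = set()  # stand-in for the Redis key-value store
        self.pending = deque()    # stand-in for the to-be-crawled queue

    @staticmethod
    def url_hash(url: str) -> str:
        # SHA-1 is an assumption; the source only says "hash value".
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def submit(self, url: str) -> bool:
        """Queue the URL if its hash is unknown; otherwise drop it."""
        digest = self.url_hash(url)
        if digest in self.seen_hashes:
            return False  # hash already stored: the URL is discarded
        self.seen_hashes.add(digest)  # assumed: hash is recorded on queueing
        self.pending.append(url)
        return True
```

In a deployment along the lines the text describes, the set membership test and insertion would be replaced by the corresponding operations on the Redis database held by the control server.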
Optionally, the web page data crawling method further includes:
counting, within a preset period, the amount of data sent by the crawler servers and the amount of data stored in the preset storage database, and generating a corresponding statistical report from the two amounts;
sending the statistical report to a preset operation terminal so that staff can analyze data duplication.
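A statistical report of this kind could be assembled as in the following hedged sketch. The source only says that the sent and stored amounts are counted and reported; the field names, the function name, and the definition of the duplication rate as the share of sent items that were not stored are all assumptions made for illustration.

```python
def build_duplication_report(sent_count: int, stored_count: int) -> dict:
    """Summarise one period: items sent by crawlers vs. items stored."""
    if sent_count < 0 or stored_count < 0 or stored_count > sent_count:
        raise ValueError("counts must satisfy 0 <= stored_count <= sent_count")
    rejected = sent_count - stored_count
    # Assumed definition: share of sent items rejected as duplicates.
    rate = rejected / sent_count if sent_count else 0.0
    return {
        "sent": sent_count,
        "stored": stored_count,
        "rejected": rejected,
        "duplication_rate": rate,
    }
```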
In addition, to achieve the above object, the present invention also provides a web page data crawling device, which includes:
a first feature extraction module, configured to perform feature extraction on the first web page data to obtain a first text feature vector when the control server receives the first web page data sent by the crawler server;
a first fingerprint generation module, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
a similarity value judgment module, configured to calculate similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and to judge whether any of the similarity values is greater than a preset threshold;
a web page data storage module, configured to store the first web page data in the preset storage database if none of the similarity values is greater than the preset threshold.
In addition, to achieve the above object, the present invention also provides a web page data crawling system. The system includes a control server and multiple crawler servers connected to the control server, and further includes a memory, a processor, and a web page data crawling program that is stored on the memory and executable by the processor, wherein the steps of the web page data crawling method described above are implemented when the web page data crawling program is executed by the processor.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium on which a web page data crawling program is stored, wherein the steps of the web page data crawling method described above are implemented when the web page data crawling program is executed by a processor.
The present invention provides a web page data crawling method, device, system and computer-readable storage medium. The method is applied to a web page data crawling system built on distributed crawler technology; the system includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web page data sent by a crawler server, performing feature extraction on the first web page data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the calculated similarity values is greater than a preset threshold; and, if none is, storing the first web page data in the preset storage database. By building a fingerprint generation model, the present invention generates a fingerprint for each crawled web page data item and compares it with the fingerprints of the web page data already crawled and saved in the storage database to judge whether the item is a duplicate, and only non-duplicate data is saved. Storage of duplicate web page data is thereby avoided, so the present invention can solve the prior-art problem that data crawled by web crawlers has high redundancy. At the same time, combining the method with distributed crawler technology allows web page data to be crawled quickly, so that data collection is completed in a relatively short time and the efficiency of data collection is improved.
Description of the drawings
Fig. 1 is a schematic structural diagram of the terminal in the hardware operating environment involved in the embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the web page data crawling method of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the web page data crawling method of the present invention;
Fig. 4 is a functional block diagram of the first embodiment of the web page data crawling device of the present invention.
The realization of the object, the functional characteristics and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of the terminal in the hardware operating environment involved in the embodiments of the present invention.
The terminal in the embodiments of the present invention is the control server, which may be a terminal device such as a PC (personal computer), a laptop, or a server.
As shown in Fig. 1, the terminal may include a processor 1001, such as a CPU (Central Processing Unit), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 implements the connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wireless-Fidelity, Wi-Fi, interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory such as a magnetic disk memory; optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001. Those skilled in the art will understand that the hardware configuration shown in Fig. 1 does not limit the present invention; the terminal may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
With continued reference to Fig. 1, the memory 1005, as a kind of computer storage medium, may contain an operating system, a network communication module and a web page data crawling program. In Fig. 1, the network communication module can be used to connect to a server and exchange data with it, and the processor 1001 can be used to call the web page data crawling program stored in the memory 1005 and execute the web page data crawling method provided by the embodiments of the present invention.
Based on the above hardware configuration, the embodiments of the web page data crawling method of the present invention are proposed.
The present invention provides a web page data crawling method.
Referring to Fig. 2, Fig. 2 is a flow diagram of the first embodiment of the web page data crawling method of the present invention.
In this embodiment, the web page data crawling method is applied to a web page data crawling system that includes a control server and multiple crawler servers connected to the control server, and the method includes:
Step S10: when the control server receives first web page data sent by a crawler server, performing feature extraction on the first web page data to obtain a first text feature vector;
In this embodiment, the web page data crawling method is applied to a web page data crawling system built on the distributed crawler technology Scrapy-Redis. Scrapy-Redis is a Redis-based distributed component for Scrapy; it uses Redis to store and schedule the URL tasks to be crawled and to store the crawled web page data for subsequent processing. The system includes a control server (Master) and multiple crawler servers (Slave), each of which is communicatively connected to the control server. The control server is mainly responsible for generating web page fingerprints, deduplicating by fingerprint, distributing URL (Uniform Resource Locator) tasks, deduplicating URLs, and storing web page data. The crawler servers are mainly responsible for running the crawler program to crawl web page data and for submitting the crawled web page data, together with the new URL tasks found during crawling, to the Redis database on the control server.
In this embodiment, the web page data crawling method is implemented by the control server. During crawling, a crawler server obtains a to-be-crawled URL task from the Redis-based to-be-crawled queue on the control server, crawls the corresponding web page data (denoted the first web page data) according to that URL task, and then sends the crawled first web page data to the control server. On receiving the first web page data sent by the crawler server, the control server performs feature extraction on it to obtain the first text feature vector. Specifically, step S10 includes:
Step a1: when the control server receives the first web page data sent by the crawler server, performing word segmentation on the first web page data to obtain a first word segmentation set;
On receiving the first web page data sent by the crawler server, the control server performs word segmentation on it to obtain the first word segmentation set. Word segmentation can be performed with a segmentation tool, such as the Chinese lexical analysis system ICTCLAS, the Tsinghua University Chinese lexical analysis toolkit THULAC, or the Language Technology Platform LTP. Based on the characteristics of the Chinese language, segmentation cuts each piece of Chinese text in the data into individual words and tags their parts of speech.
Step a2: filtering the stop words out of the first word segmentation set according to a preset stop-word list to obtain a second word segmentation set;
Next, the words in the first word segmentation set are tagged with their parts of speech, and the stop words in the first word segmentation set are filtered out according to the preset stop-word list to obtain the second word segmentation set. Stop words fall mainly into two classes. The first class consists of words that are used too frequently, such as "I" and "just", which appear in almost every document. The second class consists of words that occur very often in text but carry no meaning of their own and serve a purpose only within a complete sentence, including modal particles, adverbs, prepositions and conjunctions. Filtering out stop words avoids the influence of meaningless words, which improves the accuracy of the generated text feature vector and, in turn, the accuracy of web page duplicate detection.
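Stop-word filtering as described in step a2 can be sketched as follows. The sketch assumes the text has already been segmented into a list of tokens by one of the tools named above (no call into those tools is shown), and the function name and sample stop-word list are hypothetical.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word list, order preserved."""
    stop = set(stop_words)  # set membership keeps each lookup cheap
    return [token for token in tokens if token not in stop]


# Usage: the first list plays the role of the first word segmentation set,
# the result the role of the second word segmentation set.
first_set = ["I", "just", "read", "the", "crawler", "report"]
second_set = filter_stop_words(first_set, ["I", "just", "the"])
```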
Step a3: calculating the weight of each word in the second word segmentation set according to preset rules, and obtaining the first text feature vector from the words in the second word segmentation set and their weights.
After the second word segmentation set is obtained, the weight of each word in it is calculated according to preset rules, and the first text feature vector is obtained from the words in the second word segmentation set and their weights. Specifically, step a3 includes:
Step a31: calculating the term frequency and the inverse document frequency of each word in the second word segmentation set;
In this embodiment, the term frequency and the inverse document frequency of each word in the second word segmentation set are calculated first. The term frequency TF indicates how frequently a word occurs in the first web page data, and the inverse document frequency IDF indicates how the word is distributed across all web pages and is a measure of the general importance of a word. TF and IDF are calculated from the following quantities: n_i is the number of times word i occurs in the first web page data, n is the total number of keywords in the first web page data, N is the total number of web pages in the web page data set, and N_i is the number of web pages in the data set that contain word i.
Step a32: calculating the term frequency-inverse document frequency of each word from its term frequency and inverse document frequency, and using the term frequency-inverse document frequency as the weight;
Then, the term frequency-inverse document frequency (TF-IDF) of each word is calculated from the word's term frequency and inverse document frequency and is used as the weight. The TF-IDF is the product of the term frequency and the inverse document frequency, i.e. TF-IDF = TF × IDF.
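The weighting in steps a31 and a32 can be sketched in Python as follows. The sketch assumes the textbook forms TF = n_i / n and IDF = log(N / N_i) with a natural logarithm; the source's own formulas may differ in details such as smoothing or log base. The function name and the choice to give a zero weight to words absent from the corpus are also assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(tokens, corpus):
    """Weight each distinct token of one page by TF * IDF.

    tokens: filtered word list of the page being processed
    corpus: list of token lists, one per page in the data set
    """
    counts = Counter(tokens)   # n_i for every word i
    total = len(tokens)        # n
    n_pages = len(corpus)      # N
    weights = {}
    for word, n_i in counts.items():
        tf = n_i / total
        pages_with_word = sum(1 for page in corpus if word in page)  # N_i
        # Assumed: a word that occurs in no corpus page gets IDF 0.
        idf = math.log(n_pages / pages_with_word) if pages_with_word else 0.0
        weights[word] = tf * idf
    return weights
```

A word that appears in every page of the corpus gets an IDF of log(1) = 0 under this form, which is the usual way TF-IDF suppresses words with little discriminating power.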
Step a33: selecting a preset number of weights according to the size of the weights, and generating the first text feature vector from the selected weights and the words corresponding to them.
After the weight of each word is obtained, a preset number (for example, k) of weights is selected by size, that is, the k largest weights are selected and denoted w1, w2, ..., wk. The first text feature vector is then generated from the selected weights and the words corresponding to them. With the words corresponding to the selected weights denoted s1, s2, ..., sk, the first text feature vector corresponding to the first web page data is V = ((s1, w1), (s2, w2), ..., (sk, wk)). Optionally, the preset number is 10 to 20, i.e. k is a natural number between 10 and 20.
Of course, in a specific embodiment, weights smaller than a preset threshold may instead be filtered out, and the first text feature vector is then generated from the remaining weights and their corresponding words.
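Selecting the k largest weights in step a33 can be sketched as below. The function name is hypothetical, and breaking ties between equal weights alphabetically is an assumption added only to make the result deterministic; the source does not say how ties are handled.

```python
def top_k_feature_vector(weights, k=10):
    """Return the k highest-weighted (word, weight) pairs, largest first."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Usage with made-up weights; the result plays the role of V.
V = top_k_feature_vector({"crawler": 0.5, "page": 0.1, "redis": 0.3}, k=2)
```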
Step S20: inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
After the first text feature vector is obtained, it is input into the pre-trained fingerprint generation model to obtain the first web page fingerprint. The specific training process of the fingerprint generation model is described in the second embodiment below and is not repeated here.
Step S30: calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;
With the trained fingerprint generation model, a uniquely corresponding web page fingerprint can be generated for each crawled web page data item. Web page data items with the same fingerprint can be regarded as identical, and web page data items with similar content have similar fingerprints. The similarity values between the fingerprint of the first web page data (the first web page fingerprint) and the stored web page fingerprints in the preset storage database can therefore be used to judge whether the first web page data duplicates web page data that has already been crawled and stored. Specifically, since a web page fingerprint is a hash string, the Hamming distance between web page fingerprints can be calculated and used as the similarity value, and it is then detected whether any of the calculated similarity values is greater than the preset threshold. In information coding, the Hamming distance is the number of positions at which the corresponding symbols of two valid codewords differ. The method of calculating the Hamming distance can be found in the prior art and is not repeated here. The preset threshold can be set according to the actual situation and is not specifically limited here.
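The Hamming distance mentioned above can be computed as in the following sketch. Because a larger Hamming distance means less similar fingerprints, the sketch also converts the distance into a score where larger means more similar, so that it can be compared with a "greater than threshold" test; that conversion (one minus the normalised distance) is an assumption, since the source only says the Hamming distance is used as the similarity value. Fingerprints are assumed to be equal-length strings.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def fingerprint_similarity(a: str, b: str) -> float:
    """Assumed conversion: 1.0 for identical fingerprints, 0.0 if all differ."""
    if not a and not b:
        return 1.0  # two empty fingerprints are trivially identical
    return 1.0 - hamming_distance(a, b) / len(a)
```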
Step S40: if none of the similarity values is greater than the preset threshold, storing the first web page data in the preset storage database.
If none of the similarity values between the first web page fingerprint and the stored web page fingerprints in the preset storage database is greater than the preset threshold, the first web page data does not duplicate the web page data already crawled, and the first web page data is stored in the preset storage database for subsequent use.
If any of the similarity values between the first web page fingerprint and the stored web page fingerprints in the preset storage database is greater than the preset threshold, the first web page data overlaps heavily with one or more web page data items already crawled. The first web page data is most likely a new web page produced by reprinting existing content with minor additions and modifications, and web page data similar to it has already been crawled and stored. In this case the first web page data is deleted and is not stored in the preset storage database.
The present invention provides a web page data crawling method applied to a web page data crawling system built on distributed crawler technology; the system includes a control server and multiple crawler servers connected to the control server. The method comprises: when the control server receives first web page data sent by a crawler server, performing feature extraction on the first web page data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint; calculating similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the calculated similarity values is greater than a preset threshold; and, if none is, storing the first web page data in the preset storage database. By building a fingerprint generation model, the present invention generates a fingerprint for each crawled web page data item and compares it with the fingerprints of the web page data already crawled and saved in the storage database to judge whether the item is a duplicate, and only non-duplicate data is saved. Storage of duplicate web page data is thereby avoided, so the present invention can solve the prior-art problem that data crawled by web crawlers has high redundancy. At the same time, combining the method with distributed crawler technology allows web page data to be crawled quickly, so that data collection is completed in a relatively short time and the efficiency of data collection is improved.
Further, referring to Fig. 3, Fig. 3 is a flow diagram of the second embodiment of the web page data crawling method of the present invention.
Based on the first embodiment shown in Fig. 2, before step S10 the web page data crawling method further includes:
Step S50: obtaining training samples, the training samples including multiple labeled web page data combinations;
This embodiment provides the specific training process of the fingerprint generation model. First, training samples are obtained, the training samples including multiple labeled web page data combinations. Optionally, each web page data combination contains two web page data items, and labeling consists of marking each combination with its true similarity value. The web page data may come from various websites or may be edited and created as needed.
Step S60: performing feature extraction on the training samples to obtain second text feature vectors;
Then, feature extraction is performed on the training samples to obtain the second text feature vectors. Specifically, since each training sample is a combination of two pieces of web data, feature extraction first performs word segmentation and part-of-speech tagging on each piece of web data in each training sample, and then filters out stop words according to a preset stop-word list to obtain the corresponding segmented-word set. Next, the term frequency and inverse document frequency of each segmented word in the set are calculated, the term frequency-inverse document frequency (TF-IDF) is computed from them and used as the weight, and the second text feature vector is obtained from each segmented word in the set and its weight. Correspondingly, each training sample has two text feature vectors. The generation process of the second text feature vector is essentially the same as that of the first text feature vector in the first embodiment and is not repeated here.
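The feature extraction described above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization in place of a real word-segmentation tool, skips part-of-speech tagging, uses a made-up stop-word list, and computes the inverse document frequency only over the documents passed in. The names `STOP_WORDS`, `tokenize`, and `tf_idf_vectors` are hypothetical and do not come from the patent.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the text only refers to a "preset stop-word list".
STOP_WORDS = {"the", "a", "of", "and", "is"}


def tokenize(text):
    """Whitespace tokenizer standing in for a real word-segmentation tool."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Return one {word: TF-IDF weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # term frequency * inverse document frequency (assumed log(N / df))
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


# One training sample is a pair of web documents, so it yields two vectors.
pair = ["the crawler stores web data", "the crawler skips duplicate data"]
vec_a, vec_b = tf_idf_vectors(pair)
```

With this assumed IDF form, a word that occurs in every document of the pair receives a weight of zero, so only the words that distinguish the two pages contribute to the vectors.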
Step S70: inputting the second text feature vectors into an initial neural network model to obtain corresponding second web page fingerprints;
After the second text feature vectors are obtained, they are input into the initial neural network model to obtain the corresponding second web page fingerprints. Specifically, the two second text feature vectors corresponding to a training sample are input into the initial neural network model in turn to obtain the second web page fingerprints. Correspondingly, each training sample also has two second web page fingerprints.
Step S80: calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample;
The loss value is calculated by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample. Specifically, step S80 includes:
Step b1: calculating, according to the second web page fingerprints, the similarity value between the web data in the training sample, recorded as the predicted similarity value;
In this embodiment, the similarity value between the two pieces of web data in a training sample is first calculated from their second web page fingerprints and recorded as the predicted similarity value. Specifically, the similarity value can be obtained by calculating the Hamming distance.
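As an illustration of the Hamming-distance approach, the following Python sketch compares two fingerprints given as equal-length bit strings. The text does not state how the distance is converted into a similarity value; the normalization `1 - distance / length` used here is an assumption, and the function names are hypothetical.

```python
def hamming_distance(fp_a, fp_b):
    """Count the positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(fp_a, fp_b))


def predicted_similarity(fp_a, fp_b):
    """Assumed normalization: 1.0 for identical fingerprints, 0.0 for opposite ones."""
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    return 1.0 - hamming_distance(fp_a, fp_b) / len(fp_a)


fp_1 = "10110010"
fp_2 = "10100011"
distance = hamming_distance(fp_1, fp_2)        # differs at two positions
similarity = predicted_similarity(fp_1, fp_2)  # 1 - 2/8
```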
Step b2: calculating the loss value by a preset loss function according to the predicted similarity value and the true similarity value of the training sample;
After the predicted similarity value is calculated, the loss value is calculated by the preset loss function according to the predicted similarity value and the true similarity value of the training sample. The preset loss function is as follows: (formula not reproduced in this text)
where the symbol missing from this text denotes the predicted similarity value, and c is the true similarity value of the training sample.
Step S90: updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and iteratively training on the training samples to obtain the trained fingerprint generation model.
Finally, the parameters of the initial neural network model are updated by a gradient descent algorithm according to the calculated loss value, and iterative training is performed on each training sample. That is, the gradients of the nodes in each layer of the initial neural network model are updated according to the loss value, and the weight parameters of each node are then updated. The second text feature vectors corresponding to the training samples are input continually and iteration proceeds until the network converges, i.e., until the loss value stably falls within a small range (for example, below a preset threshold or at a minimum). At this point the trained neural network model, i.e., the trained fingerprint generation model, is obtained. The gradient descent algorithm can handle optimization over large-scale sample data; for the specific algorithm, reference may be made to the prior art, and details are not repeated here.
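The training loop can be sketched in Python under several loudly stated assumptions, since the loss formula and the network structure are not given in this text. The sketch replaces the neural network with a single layer of sigmoid units, uses a soft (real-valued) fingerprint so that the similarity is differentiable, assumes a squared-error loss between predicted and true similarity, and estimates gradients by finite differences rather than backpropagation. All names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_fingerprint(weights, features):
    """One sigmoid unit per fingerprint bit (a stand-in for the neural network)."""
    return [sigmoid(sum(w * x for w, x in zip(row, features))) for row in weights]


def predicted_similarity(weights, feat_a, feat_b):
    fa = soft_fingerprint(weights, feat_a)
    fb = soft_fingerprint(weights, feat_b)
    return 1.0 - sum((p - q) ** 2 for p, q in zip(fa, fb)) / len(fa)


def loss(weights, samples):
    """Assumed squared-error loss between predicted and true similarity."""
    return sum((predicted_similarity(weights, a, b) - c) ** 2
               for a, b, c in samples) / len(samples)


def train(weights, samples, lr=0.5, epochs=300, eps=1e-5):
    """Plain gradient descent; updates `weights` in place and returns the loss history."""
    history = [loss(weights, samples)]
    for _ in range(epochs):
        grads = [[0.0] * len(row) for row in weights]
        for i, row in enumerate(weights):
            for j in range(len(row)):
                original = row[j]
                row[j] = original + eps
                up = loss(weights, samples)
                row[j] = original - eps
                down = loss(weights, samples)
                row[j] = original
                grads[i][j] = (up - down) / (2 * eps)  # central difference
        for i, row in enumerate(weights):
            for j in range(len(row)):
                row[j] -= lr * grads[i][j]
        history.append(loss(weights, samples))
    return history


# Each sample: (feature vector A, feature vector B, true similarity value).
samples = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], 1.0),  # near-duplicate pair
    ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.0),  # unrelated pair
]
# Small deterministic initial weights: 4 fingerprint bits, 3 input features.
weights = [[0.1 * ((i + j) % 3 - 1) for j in range(3)] for i in range(4)]
history = train(weights, samples)
```

In practice training would stop once the loss falls below a preset threshold, as the text describes; the fixed epoch count here only keeps the sketch short.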
Further, based on the above embodiments, a third embodiment of the web data crawling method of the present invention is proposed.
In this embodiment, after step S40, the web data crawling method may further include:
Step A: upon receiving a to-be-crawled uniform resource locator (URL) sent by the crawler server, calculating the hash value of the to-be-crawled URL;
Step B: detecting whether the hash value exists in a preset Redis (key-value store) database;
In this embodiment, while crawling data, the crawler server can extract new URLs (Uniform Resource Locators) from the web pages it has fetched and send each new URL (recorded as a to-be-crawled URL) to the control server. Upon receiving a to-be-crawled URL sent by the crawler server, the control server calculates its hash value and then detects whether the hash value exists in the preset Redis (key-value store) database. For the method of calculating the hash value, reference may be made to the prior art, and details are not repeated here.
If the hash value does not exist in the preset Redis database, step C is executed: adding the to-be-crawled URL to a preset to-be-crawled queue, so that the crawler server obtains the to-be-crawled URL from the preset to-be-crawled queue and crawls new web data according to it;
If the hash value exists in the preset Redis database, step D is executed: deleting the to-be-crawled URL.
If the hash value does not exist in the preset Redis database, the to-be-crawled URL has not been crawled before. In this case, the to-be-crawled URL is added to the preset to-be-crawled queue, so that the crawler server can subsequently obtain it from the queue and crawl new web data according to it.
If the hash value exists in the preset Redis database, the to-be-crawled URL has been crawled before. In this case, the to-be-crawled URL is deleted, i.e., it is not added to the preset to-be-crawled queue.
In this embodiment, deduplicating URLs avoids fetching the same web page repeatedly, which further reduces the repetition rate of the crawled web data.
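Steps A to D can be sketched in Python as follows. To keep the example self-contained, an in-memory `set` stands in for the Redis database and a `deque` stands in for the to-be-crawled queue; SHA-256 is an arbitrary choice of hash function, and recording the hash at enqueue time is an assumption, since the text does not say when hashes are written. The class and method names are hypothetical.

```python
import hashlib
from collections import deque


class UrlScheduler:
    """In-memory stand-in: a set plays the role of the Redis hash store."""

    def __init__(self):
        self.seen_hashes = set()   # assumed to hold hashes of already-queued URLs
        self.pending = deque()     # the to-be-crawled queue

    @staticmethod
    def url_hash(url):
        # Step A: compute the hash value of the to-be-crawled URL.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def submit(self, url):
        """Return True if the URL was queued, False if it was dropped as a duplicate."""
        digest = self.url_hash(url)
        if digest in self.seen_hashes:   # Step B: hash already known
            return False                 # Step D: discard the URL
        self.seen_hashes.add(digest)
        self.pending.append(url)         # Step C: add to the queue
        return True

    def next_url(self):
        """Called by a crawler server to fetch the next URL, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None


scheduler = UrlScheduler()
first = scheduler.submit("https://example.com/a")
second = scheduler.submit("https://example.com/a")  # same URL again
```

A deployment with several crawler servers would replace the set with a shared store so that all servers see the same hashes.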
Further, based on the above embodiments, a fourth embodiment of the web data crawling method of the present invention is proposed.
In this embodiment, after step S40 of the first or second embodiment, the web data crawling method may further include:
Step E: counting the data sending volume of the crawler server and the data storage volume of the preset storage database within a preset period, and generating a corresponding statistical report according to the data sending volume and the data storage volume;
Step F: sending the statistical report to a preset work terminal so that staff can perform data duplication analysis.
In this embodiment, to help staff understand how much web data is duplicated, the control server can count the data sending volume of the crawler server and the data storage volume of the preset storage database within a preset period, and generate a corresponding statistical report according to the two. The preset period may be one day, one week, and so on; it can be set according to the actual situation and is not limited here. The statistics time is determined by the preset period; for example, when the preset period is one week, statistics are collected once a week. The statistical report can show the data sending volume, the data storage volume, the data storage rate (data storage rate = data storage volume / data sending volume), and so on.
After the statistical report is generated, it is sent to the preset work terminal so that staff can perform data duplication analysis.
Further, since the crawler server can crawl multiple different websites, the web data received by the control server also comes from different websites. When generating the statistical report, classification analysis can therefore be performed with the website (data source) as a dimension to obtain the data storage rate of each website. A website with a low storage rate can be regarded as one with relatively serious data duplication, and staff can stop crawling data from that website to reduce server resource consumption.
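The per-website storage rate can be computed as in the following Python sketch. The record layout `(site, sent_count, stored_count)`, the function names, and the cut-off value passed to `low_rate_sites` are all hypothetical; the text defines only the ratio itself.

```python
def storage_report(records):
    """records: iterable of (site, sent_count, stored_count) for one period."""
    totals = {}
    for site, sent, stored in records:
        site_sent, site_stored = totals.get(site, (0, 0))
        totals[site] = (site_sent + sent, site_stored + stored)
    report = {}
    for site, (sent, stored) in totals.items():
        report[site] = {
            "sent": sent,
            "stored": stored,
            # storage rate = storage volume / sending volume; None if nothing was sent
            "storage_rate": stored / sent if sent else None,
        }
    return report


def low_rate_sites(report, min_rate):
    """Sites whose storage rate is below min_rate, i.e. with heavy duplication."""
    return sorted(site for site, row in report.items()
                  if row["storage_rate"] is not None and row["storage_rate"] < min_rate)


records = [("site-a", 100, 90), ("site-b", 200, 20), ("site-a", 100, 70)]
report = storage_report(records)
candidates = low_rate_sites(report, 0.5)
```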
In addition, the statistical reports can be further aggregated at preset intervals to obtain a summary statistical report. For example, when statistics are collected weekly, the weekly statistical reports can be aggregated every three months to obtain a quarterly statistical report. The quarterly statistical report can graphically show the trends of the data sending volume, the data storage volume, and the data storage rate. The summary statistical report is then sent to the preset work terminal so that staff can understand the overall duplication of the data.
The present invention also provides a web data crawling device.
Referring to Fig. 4, Fig. 4 is a functional block diagram of the first embodiment of the web data crawling device of the present invention.
In this embodiment, the web data crawling device includes:
A first feature extraction module 10, configured for the control server to perform feature extraction on first web data upon receiving the first web data sent by the crawler server, to obtain a first text feature vector;
A first fingerprint generation module 20, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
A similarity value judgment module 30, configured to calculate the similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judge whether any of the similarity values is greater than a preset threshold;
A web data storage module 40, configured to store the first web data into the preset storage database if none of the similarity values is greater than the preset threshold.
Each virtual functional module of the above web data crawling device is stored in the memory 1005 of the web data crawling system shown in Fig. 1 and is used to implement all functions of the web data crawling program. When executed by the processor 1001, the modules generate a fingerprint for the crawled web data, compare it with the fingerprints of the web data already crawled and saved in the preset storage database to judge whether the data is a duplicate, and then reject duplicate web data.
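The interplay of the similarity judgment and storage modules can be sketched in Python as follows. The sketch assumes fingerprints are equal-length bit strings, measures similarity as the share of matching positions, and uses a plain dictionary in place of the preset storage database; the threshold value 0.9 and all names are hypothetical.

```python
def similarity(fp_a, fp_b):
    """Share of matching positions; assumes equal-length, non-empty fingerprints."""
    matches = sum(x == y for x, y in zip(fp_a, fp_b))
    return matches / len(fp_a)


def store_if_new(page, fingerprint, database, threshold=0.9):
    """Store the page unless some stored fingerprint is more similar than threshold."""
    for stored_fp in database:
        if similarity(fingerprint, stored_fp) > threshold:
            return False  # near-duplicate found: reject the page
    database[fingerprint] = page
    return True


database = {}  # maps fingerprint -> page content
kept_1 = store_if_new("page-1", "10110010", database)
kept_2 = store_if_new("page-2", "10110010", database)  # identical fingerprint
kept_3 = store_if_new("page-3", "01001101", database)  # every bit differs
```

The linear scan over stored fingerprints is the simplest possible comparison; a larger store would need an index, which is outside what the text describes.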
Further, the web data crawling device further includes:
A training sample obtaining module, configured to obtain training samples, the training samples including multiple labeled web data combinations;
A second feature extraction module, configured to perform feature extraction on the training samples to obtain second text feature vectors;
A second fingerprint generation module, configured to input the second text feature vectors into an initial neural network model to obtain corresponding second web page fingerprints;
A loss value calculation module, configured to calculate a loss value by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample;
A fingerprint generation model training module, configured to update the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and iteratively train on the training samples to obtain the trained fingerprint generation model.
Further, the loss value calculation module includes:
A predicted similarity value calculation unit, configured to calculate, according to the second web page fingerprints, the similarity value between the web data in the training sample, recorded as the predicted similarity value;
A loss value calculation unit, configured to calculate the loss value by a preset loss function according to the predicted similarity value and the true similarity value of the training sample;
Wherein the preset loss function is as follows: (formula not reproduced in this text)
where the symbol missing from this text denotes the predicted similarity value, and c is the true similarity value of the training sample.
Further, the first feature extraction module 10 includes:
A word segmentation unit, configured for the control server to perform word segmentation on the first web data upon receiving the first web data sent by the crawler server, to obtain a first segmented-word set;
A stop-word filtering unit, configured to filter out the stop words in the first segmented-word set according to a preset stop-word list, to obtain a second segmented-word set;
A first feature obtaining unit, configured to calculate the weight of each segmented word in the second segmented-word set according to a preset rule, and obtain the first text feature vector according to each segmented word in the second segmented-word set and its weight.
Further, the first feature obtaining unit includes:
A first calculation subunit, configured to calculate the term frequency and inverse document frequency of each segmented word in the second segmented-word set;
A second calculation subunit, configured to calculate the term frequency-inverse document frequency (TF-IDF) of each segmented word according to its term frequency and inverse document frequency, and use the TF-IDF as the weight;
A first feature obtaining subunit, configured to select a preset number of weights according to the magnitude of the weights, and generate the first text feature vector according to the selected weights and the segmented words corresponding to them.
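The selection performed by the first feature obtaining subunit can be sketched in Python as follows. It assumes that "according to the magnitude of the weights" means keeping the largest weights, and it breaks ties alphabetically so the result is deterministic; the function name and the sample weights are hypothetical.

```python
def top_k_features(weights, k):
    """Return the k (word, weight) pairs with the largest weights.

    Ties are broken alphabetically by word so that the output is deterministic.
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical TF-IDF weights for the segmented words of one web page.
weights = {"crawler": 0.12, "fingerprint": 0.30, "page": 0.05, "redis": 0.30}
feature_vector = top_k_features(weights, 2)
```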
Further, the web data crawling device further includes:
A hash value calculation module, configured to calculate the hash value of a to-be-crawled uniform resource locator (URL) upon receiving the to-be-crawled URL sent by the crawler server;
A hash value detection module, configured to detect whether the hash value exists in a preset Redis (key-value store) database;
A URL adding module, configured to, if the hash value does not exist in the preset Redis database, add the to-be-crawled URL to a preset to-be-crawled queue, so that the crawler server obtains the to-be-crawled URL from the preset to-be-crawled queue and crawls new web data according to it;
A URL deletion module, configured to delete the to-be-crawled URL if the hash value exists in the preset Redis database.
Further, the web data crawling device further includes:
A statistical report generation module, configured to count the data sending volume of the crawler server and the data storage volume of the preset storage database within a preset period, and generate a corresponding statistical report according to the data sending volume and the data storage volume;
A statistical report sending module, configured to send the statistical report to a preset work terminal so that staff can perform data duplication analysis.
The functions of the modules in the above web data crawling device correspond to the steps in the above embodiments of the web data crawling method; their functions and implementation processes are not repeated here.
The present invention also provides a web data crawling system. The web data crawling system includes a control server and multiple crawler servers connected to the control server, and further includes a memory, a processor, and a web data crawling program stored on the memory and executable on the processor. When executed by the processor, the web data crawling program implements the steps of the web data crawling method described in any of the above embodiments.
The specific embodiments of the web data crawling system of the present invention are essentially the same as the embodiments of the web data crawling method described above and are not repeated here.
The present invention also provides a computer-readable storage medium on which a web data crawling program is stored. When executed by a processor, the web data crawling program implements the steps of the web data crawling method described in any of the above embodiments.
The specific embodiments of the computer-readable storage medium of the present invention are essentially the same as the embodiments of the web data crawling method described above and are not repeated here.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.
Claims (10)
1. A web data crawling method, characterized in that the web data crawling method is applied to a web data crawling system, the web data crawling system includes a control server and multiple crawler servers connected to the control server, and the web data crawling method includes the following steps:
The control server, upon receiving first web data sent by the crawler server, performs feature extraction on the first web data to obtain a first text feature vector;
Inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
Calculating the similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold;
If none of the similarity values is greater than the preset threshold, storing the first web data into the preset storage database.
2. The web data crawling method according to claim 1, characterized in that, before the step in which the control server, upon receiving the first web data sent by the crawler server, preprocesses the first web data to obtain the first text feature vector, the method further includes:
Obtaining training samples, the training samples including multiple labeled web data combinations;
Performing feature extraction on the training samples to obtain second text feature vectors;
Inputting the second text feature vectors into an initial neural network model to obtain corresponding second web page fingerprints;
Calculating a loss value by a preset algorithm according to the second web page fingerprints and the true similarity value of the training sample;
Updating the parameters of the initial neural network model by a gradient descent algorithm according to the loss value, and iteratively training on the training samples to obtain the trained fingerprint generation model.
3. The web data crawling method according to claim 2, characterized in that the step of calculating the loss value by a preset loss function according to the second web page fingerprints and the true similarity value of the training sample includes:
Calculating, according to the second web page fingerprints, the similarity value between the web data in the training sample, recorded as the predicted similarity value;
Calculating the loss value by the preset loss function according to the predicted similarity value and the true similarity value of the training sample;
Wherein the preset loss function is as follows: (formula not reproduced in this text)
where the symbol missing from this text denotes the predicted similarity value, and c is the true similarity value of the training sample.
4. The web data crawling method according to claim 1, characterized in that the step in which the control server, upon receiving the first web data sent by the crawler server, performs feature extraction on the first web data to obtain the first text feature vector includes:
The control server, upon receiving the first web data sent by the crawler server, performs word segmentation on the first web data to obtain a first segmented-word set;
Filtering out the stop words in the first segmented-word set according to a preset stop-word list to obtain a second segmented-word set;
Calculating the weight of each segmented word in the second segmented-word set according to a preset rule, and obtaining the first text feature vector according to each segmented word in the second segmented-word set and its weight.
5. The web data crawling method according to claim 4, characterized in that the step of calculating the weight of each segmented word in the second segmented-word set according to a preset rule, and obtaining the first text feature vector according to each segmented word in the second segmented-word set and its weight includes:
Calculating the term frequency and inverse document frequency of each segmented word in the second segmented-word set;
Calculating the term frequency-inverse document frequency (TF-IDF) of each segmented word according to its term frequency and inverse document frequency, and using the TF-IDF as the weight;
Selecting a preset number of weights according to the magnitude of the weights, and generating the first text feature vector according to the selected weights and the segmented words corresponding to them.
6. The web data crawling method according to claim 1, characterized in that the web data crawling method further includes:
Upon receiving a to-be-crawled uniform resource locator (URL) sent by the crawler server, calculating the hash value of the to-be-crawled URL;
Detecting whether the hash value exists in a preset Redis (key-value store) database;
If the hash value does not exist in the preset Redis database, adding the to-be-crawled URL to a preset to-be-crawled queue, so that the crawler server obtains the to-be-crawled URL from the preset to-be-crawled queue and crawls new web data according to it;
If the hash value exists in the preset Redis database, deleting the to-be-crawled URL.
7. The web data crawling method according to any one of claims 1 to 6, characterized in that the web data crawling method further includes:
Counting the data sending volume of the crawler server and the data storage volume of the preset storage database within a preset period, and generating a corresponding statistical report according to the data sending volume and the data storage volume;
Sending the statistical report to a preset work terminal so that staff can perform data duplication analysis.
8. A web data crawling device, characterized in that the web data crawling device includes:
A first feature extraction module, configured for the control server to perform feature extraction on first web data upon receiving the first web data sent by the crawler server, to obtain a first text feature vector;
A first fingerprint generation module, configured to input the first text feature vector into a pre-trained fingerprint generation model to obtain a first web page fingerprint;
A similarity value judgment module, configured to calculate the similarity values between the first web page fingerprint and the stored web page fingerprints in a preset storage database, and judge whether any of the similarity values is greater than a preset threshold;
A web data storage module, configured to store the first web data into the preset storage database if none of the similarity values is greater than the preset threshold.
9. A web data crawling system, characterized in that the web data crawling system includes a control server and multiple crawler servers connected to the control server, and further includes a memory, a processor, and a web data crawling program stored on the memory and executable by the processor, wherein the web data crawling program, when executed by the processor, implements the steps of the web data crawling method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a web data crawling program is stored on the computer-readable storage medium, wherein the web data crawling program, when executed by a processor, implements the steps of the web data crawling method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910113261.7A CN109918554A (en) | 2019-02-13 | 2019-02-13 | Web data crawling method, device, system and computer readable storage medium |
PCT/CN2019/118144 WO2020164276A1 (en) | 2019-02-13 | 2019-11-13 | Webpage data crawling method, apparatus and system, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910113261.7A CN109918554A (en) | 2019-02-13 | 2019-02-13 | Web data crawling method, device, system and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109918554A true CN109918554A (en) | 2019-06-21 |
Family
ID=66961585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910113261.7A Pending CN109918554A (en) | 2019-02-13 | 2019-02-13 | Web data crawling method, device, system and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109918554A (en) |
WO (1) | WO2020164276A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110275974A (en) * | 2019-06-28 | 2019-09-24 | 武汉轻工大学 | Data adding method, device, terminal device and the storage medium of sample data set |
CN110691125A (en) * | 2019-09-24 | 2020-01-14 | 上海富数科技有限公司 | System and method for realizing browser loading control based on heuristic algorithm |
CN111367962A (en) * | 2020-02-28 | 2020-07-03 | 北京金堤科技有限公司 | Database updating method and device, computer readable storage medium and electronic equipment |
CN111428179A (en) * | 2020-03-19 | 2020-07-17 | 北大方正集团有限公司 | Picture monitoring method and device and electronic equipment |
CN111538925A (en) * | 2020-04-09 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Method and device for extracting Uniform Resource Locator (URL) fingerprint features |
WO2020164276A1 (en) * | 2019-02-13 | 2020-08-20 | 平安科技(深圳)有限公司 | Webpage data crawling method, apparatus and system, and computer-readable storage medium |
CN112100473A (en) * | 2020-09-21 | 2020-12-18 | 工业互联网创新中心(上海)有限公司 | Crawler method based on application interface, terminal and storage medium |
CN112836111A (en) * | 2021-02-09 | 2021-05-25 | 沈阳麟龙科技股份有限公司 | URL crawling method, device, medium and electronic equipment of crawler system |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
CN113297525A (en) * | 2021-06-17 | 2021-08-24 | 恒安嘉新(北京)科技股份公司 | Webpage classification method and device, electronic equipment and storage medium |
CN113704586A (en) * | 2021-08-30 | 2021-11-26 | 泰戈特(北京)工程技术有限公司 | Duplicate removal target page determining method and device, computer equipment and computer readable storage medium |
CN115001955A (en) * | 2022-06-08 | 2022-09-02 | 苏州花园集信息科技有限公司 | Operation and maintenance data acquisition system and method thereof |
CN116894057A (en) * | 2023-07-17 | 2023-10-17 | 云达信息技术有限公司 | Python-based cloud service data collection processing method, device, equipment and medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645082A (en) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page duplicate-removing system based on parallel programming mode |
US20130030788A1 (en) * | 2011-07-29 | 2013-01-31 | At&T Intellectual Property I, L.P. | System and method for locating bilingual web sites |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
US20160335349A1 (en) * | 2015-05-13 | 2016-11-17 | Quixey, Inc. | Operator-Guided Application Crawling Architecture |
CN106598984A (en) * | 2015-10-16 | 2017-04-26 | 北京国双科技有限公司 | Data processing method and device of web crawler |
CN107590188A (en) * | 2017-08-08 | 2018-01-16 | 杭州灵皓科技有限公司 | A kind of reptile crawling method and its management system for automating vertical subdivision field |
CN108132948A (en) * | 2016-11-30 | 2018-06-08 | 北京国双科技有限公司 | Handle the method and apparatus for crawling webpage |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126512A (en) * | 2016-04-13 | 2016-11-16 | 北京天融信网络安全技术有限公司 | The Web page classification method of a kind of integrated study and device |
CN108647263B (en) * | 2018-04-28 | 2022-04-12 | 淮阴工学院 | Network address confidence evaluation method based on webpage segmentation crawling |
CN109918554A (en) * | 2019-02-13 | 2019-06-21 | 平安科技(深圳)有限公司 | Web data crawling method, device, system and computer readable storage medium |
-
2019
- 2019-02-13 CN CN201910113261.7A patent/CN109918554A/en active Pending
- 2019-11-13 WO PCT/CN2019/118144 patent/WO2020164276A1/en active Application Filing
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020164276A1 (en) * | 2019-02-13 | 2020-08-20 | 平安科技(深圳)有限公司 | Webpage data crawling method, apparatus and system, and computer-readable storage medium |
CN110275974A (en) * | 2019-06-28 | 2019-09-24 | 武汉轻工大学 | Method, device, terminal device and storage medium for adding data to a sample data set |
CN110691125A (en) * | 2019-09-24 | 2020-01-14 | 上海富数科技有限公司 | System and method for realizing browser loading control based on heuristic algorithm |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
CN111367962A (en) * | 2020-02-28 | 2020-07-03 | 北京金堤科技有限公司 | Database updating method and device, computer readable storage medium and electronic equipment |
CN111367962B (en) * | 2020-02-28 | 2024-01-30 | 北京金堤科技有限公司 | Database updating method and device, computer readable storage medium and electronic equipment |
CN111428179A (en) * | 2020-03-19 | 2020-07-17 | 北大方正集团有限公司 | Picture monitoring method and device and electronic equipment |
CN111428179B (en) * | 2020-03-19 | 2023-09-19 | 新方正控股发展有限责任公司 | Picture monitoring method and device and electronic equipment |
CN111538925A (en) * | 2020-04-09 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Method and device for extracting Uniform Resource Locator (URL) fingerprint features |
CN111538925B (en) * | 2020-04-09 | 2023-05-02 | 支付宝(中国)网络技术有限公司 | Uniform resource locator URL fingerprint feature extraction method and device |
CN112100473A (en) * | 2020-09-21 | 2020-12-18 | 工业互联网创新中心(上海)有限公司 | Crawler method based on application interface, terminal and storage medium |
CN112836111A (en) * | 2021-02-09 | 2021-05-25 | 沈阳麟龙科技股份有限公司 | URL crawling method, device, medium and electronic equipment of crawler system |
CN113297525A (en) * | 2021-06-17 | 2021-08-24 | 恒安嘉新(北京)科技股份公司 | Webpage classification method and device, electronic equipment and storage medium |
CN113297525B (en) * | 2021-06-17 | 2023-12-12 | 恒安嘉新(北京)科技股份公司 | Webpage classification method, device, electronic equipment and storage medium |
CN113704586A (en) * | 2021-08-30 | 2021-11-26 | 泰戈特(北京)工程技术有限公司 | Duplicate removal target page determining method and device, computer equipment and computer readable storage medium |
CN115001955A (en) * | 2022-06-08 | 2022-09-02 | 苏州花园集信息科技有限公司 | Operation and maintenance data acquisition system and method thereof |
CN116894057A (en) * | 2023-07-17 | 2023-10-17 | 云达信息技术有限公司 | Python-based cloud service data collection processing method, device, equipment and medium |
CN116894057B (en) * | 2023-07-17 | 2023-12-22 | 云达信息技术有限公司 | Python-based cloud service data collection processing method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
WO2020164276A1 (en) | 2020-08-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109918554A (en) | Web data crawling method, device, system and computer readable storage medium | |
US20200304550A1 (en) | Generic Event Stream Processing for Machine Learning | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device | |
CN107436875A (en) | File classification method and device | |
CN105389307A (en) | Statement intention category identification method and apparatus | |
JP2019133621A (en) | Collection of API documentation | |
CN110309446A (en) | Fast text content deduplication method, device, computer equipment and storage medium | |
CN111866004B (en) | Security assessment method, apparatus, computer system, and medium | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
CN109614319B (en) | Automatic testing method and device, electronic equipment and computer readable medium | |
CN110647995A (en) | Rule training method, device, equipment and storage medium | |
WO2019227711A1 (en) | Method and apparatus for generating influenza prediction model, and computer-readable storage medium | |
CN111324797A (en) | Method and device for acquiring data accurately at high speed | |
CN109325122A (en) | Vocabulary generation method, file classification method, device, equipment and storage medium | |
CN107247789A (en) | User interest acquisition method based on the Internet | |
CN110347806A (en) | Original text discriminating method, device, equipment and computer readable storage medium | |
CN110069686A (en) | User behavior analysis method, apparatus, computer device and storage medium | |
CN110083809A (en) | Contract terms similarity calculating method, device, equipment and readable storage medium | |
CN107688594B (en) | System and method for identifying risk events based on social information | |
CN110020214B (en) | Knowledge-fused social network streaming event detection system | |
Xhafa et al. | Apache Mahout's k-Means vs Fuzzy k-Means Performance Evaluation | |
CN110347934A (en) | Text data filtering method, device and medium | |
CN110489759A (en) | Text feature weighting and short text similarity calculation method, system and medium based on word frequency | |
CN114896141A (en) | Test case duplication removing method, device, equipment and computer readable storage medium | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |