CN108900623A - Web page text acquisition method and device based on dynamic IP - Google Patents

Web page text acquisition method and device based on dynamic IP

Info

Publication number
CN108900623A
Authority
CN
China
Prior art keywords
address
sliding block
vps
proxy server
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810760579.XA
Other languages
Chinese (zh)
Other versions
CN108900623B (en)
Inventor
董新建
董瑞朝
李贞�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bonnie Group Co Ltd
Original Assignee
Bonnie Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bonnie Group Co Ltd
Priority to CN201810760579.XA
Publication of CN108900623A
Application granted
Publication of CN108900623B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5007 Internet protocol [IP] addresses
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a web page text acquisition method and device based on dynamic IP. The method includes: monitoring the first IP addresses corresponding to multiple virtual private servers (VPS) in a network, and monitoring sliding block proxy servers; if a VPS meets a switching condition, instructing the VPS to dynamically generate a new second IP address and switching the first IP address corresponding to the VPS to the second IP address; and if a sliding block proxy server meets an unavailable condition, marking the sliding block proxy server as unavailable, so that a crawler server acquires web page text through the second IP address and an available sliding block proxy server. The device is used to execute the above method. By verifying the availability of the second IP address after it is dynamically generated, and switching the first IP address to the second IP address only if it is available, the invention avoids the situation in which a generated second IP address cannot be used to crawl the text of the web page to be crawled, and thereby improves crawling efficiency.

Description

Web page text acquisition method and device based on dynamic IP
Technical field
The present invention relates to the technical field of network security, and in particular to a web page text acquisition method and device based on dynamic IP.
Background
Many years ago, experts already described network big data as the "oil" of the information age. Moreover, through the continuous refinement and improvement of various network big data processing techniques, the value of network big data has been amply demonstrated in many fields such as commerce, medical care, energy, and the internet, and it has also generated huge revenue.
With the explosive growth of network big data applications in recent years, all industries have paid increasing attention to the ways and sources of acquiring network big data. Web crawler technology is one way to acquire network big data. On the one hand, a web crawler can exploit the openness of the internet to perform customized search and collection on network data sources; on the other hand, a web crawler can also draw on massive network big data by exploiting the large volume, high timeliness, and rich variety of internet information.
However, the contest between web crawlers and websites over network big data is like that between spear and shield. To guard their own network big data, some websites take measures to prevent web crawlers from crawling it, for example restricting crawler requests by setting verification codes, user blacklists, cookie encryption, and IP blocking, which makes acquiring network big data very inconvenient.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a web page text acquisition method and device based on dynamic IP, so as to solve the above technical problem.
In a first aspect, an embodiment of the present invention provides a web page text acquisition method based on dynamic IP, including:
Monitoring the first IP addresses corresponding to multiple virtual private servers (VPS) that correspond to a crawler server in a network, and monitoring sliding block proxy servers;
If the VPS meets a switching condition, instructing the VPS to dynamically generate a new second IP address, and switching the first IP address corresponding to the VPS to the second IP address;
If the sliding block proxy server meets an unavailable condition, marking the sliding block proxy server as unavailable, so that the crawler server acquires the text of a web page through the second IP address and an available sliding block proxy server;
Wherein the switching condition includes any one or a combination of the following:
The duration of use of the first IP address corresponding to the VPS exceeds a first preset duration;
The number of uses of the first IP address corresponding to the VPS exceeds a first preset number of times;
A prompt indicating overly frequent operation appears when web page content is acquired through the first IP address;
The unavailable condition includes any one or a combination of the following:
The number of uses of the sliding block proxy server exceeds a second preset number of times;
The duration of use of the sliding block proxy server exceeds a second preset duration;
The sliding block proxy server has performed a filter-string operation a third preset number of times during verification.
Further, the switching of the first IP address corresponding to the VPS to the second IP address includes:
According to information on the web page to be crawled, if the second IP address is determined to be an available IP, switching the first IP address corresponding to the VPS to the second IP address.
Further, the switching of the first IP address corresponding to the VPS to the second IP address includes:
If it is determined that the second IP address has not been used within a preset time period before the current time, switching the first IP address to the second IP address.
Further, the marking of the sliding block proxy server as unavailable if the sliding block proxy server meets the unavailable condition includes:
If the number of available sliding block proxy servers exceeds a preset threshold, marking a sliding block proxy server whose slide operation failed as unavailable.
Further, the method also includes:
Configuring an unmonitored time period, during which the VPS and the sliding block proxy server are not monitored.
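The unmonitored time period can be illustrated with a minimal sketch. The function names and the daily start/end representation of the window are assumptions made for illustration; the source only states that no monitoring takes place during the configured period.

```python
from datetime import time


def in_unmonitored_period(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the unmonitored window [start, end).

    The daily start/end representation is an assumption for illustration.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 23:00 to 01:00.
    return now >= start or now < end


def should_monitor(now: time, start: time, end: time) -> bool:
    """Monitoring of the VPS and proxy servers is skipped inside the window."""
    return not in_unmonitored_period(now, start, end)
```

A monitoring loop would call `should_monitor` before each check and skip the check when it returns `False`.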
Further, the method also includes:
Receiving an IP call request sent by the crawler server, the IP call request including the address of the web page to be crawled;
Allocating an available IP address to the crawler server according to the web page address and the switching condition, so that the crawler server can crawl the web page to be crawled corresponding to the web page address using the available IP address.
Further, the switching of the first IP address corresponding to the VPS to the second IP address includes:
Switching the first IP address corresponding to the VPS to the second IP address based on a WSGI interface.
Further, the allocating of an available IP address to the crawler server according to the web page address and the switching condition includes:
If the web page to be crawled requires login, querying an IP cache for a history IP previously called by the crawler server;
If the history IP exists in the IP cache, sending the history IP to the crawler server, so that the crawler server crawls the web page using the history IP.
Further, the allocating of an available IP address to the crawler server according to the web page address and the switching condition includes:
If the web page to be crawled does not require login, obtaining an uncalled IP from an IP database and sending the uncalled IP to the crawler server, so that the crawler server crawls the web page using the uncalled IP.
Further, the allocating of an available IP address to the crawler server according to the web page address and the switching condition includes:
If the history IP does not exist in the IP cache, obtaining an uncalled IP from the IP database and sending the uncalled IP to the crawler server, so that the crawler server crawls the web page using the uncalled IP.
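The allocation rules above can be summarized in a short sketch. The function name, the dictionary used as the IP cache, and the list used as the pool of uncalled IPs are hypothetical stand-ins for the IP cache and IP database mentioned in the text; how those stores are actually implemented is not stated in the source.

```python
from typing import Dict, List, Optional


def allocate_ip(
    crawler_id: str,
    needs_login: bool,
    ip_cache: Dict[str, str],   # hypothetical: crawler id -> history IP
    uncalled_ips: List[str],    # hypothetical stand-in for the IP database
) -> Optional[str]:
    """Pick an IP for a crawler server following the rules in the text."""
    if needs_login:
        # Pages that require login reuse the IP this crawler called before.
        history_ip = ip_cache.get(crawler_id)
        if history_ip is not None:
            return history_ip
    # No login required, or no history IP cached: take an uncalled IP.
    if not uncalled_ips:
        return None  # assumption: signal that no IP is available
    return uncalled_ips.pop(0)
```

The sketch does not write the newly allocated IP back into the cache, because the source does not say when the cache is updated.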
Further, the method also includes:
Receiving the text information that the crawler server crawled from the web page to be crawled, and storing the text information.
Further, the method also includes:
Performing sentiment recognition on the text in the web page using a sentiment classification model to obtain a sentiment category; the sentiment category includes positive, neutral, and negative.
Further, the method also includes:
Constructing a first neural network, a second neural network, and a third neural network respectively, wherein the first neural network is used to identify positive sentiment, the second neural network is used to identify neutral sentiment, and the third neural network is used to identify negative sentiment;
Training the first neural network, the second neural network, and the third neural network respectively using a training data set, to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network, and a third output result corresponding to the third neural network;
Calculating a first loss function between the first output result and a label result, a second loss function between the second output result and the label result, and a third loss function between the third output result and the label result;
Optimizing the parameters of the first neural network using the first loss function, optimizing the parameters of the second neural network using the second loss function, and optimizing the parameters of the third neural network using the third loss function;
The optimized first neural network, the optimized second neural network, and the optimized third neural network constitute the sentiment classification model.
Further, the calculating of the first loss function between the first output result and the label result, the second loss function between the second output result and the label result, and the third loss function between the third output result and the label result includes:
Obtaining a first output matrix corresponding to the first output result, a second output matrix corresponding to the second output result, a third output matrix corresponding to the third output result, and a label matrix corresponding to the label result;
Calculating a first Euclidean distance between the first output matrix and the label matrix according to a Euclidean distance calculation formula, and obtaining the first loss function from the first Euclidean distance;
Calculating a second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining the second loss function from the second Euclidean distance;
Calculating a third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining the third loss function from the third Euclidean distance.
Further, the Euclidean distance calculation formula includes:
According to $d_{a_{kj}} = \lVert a_k - a_j \rVert_2$, calculating the Euclidean distance between every two row vectors in the first matrix to obtain a first intermediary matrix whose $(k, j)$ entry is $d_{a_{kj}}$, where $d_{a_{kj}}$ is the Euclidean distance between the k-th row vector and the j-th row vector of the first matrix, $a_k$ is the element values of the k-th row of the first matrix, and $a_j$ is the element values of the j-th row of the first matrix;
According to $d_{b_{kj}} = \lVert b_k - b_j \rVert_2$, calculating the Euclidean distance between every two row vectors in the second matrix to obtain a second intermediary matrix whose $(k, j)$ entry is $d_{b_{kj}}$, where $d_{b_{kj}}$ is the Euclidean distance between the k-th row vector and the j-th row vector of the second matrix, $b_k$ is the element values of the k-th row of the second matrix, and $b_j$ is the element values of the j-th row of the second matrix;
According to [formula not preserved in this text], calculating the intermediate Euclidean distance between the second intermediary matrix and the first intermediary matrix;
Wherein the first matrix is the first output matrix, the second output matrix, or the third output matrix, and the second matrix is the label matrix.
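The row-wise distance computation can be sketched in plain Python as follows. The first function follows the definitions in the text (pairwise Euclidean distances between the rows of one matrix). The second function is an assumption: the formula that combines the two intermediary matrices is not preserved in this text, so the sketch uses the Frobenius norm of their difference as one plausible choice. Function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def pairwise_row_distances(m: Matrix) -> List[List[float]]:
    """Intermediary matrix: entry (k, j) is the Euclidean distance
    between row k and row j of `m`."""
    return [[math.dist(row_k, row_j) for row_j in m] for row_k in m]


def intermediate_distance(first: Matrix, second: Matrix) -> float:
    """Distance between the two intermediary matrices.

    Assumption: Frobenius norm of the element-wise difference. Both
    matrices must have the same number of rows.
    """
    da = pairwise_row_distances(first)
    db = pairwise_row_distances(second)
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(da, db) for x, y in zip(ra, rb))
    )
```

Under this assumption, an output matrix whose rows are spaced exactly like the rows of the label matrix gives a distance of zero, and the value grows as the row geometry of the two matrices diverges.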
In a second aspect, an embodiment of the present invention provides a web page text acquisition device based on dynamic IP, including:
A monitoring module, configured to monitor the first IP addresses corresponding to multiple virtual private servers (VPS) that correspond to a crawler server in a network, and to monitor sliding block proxy servers;
A switching module, configured to, if the VPS meets a switching condition, instruct the VPS to dynamically generate a new second IP address and switch the first IP address corresponding to the VPS to the second IP address;
An unavailability marking module, configured to, if the sliding block proxy server meets an unavailable condition, mark the sliding block proxy server as unavailable, so that the crawler server acquires the text of a web page through the second IP address and an available sliding block proxy server;
Wherein the switching condition includes any one or a combination of the following:
The duration of use of the first IP address corresponding to the VPS exceeds a first preset duration;
The number of uses of the first IP address corresponding to the VPS exceeds a first preset number of times;
A prompt indicating overly frequent operation appears when web page content is acquired through the first IP address;
The unavailable condition includes any one or a combination of the following:
The number of uses of the sliding block proxy server exceeds a second preset number of times;
The duration of use of the sliding block proxy server exceeds a second preset duration;
The sliding block proxy server has performed a filter-string operation a third preset number of times during verification.
Further, the switching module is specifically configured to:
According to the web page to be crawled, if the second IP address is an available IP, switch the first IP address corresponding to the VPS to the second IP address.
Further, the switching module is specifically configured to:
If the second IP address has not been used within a preset time period before the current time, switch the first IP address to the second IP address.
Further, the unavailability marking module is specifically configured to:
If the number of available sliding block proxy servers exceeds a preset threshold, mark a sliding block proxy server whose slide operation failed as unavailable.
Further, the monitoring module is also configured to:
Configure an unmonitored time period, during which the VPS and the sliding block proxy server are not monitored.
Further, the device also includes:
A first receiving module, configured to receive an IP call request sent by the crawler server, the IP call request including the address of the web page to be crawled;
A crawling module, configured to allocate an available IP address to the crawler server according to the web page address and the switching condition, so that the crawler server can crawl the web page to be crawled corresponding to the web page address using the available IP address.
Further, the switching module is specifically configured to:
Switch the first IP address corresponding to the VPS to the second IP address based on a WSGI interface.
Further, the crawling module is specifically configured to:
If the web page to be crawled requires login, query an IP cache for a history IP previously called by the crawler server;
If the history IP exists in the IP cache, send the history IP to the crawler server, so that the crawler server crawls the web page using the history IP.
Further, the crawling module is specifically configured to:
If the web page to be crawled does not require login, obtain an uncalled IP from an IP database and send the uncalled IP to the crawler server, so that the crawler server crawls the web page using the uncalled IP.
Further, the crawling module is specifically configured to:
If the history IP does not exist in the IP cache, obtain an uncalled IP from the IP database and send the uncalled IP to the crawler server, so that the crawler server crawls the web page using the uncalled IP.
Further, the device also includes:
A second receiving module, configured to receive the text information that the crawler server crawled from the web page to be crawled, and to store the text information.
Further, the device also includes:
A sentiment recognition module, configured to perform sentiment recognition on the text in the web page using a sentiment classification model to obtain a sentiment category; the sentiment category includes positive, neutral, and negative.
Further, the device also includes a model training module, configured to:
Construct a first neural network, a second neural network, and a third neural network respectively, wherein the first neural network is used to identify positive sentiment, the second neural network is used to identify neutral sentiment, and the third neural network is used to identify negative sentiment;
Train the first neural network, the second neural network, and the third neural network respectively using a training data set, to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network, and a third output result corresponding to the third neural network;
Calculate a first loss function between the first output result and a label result, a second loss function between the second output result and the label result, and a third loss function between the third output result and the label result;
Optimize the parameters of the first neural network using the first loss function, optimize the parameters of the second neural network using the second loss function, and optimize the parameters of the third neural network using the third loss function;
The optimized first neural network, the optimized second neural network, and the optimized third neural network constitute the sentiment classification model.
Further, the model training module is specifically configured to:
Obtain a first output matrix corresponding to the first output result, a second output matrix corresponding to the second output result, a third output matrix corresponding to the third output result, and a label matrix corresponding to the label result;
Calculate a first Euclidean distance between the first output matrix and the label matrix according to a Euclidean distance calculation formula, and obtain the first loss function from the first Euclidean distance;
Calculate a second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtain the second loss function from the second Euclidean distance;
Calculate a third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtain the third loss function from the third Euclidean distance.
Further, the model training module is specifically configured to:
According to $d_{a_{kj}} = \lVert a_k - a_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the first matrix to obtain a first intermediary matrix whose $(k, j)$ entry is $d_{a_{kj}}$, where $d_{a_{kj}}$ is the Euclidean distance between the k-th row vector and the j-th row vector of the first matrix, $a_k$ is the element values of the k-th row of the first matrix, and $a_j$ is the element values of the j-th row of the first matrix;
According to $d_{b_{kj}} = \lVert b_k - b_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the second matrix to obtain a second intermediary matrix whose $(k, j)$ entry is $d_{b_{kj}}$, where $d_{b_{kj}}$ is the Euclidean distance between the k-th row vector and the j-th row vector of the second matrix, $b_k$ is the element values of the k-th row of the second matrix, and $b_j$ is the element values of the j-th row of the second matrix;
According to [formula not preserved in this text], calculate the intermediate Euclidean distance between the second intermediary matrix and the first intermediary matrix;
Wherein the first matrix is the first output matrix, the second output matrix, or the third output matrix, and the second matrix is the label matrix.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a memory, and a bus, wherein
The processor and the memory communicate with each other through the bus;
The memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method steps of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method steps of the first aspect.
The embodiment of the present invention verifies the availability of the second IP address after it is dynamically generated and switches the first IP address to the second IP address only if it is available. This avoids the situation in which a generated second IP address cannot be used to crawl the text of the web page to be crawled, and improves crawling efficiency.
Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the description or be understood by implementing the embodiments of the present invention. The objectives and other advantages of the invention can be realized and obtained through the structure particularly pointed out in the written description, the claims, and the drawings.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a flow diagram of a web page text acquisition method based on dynamic IP provided in an embodiment of the present invention;
Fig. 2 is a network architecture diagram provided in an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a web page text acquisition device based on dynamic IP provided in an embodiment of the present invention;
Fig. 4 shows a structural block diagram of an electronic device that can be applied in the embodiments of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention described and illustrated in the drawings can generally be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings. In addition, in the description of the present invention, the terms "first", "second", and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
Fig. 1 is a flow diagram of a web page text acquisition method based on dynamic IP provided in an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101: Monitor the first IP addresses corresponding to multiple virtual private servers (VPS) that correspond to the crawler server in the network, and monitor the sliding block proxy servers.
In a specific implementation, Fig. 2 is a network architecture diagram provided in an embodiment of the present invention. As shown in Fig. 2, the network includes multiple VPSs, a crawler proxy server, a real-time crawler server, and an offline crawler server. The VPSs dynamically generate IP addresses, and a crawler server crawls the text of the web page to be crawled through the crawler proxy server, using an IP address dynamically generated by a VPS. The device monitors the first IP addresses corresponding to the VPSs in the network and the sliding block proxy servers, to determine whether a first IP address needs to be switched and whether a sliding block proxy server needs to be marked as unavailable. It should be noted that monitoring may be performed in real time or intermittently at preset time intervals.
Step 102: If the VPS meets the switching condition, instruct the VPS to dynamically generate a new second IP address, and switch the first IP address corresponding to the VPS to the second IP address.
In a specific implementation, if a VPS is found during monitoring to meet the switching condition, the first IP address corresponding to this VPS is at risk of being blacklisted by the server of the web page to be crawled. The VPS can therefore be instructed to dynamically generate a new second IP address, and the first IP address is replaced with the second IP address.
The switching condition includes any one or a combination of the following:
The duration of use of the first IP address corresponding to the VPS exceeds the first preset duration; the first preset duration can be set to 2 hours;
The number of uses of the first IP address corresponding to the VPS exceeds the first preset number of times; the first preset number of times may be 6;
A prompt indicating overly frequent operation appears when web page content is acquired through the first IP address. An IP address is added to the blacklist only when the page pops up a "frequent operation" prompt (the blacklisting observed so far has been caused by nationwide blocking); an IP address on the blacklist is not used for 3 days (although it may be tried when very few proxies are available).
When at least one of the above switching conditions is met, the first IP address corresponding to the VPS needs to be switched.
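The switching conditions listed above can be expressed as a small predicate. This is a minimal sketch: the class, field, and constant names are hypothetical, and the thresholds are the example values given in the text (2 hours and 6 uses).

```python
from dataclasses import dataclass

# Example thresholds taken from the description; the names are hypothetical.
FIRST_PRESET_DURATION_S = 2 * 60 * 60  # first preset duration: 2 hours
FIRST_PRESET_TIMES = 6                 # first preset number of times


@dataclass
class VpsIpState:
    """Usage statistics for the first IP address of one VPS."""
    in_use_seconds: float
    use_count: int
    saw_frequent_operation_prompt: bool


def meets_switching_condition(state: VpsIpState) -> bool:
    """True if at least one switching condition holds."""
    return (
        state.in_use_seconds > FIRST_PRESET_DURATION_S
        or state.use_count > FIRST_PRESET_TIMES
        or state.saw_frequent_operation_prompt
    )
```

When the predicate returns `True`, the monitor would instruct the VPS to generate a second IP address; how that instruction is delivered is outside this sketch.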
Step 103: If the sliding block proxy server meets the unavailable condition, mark the sliding block proxy server as unavailable, so that the crawler server acquires the text of the web page through the second IP address and an available sliding block proxy server.
In a specific implementation, if a sliding block proxy server is found during monitoring to meet the unavailable condition, use of that sliding block proxy server needs to stop and it is marked as unavailable. The crawler server then acquires the text in the web page through the crawler proxy server, using the second IP address and an available sliding block proxy server.
The conditions under which a sliding block proxy server is marked as unavailable are as follows:
If the number of consecutive uses of a sliding block proxy server exceeds the second preset number of times, the sliding block proxy server is marked as unavailable, where the second preset number of times may be 6; in addition, if the number of times detail pages have been fetched exceeds 50, it is also marked as unavailable;
If the duration of continuous use of a sliding block proxy server exceeds the second preset duration, the sliding block proxy server needs to be marked as unavailable, where the second preset duration may be 2 hours;
If a sliding block proxy server performs the filter-string operation the third preset number of times during verification, the sliding block proxy server needs to be marked as unavailable, where the third preset number of times may be 2.
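The unavailable conditions can be sketched in the same style. The class, field, and constant names are hypothetical; the thresholds are the example values from the text (6 uses, 2 hours, 2 filter-string operations). Treating the third condition as "reached or exceeded" is an assumption, and the additional rule about 50 detail fetches is left out for brevity.

```python
from dataclasses import dataclass

# Example thresholds from the description; the names are hypothetical.
SECOND_PRESET_TIMES = 6                 # consecutive uses
SECOND_PRESET_DURATION_S = 2 * 60 * 60  # continuous use: 2 hours
THIRD_PRESET_TIMES = 2                  # filter-string operations in verification


@dataclass
class SlidingBlockProxyState:
    consecutive_uses: int
    continuous_use_seconds: float
    filter_string_operations: int
    available: bool = True


def update_availability(proxy: SlidingBlockProxyState) -> bool:
    """Mark the proxy unavailable if any condition holds; return availability."""
    if (
        proxy.consecutive_uses > SECOND_PRESET_TIMES
        or proxy.continuous_use_seconds > SECOND_PRESET_DURATION_S
        # Assumption: the condition triggers once the count is reached.
        or proxy.filter_string_operations >= THIRD_PRESET_TIMES
    ):
        proxy.available = False
    return proxy.available
```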
From the above analysis, the crawler server needs to obtain an available second IP address to crawl the text of the web page to be crawled and, if sliding block verification is required, needs to perform the verification with an available sliding block proxy server, thereby crawling the text of the web page to be crawled.
The embodiment of the present invention uses the switching condition to determine whether the first IP address corresponding to a VPS needs to be switched and, if so, dynamically generates a second IP address; it uses the unavailable condition to determine whether a sliding block proxy server should be marked as unavailable. This ensures that the IP addresses in use are effective and improves the efficiency with which the crawler server crawls text.
On the basis of the above embodiments, described that corresponding first IP address of the vps is switched to described second IP address, including:
According to the webpage to be crawled, if judge to know second IP address as IP available, the vps is corresponding First IP address is switched to second IP address.
In the specific implementation process, since different webpages to be crawled can only allow the IP address of certain class to access, Therefore it may determine that whether second IP address is available IP address according to first 3 sections of the second IP address, if it is available , then the first IP address is replaced using the second IP address.
The embodiment of the present invention by being verified to the availability of the second IP address after the second IP address of dynamic generation, If be available, the first IP address is switched to the second IP address, cannot be used so as to avoid the second IP address of generation The case where text crawls is carried out in webpage to be crawled, improves and crawls efficiency.
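The check on the first 3 segments of the second IP address could look like the following sketch. The function names and the idea of a per-site set of allowed prefixes are assumptions made for illustration; the text does not say how the allowed classes are stored.

```python
import ipaddress


def first_three_segments(ip: str) -> str:
    """Return the first three dotted segments of an IPv4 address, e.g. '203.0.113'."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    return ".".join(str(addr).split(".")[:3])


def is_usable_for_site(ip: str, allowed_prefixes: set[str]) -> bool:
    """Judge availability by comparing the first three segments with the
    prefixes that the target webpage accepts (hypothetical per-site set)."""
    return first_three_segments(ip) in allowed_prefixes
```

A caller would switch to the second IP address only when `is_usable_for_site` returns True, and otherwise ask the vps to generate another address.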
On the basis of the above embodiments, switching the first IP address corresponding to the vps to the second IP address includes:
If it is judged that the second IP address has not been used within a preset time period before the current time, switching the first IP address to the second IP address.
In a specific implementation, before the first IP address is replaced with the second IP address, it is judged whether the second IP address has been used recently. If the second IP address has not been used within the preset time period before the current time, for example, it has not been used within the previous day, the second IP address can be used, and the first IP address is then switched to the second IP address.
The embodiment of the present invention judges whether the generated second IP address has been used within a recent period of time and switches only if it has not, which ensures the safety of crawling text through the second IP address.
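The recent-use check can be sketched as a lookup against a record of when each address was last used. The `last_used` mapping and the function name are hypothetical; the one-day window is the example value from the text.

```python
from datetime import datetime, timedelta


def not_recently_used(ip, last_used, now, window=timedelta(days=1)):
    """Return True if `ip` has no recorded use within `window` before `now`.

    `last_used` maps an IP string to the datetime of its most recent use
    (a hypothetical bookkeeping structure, not described in the text).
    """
    last = last_used.get(ip)
    return last is None or now - last >= window
```

An address that has never been recorded is treated as unused, so a freshly generated second IP address passes the check.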
On the basis of the above embodiments, marking the sliding block proxy server as unavailable if it meets the unavailable condition includes:
If the number of available sliding block proxy servers exceeds a preset threshold, marking a sliding block proxy server whose slide operation failed as unavailable.
In a specific implementation, when the number of available sliding block proxy servers exceeds the preset threshold, the supply of available sliding block proxy servers is sufficient. If a sliding block proxy server fails when performing a slide operation, the sliding block proxy server is at risk of being added to a blacklist, so it is marked as unavailable.
The embodiment of the present invention marks sliding block proxy servers whose slide operation failed as unavailable when the currently available sliding block proxy servers are sufficient, which reduces the risk of sliding block proxy servers being added to a blacklist.
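A minimal sketch of this rule follows, assuming the available sliding block proxy servers are tracked as a set of identifiers; the function and parameter names are hypothetical.

```python
def handle_slide_failure(proxy_id, available, preset_threshold):
    """Retire a proxy whose slide operation failed, but only while the pool
    of available proxies is larger than the preset threshold.

    `available` is a mutable set of proxy identifiers (hypothetical structure).
    Returns True if the proxy was marked unavailable.
    """
    if len(available) > preset_threshold:
        available.discard(proxy_id)
        return True
    # Pool is not sufficient: keep the proxy despite the failure.
    return False
```

Keeping the failed proxy when the pool is small trades a somewhat higher blacklist risk for continued verification capacity.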
On the basis of the above embodiments, the method further includes:
Configuring a non-monitoring time period, during which the vps and the sliding block proxy server are not monitored.
In a specific implementation, since there are no crawling tasks between 3:00 a.m. and 6:00 a.m. each day, the first IP address corresponding to the vps does not need to be switched during that period. The advantage of doing so is that the power consumption of the server can be reduced.
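The non-monitoring time period can be sketched as a time-of-day check that the monitoring loop consults before doing any work. The function name is hypothetical, the defaults are the 3:00 to 6:00 example from the text, and the wrap-around branch is an added convenience for windows that cross midnight.

```python
from datetime import time


def in_non_monitoring_period(now, start=time(3, 0), end=time(6, 0)):
    """Return True if the time of day `now` falls in [start, end)."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 23:00 to 01:00.
    return now >= start or now < end
```

A monitoring loop would skip its checks of the vps and the sliding block proxy servers whenever this returns True.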
On the basis of the above embodiments, the method further includes:
Receiving an IP call request sent by the crawler server, the IP call request including the address of the webpage to be crawled;
Allocating an available IP address to the crawler server according to the webpage address and the switching condition, so that the crawler server can crawl the webpage to be crawled corresponding to the webpage address using the available IP address.
In a specific implementation, the device receives the IP call request sent by the crawler server. The IP call request includes the address of the webpage to be crawled, where the webpage address is a URL. An available IP address is allocated to the crawler server according to the webpage address and the switching condition, so that the crawler server can use the available IP address to crawl the webpage to be crawled corresponding to the webpage address.
The embodiment of the present invention uses the switching condition to judge whether the first IP address corresponding to the vps needs to be switched and, if so, dynamically generates a second IP address; it also uses the unavailable condition to judge whether a sliding block proxy server should be marked as unavailable. This ensures that the IP addresses in use are valid, which improves the efficiency with which the crawler server crawls text.
On the basis of the above embodiments, switching the first IP address corresponding to the vps to the second IP address includes:
Switching the first IP address corresponding to the vps to the second IP address based on a WSGI interface.
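The text does not describe what the WSGI interface looks like, so the following is only one possible shape: a small WSGI application that exposes a switch endpoint. The `/switch` path, the `vps` query parameter and the `switch_ip` callback are all hypothetical; only the WSGI calling convention (`environ`, `start_response`) is standard.

```python
from typing import Callable, Iterable
from urllib.parse import parse_qs


def make_switch_app(switch_ip: Callable[[str], str]):
    """Build a WSGI app. `switch_ip` is a hypothetical callback that switches
    the given vps to a newly generated IP address and returns that address."""

    def application(environ, start_response) -> Iterable[bytes]:
        query = parse_qs(environ.get("QUERY_STRING", ""))
        vps_ids = query.get("vps", [])
        if environ.get("PATH_INFO") != "/switch" or not vps_ids:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        new_ip = switch_ip(vps_ids[0])
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [new_ip.encode("ascii")]

    return application
```

Any WSGI server, such as the standard library's `wsgiref.simple_server`, could host the returned `application` object.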
On the basis of the above embodiments, allocating an available IP address to the crawler server according to the webpage address and the switching condition includes:
If the webpage to be crawled requires login, querying the IP cache for a history IP previously called by the crawler server;
If the history IP exists in the IP cache, sending the history IP to the crawler server, so that the crawler server crawls the webpage using the history IP.
In a specific implementation, when the address of the webpage to be crawled is obtained, it is first judged whether the website to be crawled requires login. If login is required, the IP cache is queried for a history IP previously called by the crawler server. When the website to be crawled requires login, the browser generates a Cookie after login, and requests to the website that carry the Cookie parameter are usually not rejected or blocked by anti-crawling measures, so the probability of IP-based anti-crawling is relatively low. If the IP cache contains an IP address previously used by the crawler server, that history IP address continues to be used for crawling. The advantage of doing so is that the utilization rate of IP addresses can be improved.
On the basis of the above embodiments, if the webpage to be crawled does not require login, an uncalled IP is obtained from the IP database and sent to the crawler server, so that the crawler server crawls the webpage using the uncalled IP. It should be noted that this uncalled IP is an available IP address corresponding to a vps.
On the basis of the above embodiments, if the IP cache contains no history IP previously used by the crawler server, an uncalled IP is obtained from the IP database and sent to the crawler server, so that the crawler server crawls the webpage using the uncalled IP.
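The allocation rules above (history IP for login pages, otherwise an uncalled IP) can be sketched as follows. The IP cache is modeled as a dict and the IP database as a list, both hypothetical stand-ins; recording the newly issued IP in the cache is an assumption, since the text does not say when the cache is written.

```python
from typing import Optional


def allocate_ip(crawler_id: str, requires_login: bool,
                ip_cache: dict, uncalled_pool: list) -> Optional[str]:
    """Pick an IP address for one IP call request.

    ip_cache:      crawler id -> history IP previously called by that crawler
    uncalled_pool: uncalled IPs from the IP database (available vps addresses)
    """
    if requires_login:
        history_ip = ip_cache.get(crawler_id)
        if history_ip is not None:
            return history_ip          # reuse the history IP
    if not uncalled_pool:
        return None                    # nothing left to hand out
    ip = uncalled_pool.pop(0)
    ip_cache[crawler_id] = ip          # assumption: remember it as history
    return ip
```

Reusing the history IP for login pages reflects the reasoning in the text: requests carrying a login Cookie are less likely to trigger IP-based anti-crawling.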
On the basis of the above embodiments, the method further includes:
Receiving the text information crawled by the crawler server from the webpage to be crawled, and storing the text information.
In a specific implementation, after the crawler server crawls the text information of the webpage through the second IP address, it sends the text information to the device, and the device stores the text information after receiving it.
In addition, the sentiment of the crawled text information can also be identified. Before identification, a sentiment classification model first needs to be constructed and trained. The specific training process is as follows:
First, a first neural network, a second neural network and a third neural network are constructed, where the first neural network is used to identify positive sentiment, the second neural network is used to identify neutral sentiment, and the third neural network is used to identify negative sentiment. A preset number of words from the training data set are input into the first, second and third neural networks respectively, yielding a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network. The first, second and third output results are converted into matrices to obtain the corresponding first, second and third output matrices. The sentiment of the words input into the three neural networks is then labeled by category and a corresponding label matrix is constructed, after which the first Euclidean distance between the first output matrix and the label matrix, the second Euclidean distance between the second output matrix and the label matrix, and the third Euclidean distance between the third output matrix and the label matrix are calculated. A first loss function is constructed from the first Euclidean distance and used to optimize the parameters of the first neural network; a second loss function is constructed from the second Euclidean distance and used to optimize the parameters of the second neural network; and a third loss function is constructed from the third Euclidean distance and used to optimize the parameters of the third neural network. After the first, second and third neural networks have been trained, they are merged into the sentiment classification model.
When the sentiment classification model is used, each neural network outputs a sentiment category score for the fields in the webpage text, a corresponding weight is set for each neural network, and the sentiment category of the entire webpage text is calculated according to the weights. It should be noted that the sentiment categories may include positive, neutral and negative.
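The text does not give the exact rule for combining the three networks' scores, so the sketch below assumes one simple option: sum each network's per-field scores, multiply by that network's weight, and pick the category with the largest weighted total. The function name and data layout are hypothetical.

```python
def classify_page(field_scores, weights):
    """Combine per-field sentiment scores into one category for the page.

    field_scores: list of dicts, one per text field, mapping a category
                  ('positive', 'neutral', 'negative') to that network's score
    weights:      dict mapping each category to the weight of its network
    """
    totals = {label: 0.0 for label in weights}
    for scores in field_scores:
        for label, score in scores.items():
            totals[label] += weights[label] * score
    # Assumption: the page-level category is the one with the highest total.
    return max(totals, key=totals.get)
```

With equal weights this reduces to a plain vote by summed score; unequal weights let one network's judgment count for more.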
It should be noted that the Euclidean distance is calculated as follows:
According to $d^{a}_{kj} = \lVert a_k - a_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the first matrix to obtain the first intermediate matrix $D_a = (d^{a}_{kj})$, where $d^{a}_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, $a_k$ is the k-th row of element values in the first matrix, and $a_j$ is the j-th row of element values in the first matrix;
According to $d^{b}_{kj} = \lVert b_k - b_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the second matrix to obtain the second intermediate matrix $D_b = (d^{b}_{kj})$, where $d^{b}_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, $b_k$ is the k-th row of element values in the second matrix, and $b_j$ is the j-th row of element values in the second matrix;
Calculate the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix;
Where the first matrix is the first output matrix, the second output matrix or the third output matrix, and the second matrix is the label matrix.
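The distance computation can be sketched in plain Python. The pairwise row distances follow directly from the definitions above; the final step is an assumption, since the formula for the distance between the two intermediate matrices is not recoverable from the text, and the sketch uses the square root of the summed squared element-wise differences. Function names are hypothetical, and both matrices are assumed to have the same number of rows.

```python
import math


def pairwise_row_distances(matrix):
    """Intermediate matrix: entry [k][j] is the Euclidean distance between
    row k and row j of `matrix` (a list of equal-length rows)."""
    return [[math.dist(row_k, row_j) for row_j in matrix] for row_k in matrix]


def intermediate_distance(output_matrix, label_matrix):
    """Distance between the two intermediate matrices.

    Assumption: element-wise Euclidean distance over all entries.
    """
    d_a = pairwise_row_distances(output_matrix)
    d_b = pairwise_row_distances(label_matrix)
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(d_a, d_b)
                         for x, y in zip(row_a, row_b)))
```

A value of 0 means the output rows have exactly the same pairwise geometry as the label rows, which is why the distance can serve as a loss to be minimized.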
Fig. 3 is a schematic structural diagram of a dynamic-IP-based webpage text acquisition device provided by an embodiment of the present invention. As shown in Fig. 3, the device includes a monitoring module 301, a switching module 302 and an unavailable marking module 303, where:
The monitoring module 301 is configured to monitor the first IP addresses corresponding to the multiple virtual private servers (vps) that correspond to the crawler server in the network, and to monitor the sliding block proxy servers;
The switching module 302 is configured to, if the vps meets the switching condition, instruct the vps to dynamically generate a new second IP address and switch the first IP address corresponding to the vps to the second IP address;
The unavailable marking module 303 is configured to, if the sliding block proxy server meets the unavailable condition, mark the sliding block proxy server as unavailable, so that the crawler server obtains the text of the webpage through the second IP address and an available sliding block proxy server;
Wherein, the switching condition includes any one of the following or a combination thereof:
The use duration of the first IP address corresponding to the vps exceeds the first preset duration;
The number of uses of the first IP address corresponding to the vps exceeds the first preset number of times;
A prompt indicating overly frequent operation appears when webpage content is acquired through the first IP address;
The unavailable condition includes any one of the following or a combination thereof:
The number of uses of the sliding block proxy server exceeds the second preset number of times;
The use duration of the sliding block proxy server exceeds the second preset duration;
The sliding block proxy server has performed the string-filtering operation the third preset number of times during verification.
On the basis of the above embodiments, the switching module is specifically configured to:
According to the webpage to be crawled, if the second IP address is an available IP, switch the first IP address corresponding to the vps to the second IP address.
On the basis of the above embodiments, the switching module is specifically configured to:
If the second IP address has not been used within a preset time period before the current time, switch the first IP address to the second IP address.
On the basis of the above embodiments, the unavailable marking module is specifically configured to:
If the number of available sliding block proxy servers exceeds a preset threshold, mark a sliding block proxy server whose slide operation failed as unavailable.
On the basis of the above embodiments, the monitoring module is further configured to:
Configure a non-monitoring time period, during which the vps and the sliding block proxy server are not monitored.
On the basis of the above embodiments, the device further includes:
A first receiving module, configured to receive the IP call request sent by the crawler server, the IP call request including the address of the webpage to be crawled;
A crawling module, configured to allocate an available IP address to the crawler server according to the webpage address and the switching condition, so that the crawler server can crawl the webpage to be crawled corresponding to the webpage address using the available IP address.
On the basis of the above embodiments, the switching module is specifically configured to:
Switch the first IP address corresponding to the vps to the second IP address based on a WSGI interface.
On the basis of the above embodiments, the crawling module is specifically configured to:
If the webpage to be crawled requires login, query the IP cache for a history IP previously called by the crawler server;
If the history IP exists in the IP cache, send the history IP to the crawler server, so that the crawler server crawls the webpage using the history IP.
On the basis of the above embodiments, the crawling module is specifically configured to:
If the webpage to be crawled does not require login, obtain an uncalled IP from the IP database and send the uncalled IP to the crawler server, so that the crawler server crawls the webpage using the uncalled IP.
On the basis of the above embodiments, the crawling module is specifically configured to:
If the history IP does not exist in the IP cache, obtain an uncalled IP from the IP database and send the uncalled IP to the crawler server, so that the crawler server crawls the webpage using the uncalled IP.
On the basis of the above embodiments, the device further includes:
A second receiving module, configured to receive the text information crawled by the crawler server from the webpage to be crawled, and to store the text information.
On the basis of the above embodiments, the device further includes:
A sentiment recognition module, configured to perform sentiment recognition on the text in the webpage using the sentiment classification model to obtain a sentiment category; the sentiment categories include positive, neutral and negative.
On the basis of the above embodiments, the device further includes a model training module, configured to:
Construct a first neural network, a second neural network and a third neural network, where the first neural network is used to identify positive sentiment, the second neural network is used to identify neutral sentiment, and the third neural network is used to identify negative sentiment;
Train the first neural network, the second neural network and the third neural network respectively using the training data set, to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
Calculate a first loss function between the first output result and the label result, a second loss function between the second output result and the label result, and a third loss function between the third output result and the label result;
Optimize the parameters of the first neural network using the first loss function, optimize the parameters of the second neural network using the second loss function, and optimize the parameters of the third neural network using the third loss function;
The optimized first neural network, the optimized second neural network and the optimized third neural network constitute the sentiment classification model.
On the basis of the above embodiments, the model training module is specifically configured to:
Obtain the first output matrix corresponding to the first output result, the second output matrix corresponding to the second output result, the third output matrix corresponding to the third output result, and the label matrix corresponding to the label result;
Calculate the first Euclidean distance between the first output matrix and the label matrix according to the Euclidean distance calculation formula, and obtain the first loss function from the first Euclidean distance;
Calculate the second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtain the second loss function from the second Euclidean distance;
Calculate the third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtain the third loss function from the third Euclidean distance.
On the basis of the above embodiments, the model training module is specifically configured to:
According to $d^{a}_{kj} = \lVert a_k - a_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the first matrix to obtain the first intermediate matrix $D_a = (d^{a}_{kj})$, where $d^{a}_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, $a_k$ is the k-th row of element values in the first matrix, and $a_j$ is the j-th row of element values in the first matrix;
According to $d^{b}_{kj} = \lVert b_k - b_j \rVert_2$, calculate the Euclidean distance between every two row vectors in the second matrix to obtain the second intermediate matrix $D_b = (d^{b}_{kj})$, where $d^{b}_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, $b_k$ is the k-th row of element values in the second matrix, and $b_j$ is the j-th row of element values in the second matrix;
Calculate the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix;
Where the first matrix is the first output matrix, the second output matrix or the third output matrix, and the second matrix is the label matrix.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method and is not repeated here.
In conclusion, the embodiment of the present invention verifies the availability of the second IP address after it is dynamically generated and switches the first IP address to the second IP address only if it is available. This avoids the case where the generated second IP address cannot be used to crawl text from the webpage to be crawled, and improves crawling efficiency.
This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is able to perform the methods provided by the above method embodiments.
This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, and the computer instructions cause a computer to perform the methods provided by the above method embodiments.
Referring to Fig. 4, Fig. 4 is a structural block diagram of an electronic device provided by an embodiment of the present invention. The electronic device may include a webpage text acquisition device 401, a memory 402, a storage controller 403, a processor 404, a peripheral interface 405, an input/output unit 406, an audio unit 407 and a display unit 408.
The memory 402, storage controller 403, processor 404, peripheral interface 405, input/output unit 406, audio unit 407 and display unit 408 are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The webpage text acquisition device 401 includes at least one software function module that can be stored in the memory 402 in the form of software or firmware, or built into the operating system (OS) of the webpage text acquisition device 401. The processor 404 is configured to execute executable modules stored in the memory 402, such as the software function modules or computer programs included in the webpage text acquisition device 401.
The memory 402 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on. The memory 402 is used to store a program, and the processor 404 executes the program after receiving an execution instruction. The method performed by the server defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 404 or implemented by the processor 404.
The processor 404 may be an integrated circuit chip with signal processing capability. The processor 404 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and so on; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor 404 may be any conventional processor.
The peripheral interface 405 couples various input/output devices to the processor 404 and the memory 402. In some embodiments, the peripheral interface 405, the processor 404 and the storage controller 403 may be implemented in a single chip. In other examples, they may each be implemented by a separate chip.
The input/output unit 406 is used to let the user input data, enabling interaction between the user and the server (or local terminal). The input/output unit 406 may be, but is not limited to, a mouse, a keyboard, and the like.
The audio unit 407 provides an audio interface to the user and may include one or more microphones, one or more speakers and audio circuitry.
The display unit 408 provides an interactive interface (such as a user interface) between the electronic device and the user, or is used to display image data for the user's reference. In this embodiment, the display unit 408 may be a liquid crystal display or a touch display. If it is a touch display, it may be a capacitive or resistive touch screen that supports single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations generated simultaneously at one or more positions on the touch display and hand the sensed touch operations to the processor 404 for calculation and processing.
It can be understood that the structure shown in Fig. 4 is only illustrative; the electronic device may include more or fewer components than shown in Fig. 4, or have a configuration different from that shown in Fig. 4. Each component shown in Fig. 4 may be implemented in hardware, software or a combination thereof.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architectures, functions and operations of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software function modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

Claims (10)

1. a kind of Web page text acquisition methods based on dynamic IP, which is characterized in that including:
Corresponding first IP address of multiple Virtual Private Server vps corresponding to crawler server in network and sliding block generation Reason server is monitored;
If the vps meets switching condition, the vps one the second IP address of dynamic generation again is indicated, by described vps pairs First IP address answered is switched to second IP address;
If the sliding block proxy server meets unavailable condition, by the sliding block proxy server labeled as unavailable, with Obtain the crawler server to the text of webpage by second IP address and available sliding block proxy server;
Wherein, the switching condition includes following any one or combinations thereof:
The use duration of corresponding first IP address of the vps is more than the first preset duration;
The access times of corresponding first IP address of the vps are more than the first preset times;
When carrying out acquisition web page contents by first IP address, there is the prompt information of frequent operation;
The unavailable condition includes following any one or combinations thereof:
The access times of the sliding block proxy server are more than the second preset times;
The use duration of the sliding block proxy server is more than the second preset duration;
The sliding block proxy server has carried out the filtering characters string operation of third preset times when being verified.
2. The method according to claim 1, characterized in that switching the first IP address corresponding to the VPS to the second IP address comprises:
if it is determined, according to information about the web page to be crawled, that the second IP address is an available IP, switching the first IP address corresponding to the VPS to the second IP address.
3. The method according to claim 1, characterized in that switching the first IP address corresponding to the VPS to the second IP address comprises:
if it is determined that the second IP address has not been used within a preset time period before the current time, switching the first IP address to the second IP address.
4. The method according to claim 1, characterized in that, if the sliding block proxy server meets the unavailable condition, marking the sliding block proxy server as unavailable comprises:
if the number of available sliding block proxy servers exceeds a preset threshold, marking a sliding block proxy server that has failed to slide as unavailable.
5. The method according to claim 1, characterized in that the method further comprises:
configuring a non-monitoring time period, during which the VPSs and the sliding block proxy servers are not monitored.
6. The method according to claim 1, characterized in that the method further comprises:
receiving an IP call request sent by the crawler server, the IP call request including the address of a web page to be crawled;
allocating an available IP address to the crawler server according to the web page address and the switching condition, so that the crawler server can crawl the web page to be crawled corresponding to the web page address using the available IP address.
7. The method according to claim 1, characterized in that switching the first IP address corresponding to the VPS to the second IP address comprises:
switching the first IP address corresponding to the VPS to the second IP address based on a WSGI interface.
8. A web page text acquisition device based on dynamic IP, characterized by comprising:
a monitoring module, configured to monitor the first IP addresses corresponding to multiple virtual private servers (VPSs) that correspond to crawler servers in a network, and to monitor sliding block proxy servers;
a switching module, configured to, if a VPS meets a switching condition, instruct the VPS to dynamically generate a new second IP address, and switch the first IP address corresponding to the VPS to the second IP address;
an unavailable marking module, configured to, if a sliding block proxy server meets an unavailable condition, mark the sliding block proxy server as unavailable, so that the crawler server acquires the text of a web page through the second IP address and an available sliding block proxy server;
wherein the switching condition includes any one or a combination of the following:
the duration for which the first IP address corresponding to the VPS has been in use exceeds a first preset duration;
the number of times the first IP address corresponding to the VPS has been used exceeds a first preset number;
a prompt indicating overly frequent operation appears when web page content is acquired through the first IP address;
and the unavailable condition includes any one or a combination of the following:
the number of times the sliding block proxy server has been used exceeds a second preset number;
the duration for which the sliding block proxy server has been in use exceeds a second preset duration;
the sliding block proxy server has performed a third preset number of string-filtering operations during verification.
9. An electronic device, characterized by comprising: a processor, a memory, and a bus, wherein
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is able to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the method according to any one of claims 1 to 7.
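
For illustration only, the switching condition and the unavailable condition recited in claims 1 and 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: every class, field, and function name, every threshold value, and the `new_ip` callback (a stand-in for instructing a VPS to regenerate its IP address) are hypothetical, and the string-filtering trigger is assumed to fire once the preset count is reached.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Limits:
    # Hypothetical values; the claims only describe these as "preset".
    ip_max_age_s: float = 600.0      # first preset duration
    ip_max_uses: int = 100           # first preset number
    proxy_max_uses: int = 50         # second preset number
    proxy_max_age_s: float = 3600.0  # second preset duration
    proxy_max_filter_ops: int = 3    # third preset number


@dataclass
class Vps:
    ip: str                        # the current ("first") IP address
    ip_since: float                # time at which this IP came into use
    ip_uses: int = 0               # how many times this IP has been used
    frequent_prompt: bool = False  # a "too frequent operation" prompt was seen


@dataclass
class SliderProxy:
    name: str
    since: float                   # time at which the proxy came into use
    uses: int = 0
    filter_ops: int = 0            # string-filtering operations during verification
    available: bool = True


def should_switch(vps: Vps, now: float, lim: Limits) -> bool:
    """Switching condition: true if any one of the three triggers holds."""
    return (
        now - vps.ip_since > lim.ip_max_age_s
        or vps.ip_uses > lim.ip_max_uses
        or vps.frequent_prompt
    )


def is_unavailable(proxy: SliderProxy, now: float, lim: Limits) -> bool:
    """Unavailable condition: true if any one of the three triggers holds."""
    return (
        proxy.uses > lim.proxy_max_uses
        or now - proxy.since > lim.proxy_max_age_s
        or proxy.filter_ops >= lim.proxy_max_filter_ops  # assumed ">=" semantics
    )


def monitor_once(
    vpses: List[Vps],
    proxies: List[SliderProxy],
    now: float,
    lim: Limits,
    new_ip: Callable[[Vps], str],
) -> List[SliderProxy]:
    """Run one monitoring pass and return the proxies that remain available."""
    for vps in vpses:
        if should_switch(vps, now, lim):
            vps.ip = new_ip(vps)  # the second IP address replaces the first
            vps.ip_since, vps.ip_uses, vps.frequent_prompt = now, 0, False
    for proxy in proxies:
        if is_unavailable(proxy, now, lim):
            proxy.available = False
    return [p for p in proxies if p.available]


# Usage: the first VPS has exceeded its use count; proxy "a" has hit the filter limit.
vpses = [Vps(ip="203.0.113.1", ip_since=0.0, ip_uses=101),
         Vps(ip="203.0.113.2", ip_since=0.0)]
proxies = [SliderProxy("a", since=0.0, filter_ops=3), SliderProxy("b", since=0.0)]
usable = monitor_once(vpses, proxies, now=60.0, lim=Limits(),
                      new_ip=lambda v: "203.0.113.10")
print([v.ip for v in vpses], [p.name for p in usable])
```

In this sketch a crawler server would then fetch pages only through a VPS's current IP address and a proxy from the returned list. The IP addresses shown come from a documentation-only range and carry no meaning.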
CN201810760579.XA 2018-07-11 2018-07-11 Webpage text acquisition method and device based on dynamic IP Active CN108900623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810760579.XA CN108900623B (en) 2018-07-11 2018-07-11 Webpage text acquisition method and device based on dynamic IP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810760579.XA CN108900623B (en) 2018-07-11 2018-07-11 Webpage text acquisition method and device based on dynamic IP

Publications (2)

Publication Number Publication Date
CN108900623A true CN108900623A (en) 2018-11-27
CN108900623B CN108900623B (en) 2022-02-01

Family

ID=64349246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810760579.XA Active CN108900623B (en) 2018-07-11 2018-07-11 Webpage text acquisition method and device based on dynamic IP

Country Status (1)

Country Link
CN (1) CN108900623B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851690A (en) * 2019-11-14 2020-02-28 北京计算机技术及应用研究所 Method and device for collecting network information of monitoring website
KR102403941B1 (en) * 2022-02-11 2022-05-31 (주)에스투더블유 Method to crawl through the website by bypassing bot detection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102469132A (en) * 2010-11-15 2012-05-23 北大方正集团有限公司 Method and system for grabbing web pages from servers with different IPs (Internet Protocols) in website
CN105681478A (en) * 2016-04-19 2016-06-15 北京高地信息技术有限公司 Method and device for scheduling network resources to improve network spider crawling efficiency
WO2017124024A1 (en) * 2016-01-14 2017-07-20 Sumo Logic Single click delta analysis
CN107105071A (en) * 2017-05-05 2017-08-29 北京京东金融科技控股有限公司 IP call methods and device, storage medium, electronic equipment
CN107395782A (en) * 2017-07-19 2017-11-24 北京理工大学 A kind of IP limitation controlled source information extraction methods based on agent pool
CN107704497A (en) * 2017-08-25 2018-02-16 上海壹账通金融科技有限公司 Web data crawling method, device, web data crawl platform and storage medium

Also Published As

Publication number Publication date
CN108900623B (en) 2022-02-01

Similar Documents

Publication Publication Date Title
G. Martín et al. A survey for user behavior analysis based on machine learning techniques: current models and applications
CN112148987B (en) Message pushing method based on target object activity and related equipment
CN106155298B (en) The acquisition method and device of man-machine recognition methods and device, behavioural characteristic data
CN107730389A (en) Electronic installation, insurance products recommend method and computer-readable recording medium
US20170278115A1 (en) Purchasing behavior analysis apparatus and non-transitory computer readable medium
CN107895011B (en) Session information processing method, system, storage medium and electronic equipment
CN103984673A (en) Automatic detection of fraudulent ratings/comments related to an application store
US20240005012A1 (en) Privacy score
CN110020002A (en) Querying method, device, equipment and the computer storage medium of event handling scheme
DE112016002366T5 (en) PREDICTING USER REQUIREMENTS FOR A BACKGROUND WITH A SPECIFIC CONTEXT
CN109600336A (en) Store equipment, identifying code application method and device
CN107193974A (en) Localized information based on artificial intelligence determines method and apparatus
CN107944293B (en) Fictitious assets guard method, system, equipment and storage medium
CN109858919A (en) Determination method and device, online ordering method and the device of abnormal account
CN110191183A (en) Accurate intelligent method for pushing, system, device and computer readable storage medium
CN110363653A (en) Financial service request response method, device and electronic equipment
CN111400465A (en) Generation method and device of customer service robot, electronic equipment and medium
CN107483443A (en) advertisement information processing method, client, storage medium and electronic equipment
CN110008980A (en) Identification model generation method, recognition methods, device, equipment and storage medium
CN108900623A (en) A kind of Web page text acquisition methods and device based on dynamic IP
CN108737138A (en) Service providing method and service platform
CN108259312A (en) Information issuing method, device and server
CN109408679A (en) Method, apparatus, electronic equipment and the storage medium of intelligent management application program
CN114266022A (en) Anti-fraud method and device based on data mining and electronic equipment
CN111199454B (en) Real-time user conversion evaluation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant