CN111488507A - Network agent optimization method - Google Patents

Publication number
CN111488507A
Authority
CN
China
Prior art keywords
connection
agents
agent
average
proxy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010275111.9A
Other languages
Chinese (zh)
Other versions
CN111488507B (en)
Inventor
孙利军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Film & Television Data Evaluation Center Co ltd
Original Assignee
Xi'an Film & Television Data Evaluation Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Film & Television Data Evaluation Center Co ltd filed Critical Xi'an Film & Television Data Evaluation Center Co ltd
Priority to CN202010275111.9A priority Critical patent/CN111488507B/en
Publication of CN111488507A publication Critical patent/CN111488507A/en
Application granted granted Critical
Publication of CN111488507B publication Critical patent/CN111488507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to the technical field of computers and discloses a network proxy optimization method comprising the following steps: acquiring a proxy pool, initializing the connection success rate of every proxy in the pool to 1, and using the connection success rate as the proxy's weight; using the Wilson interval method to calculate, at a given confidence level, the lower limit of the confidence interval of each proxy's connection success rate, and taking that lower limit as the adjusted connection success rate; computing the average connection success rate S and the average number of connection attempts N over all proxies; smoothing the adjusted connection success rate of every proxy whose number of connection attempts is below the average N; using the adjusted or smoothed connection success rate as the weight of each proxy; and repeating steps S2-S3 while tracking the connection success rates of all proxies. The method is effective at efficiently selecting available proxies.

Description

Network agent optimization method
Technical Field
The invention relates to the technical field of computers, in particular to a network agent optimization method.
Background
When a web crawler is used to capture massive public resources on the internet, a network proxy service is commonly used to improve capture efficiency and extend the crawler's life cycle. Such a service is generally provided by an independent third party and comes in two main modes: a random access mode and a proxy pool mode. The former provides a high-quality proxy with a short life cycle, and a new proxy is acquired each time one is used. When the latter is chosen for reasons of cost and security, a proxy pool consisting of a large number of proxies is obtained in the initial state, and a specific algorithm selects proxies from the pool at use time. Common selection algorithms include round-robin (traversing the proxy pool to find an available proxy), random (selection by pseudo-random number), and weighted random (assigning each proxy a weight and then selecting at random). These algorithms are inefficient when proxy connections are unstable or other conditions make proxies unavailable, and they can neither differentiate between proxies in a targeted way nor treat new, dynamically added proxies fairly.
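As a point of reference, the three baseline strategies named above can be sketched in a few lines of Python. The proxy names and weights are hypothetical placeholders, and the fixed seed is there only to make the sketch repeatable.

```python
import itertools
import random

proxies = ["proxy-a", "proxy-b", "proxy-c"]  # hypothetical proxy identifiers
weights = [0.9, 0.5, 0.1]                    # hypothetical per-proxy weights

# Round-robin: walk the pool in a fixed, repeating order.
round_robin = itertools.cycle(proxies)
first_three = [next(round_robin) for _ in range(3)]

# Random: every proxy is equally likely to be chosen.
rng = random.Random(0)  # seeded only so the sketch is repeatable
uniform_pick = rng.choice(proxies)

# Weighted random: selection probability is proportional to the weight.
weighted_pick = rng.choices(proxies, weights=weights, k=1)[0]
```

None of the three looks at how reliable a weight is, which is the gap the method described below addresses.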
Disclosure of Invention
The invention provides a network agent optimization method, which can efficiently select available agents.
The invention provides a network proxy optimization method, which comprises the following steps:
S1, acquiring a proxy pool, initializing the connection success rate of every proxy in the pool to 1, and using the connection success rate as the proxy's weight;
S2, performing weighted random sampling over all proxies according to their weights, using the sampled proxy to access a network target and attempting to connect to it; if the connection fails, performing weighted random sampling again, until the connection succeeds or the set number of failures is reached, in which case the connection is skipped; after a successful connection, updating the proxy's connection success rate s and recording its number of connection attempts n;
S3, adjusting the connection success rates of all proxies:
S31, using the Wilson interval method to calculate, at a given confidence level, the lower limit of the confidence interval of each proxy's connection success rate, and taking that lower limit as the adjusted connection success rate;
S32, computing the average connection success rate S and the average number of connection attempts N over all proxies;
S33, smoothing the adjusted connection success rate of every proxy whose number of connection attempts is below the average N;
S34, using the adjusted or smoothed connection success rate as the weight of each proxy;
S4, repeating steps S2-S3 and tracking the connection success rates of all proxies.
In step S2, "all proxies" includes both the original proxies present when the proxy pool was acquired and new proxies dynamically added to the pool afterwards; the connection success rate of a new proxy is initialized to 1.
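Steps S1 and S2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `ProxyStats`, `pick_and_connect`, and the `try_connect` callback are hypothetical names, and the sketch assumes that failed attempts are also counted in n so that a success rate can be derived from the counters (the text only states that the rate is updated after a successful connection).

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProxyStats:
    successes: int = 0
    attempts: int = 0
    weight: float = 1.0  # S1: every proxy starts with success rate / weight 1

    @property
    def success_rate(self) -> float:
        # Before any attempt the proxy keeps its initial rate of 1.
        return self.successes / self.attempts if self.attempts else 1.0


def pick_and_connect(
    pool: Dict[str, ProxyStats],
    try_connect: Callable[[str], bool],  # hypothetical: True if the target was reached
    max_failures: int = 3,
    rng=random,
) -> Optional[str]:
    """S2: weighted random sampling with re-sampling on failure.

    Returns the name of the proxy that connected, or None when the
    configured number of failures is reached and the connection is skipped.
    """
    names = list(pool)
    for _ in range(max_failures):
        weights = [pool[name].weight for name in names]
        name = rng.choices(names, weights=weights, k=1)[0]
        stats = pool[name]
        stats.attempts += 1  # assumption: failures count as attempts too
        if try_connect(name):
            stats.successes += 1
            return name
    return None
```

Note that `random.choices` needs at least one positive weight, so a real pool would have to handle the case where every weight has dropped to zero.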
In step S31, the lower limit of the confidence interval is calculated by the Wilson interval method as:
$$ s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}} \tag{1} $$
In formula (1), s is the connection success rate of a given proxy, n is its number of connection attempts, z_{1-α/2} is the z statistic corresponding to the chosen confidence level, and α is the significance level (the confidence level is 1-α).
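The Wilson limits of step S31 can be computed with the standard Wilson score interval, sketched below. The function name `wilson_interval` is hypothetical, and returning the raw rate for a proxy with no attempts is an assumption made so that untried proxies keep their initial rate of 1. With 40 successes in 50 attempts at the 95% level, this sketch reproduces the limits quoted in the detailed description (0.670 and 0.888).

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(s: float, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Return (lower, upper) Wilson limits for success rate s over n attempts."""
    if n == 0:
        return s, s  # assumption: no data yet, keep the initial rate unchanged
    # z statistic z_{1-alpha/2}, where alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = s + z * z / (2 * n)
    margin = z * math.sqrt(s * (1 - s) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom


lower, upper = wilson_interval(0.80, 50)  # 40 successes out of 50 attempts
```

The lower limit shrinks a raw rate more strongly when n is small, which is why it is a more cautious weight than the raw rate itself.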
In step S33, the smoothing is performed by Bayesian averaging; the Bayesian average of the connection success rate is:
$$ s'' = \frac{N S + n s'}{N + n} \tag{2} $$
In formula (2), s' is the adjusted connection success rate of a given proxy, S is the average connection success rate of all proxies, n is the proxy's number of connection attempts, N is the average number of connection attempts, and s'' denotes the smoothed connection success rate.
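The Bayesian-average smoothing of step S33 is a weighted mean of the proxy's own adjusted rate and the pool average, sketched below under the assumption that the pool average S acts as a prior worth N pseudo-attempts. The function names are hypothetical. The sketch agrees with the worked figures in the detailed description: an adjusted rate of 0.670 after 30 attempts, with S = 0.6 and N = 50, smooths to about 0.626.

```python
def bayesian_average(s_adj: float, n: int, S: float, N: float) -> float:
    """Blend a proxy's adjusted rate s_adj (n attempts) with the pool average S (N attempts)."""
    if N + n == 0:
        return s_adj  # nothing to average over
    return (N * S + n * s_adj) / (N + n)


def smooth_if_needed(s_adj: float, n: int, S: float, N: float) -> float:
    """Apply smoothing only to proxies with fewer attempts than the pool average."""
    return bayesian_average(s_adj, n, S, N) if n < N else s_adj


smoothed = bayesian_average(0.670, 30, 0.6, 50)  # (50*0.6 + 30*0.670) / 80
```

The fewer attempts a proxy has, the closer its smoothed score sits to the pool average.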
Compared with the prior art, the invention has the following beneficial effects:
The invention adjusts the connection success rates of different proxies based on confidence and smooths them, so that proxies are compared uniformly under a credible and fair scoring standard; weighted random sampling on this basis yields efficient selection of available proxies.
Drawings
Fig. 1 is a schematic flowchart of the network proxy optimization method according to the present invention.
Detailed Description
An embodiment of the present invention will be described in detail below with reference to fig. 1, but it should be understood that the scope of the present invention is not limited by the embodiment.
By introducing the Wilson interval and the Bayesian average from statistical theory, the invention provides a method for scoring proxy quality. It adjusts the connection success rates of different proxies based on confidence and smooths them, so that new and old proxies are compared uniformly under a credible and fair scoring standard; weighted random sampling on this basis yields efficient selection of available proxies.
The method comprises the following implementation steps:
1. At time t_0, acquire a proxy pool and initialize the connection success rate of every proxy to 1;
2. Perform weighted random sampling based on the proxies' connection success rates. Because every proxy's success rate is currently 1, the first sampling is a draw from a uniform distribution. Use the drawn proxy to access a network target; if the connection fails, sample again; after a successful connection, update the proxy's connection success rate and record its number of connections;
3. Before time t_1, repeat step 2. This is the "warm-up phase", whose purpose is to collect each proxy's connection success rate within the time window t_Δ, where:
$$ t_\Delta = t_1 - t_0 $$
4. Starting from time t_1, perform the following steps for all proxies to adjust their connection success rates:
(1) Let s be the connection success rate of a given proxy, n its number of connection attempts, and z_{1-α/2} the z statistic corresponding to the chosen confidence level. The lower limit of the confidence interval given by the Wilson interval method is:
$$ s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}} $$
(2) Use the Wilson interval method to calculate the lower limit of the confidence interval of every proxy's connection success rate at a 95% confidence level. For example, if a proxy attempts 50 connections in total within the time window t_Δ and 40 of them succeed, its connection success rate is 0.80; the Wilson interval at the 95% level gives upper and lower confidence limits of 0.888 and 0.670 respectively, and the lower limit, 0.670, is taken as the adjusted success rate.
(3) Compute the average connection success rate S and the average number of connection attempts N over all proxies.
(4) Let s' be the adjusted success rate of a given proxy and n its number of connection attempts. The Bayesian average of the connection success rate is:
$$ s'' = \frac{N S + n s'}{N + n} $$
(5) For proxies whose number of connection attempts is below the average, apply Bayesian-average smoothing to the adjusted success rate. For example, if a proxy has an adjusted success rate of 0.670 after 30 connection attempts, the average connection success rate S of all proxies in the pool is 0.6, and the average number of connection attempts N is 50, then the final success rate after smoothing, that is, the quality score, is 0.626.
5. Step 4 yields a connection success rate that has been confidence-adjusted and smoothed. Use it as each proxy's weight, draw proxies by weighted random sampling, and access the network target through them; if a connection fails, sample again; after a successful connection, update the connection count and update the connection success rates of all proxies according to step 4. A proxy with a higher connection success rate receives a higher weight and is more likely to be selected for the next access to the target.
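Putting step 4 together, the reweighting pass over the whole pool might look like the sketch below. It is an illustration under stated assumptions rather than the patented implementation: `adjust_weights` and the `history` mapping are hypothetical, the pool average S is taken over the raw success rates (the text does not say whether raw or adjusted rates are averaged), and a proxy with no attempts keeps its initial rate.

```python
import math
from statistics import NormalDist, fmean
from typing import Dict, Tuple


def wilson_lower_bound(s: float, n: int, confidence: float = 0.95) -> float:
    """Lower Wilson limit for success rate s over n attempts."""
    if n == 0:
        return s  # assumption: an untried proxy keeps its initial rate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = s + z * z / (2 * n)
    margin = z * math.sqrt(s * (1 - s) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)


def adjust_weights(
    history: Dict[str, Tuple[int, int]],  # name -> (successes, attempts)
    confidence: float = 0.95,
) -> Dict[str, float]:
    """Return the new sampling weight of every proxy in the pool."""
    rates = {name: (ok / n if n else 1.0) for name, (ok, n) in history.items()}
    S = fmean(rates.values())                  # average success rate (raw rates: an assumption)
    N = fmean(n for _, n in history.values())  # average number of attempts
    weights = {}
    for name, (_, n) in history.items():
        score = wilson_lower_bound(rates[name], n, confidence)  # confidence adjustment
        if n < N:
            score = (N * S + n * score) / (N + n)               # Bayesian-average smoothing
        weights[name] = score
    return weights


weights = adjust_weights({"old": (40, 50), "new": (9, 10)})
```

In this two-proxy example the proxy with 50 attempts is scored by its Wilson lower limit alone, while the one with 10 attempts is pulled toward the pool average because it falls below the average attempt count of 30.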
As a comparison among proxies already in the pool, suppose the pool contains only two proxies, a with a connection success rate of 99% and b with 1%. If weights are assigned directly from the raw connection success rates, proxy a is selected 99% of the time. If, after adjustment by this method, the connection success rate of proxy a falls to 80% and that of proxy b rises to 20%, proxy a is selected 80% of the time; proxy b is still unlikely to be selected, but its selection probability is higher than before.
A dynamically added new proxy is obtained from outside the proxy pool and placed into the pool once obtained. Its connection success rate is initialized to 1, and it then takes part in step 2 and the subsequent steps, so that its connection success rate is adjusted and smoothed together with those of all other proxies in the pool.
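A minimal sketch of adding a proxy at run time, assuming the pool is kept as a plain mapping from proxy name to current weight (the names `add_proxy` and `sample_proxy` are hypothetical):

```python
import random
from typing import Dict


def add_proxy(weights: Dict[str, float], name: str) -> None:
    """Register a newly obtained proxy with the initial success rate 1.

    Existing entries are left untouched so a re-added proxy keeps its score.
    """
    weights.setdefault(name, 1.0)


def sample_proxy(weights: Dict[str, float], rng=random) -> str:
    """Weighted random draw over the current pool, new proxies included."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


pool_weights = {"old-a": 0.67, "old-b": 0.40}  # hypothetical scores after adjustment
add_proxy(pool_weights, "new-c")
choice = sample_proxy(pool_weights, random.Random(0))
```

Because the new proxy enters with weight 1, it is sampled readily at first and its score is then corrected by the same adjustment and smoothing pass as every other proxy.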
The invention treats incrementally added new proxies fairly, prevents old proxies from gaining an excessive advantage in random sampling through their accumulated access history, and is effective at efficiently selecting available proxies.
The invention belongs to the field of computers. It uses a more credible statistical approach to adjust the observed connection success rate, making the statistics-based success rate a better indicator, and is a technique that applies data analysis and statistical theory to rank network proxies by quality and select among them. By introducing the Wilson interval and the Bayesian average from statistical theory, the invention provides a method for scoring proxy quality based on each proxy's history of successful and failed connections within a past time window. The method achieves the following aims:
1. Improve the credibility of the availability scores of existing proxies.
2. Improve the fairness of the availability scores of incrementally added proxies.
3. Efficiently select available proxies from the proxy pool.
Once the proxy pool has completed the warm-up phase, confidence adjustment and smoothing are used to measure the connection quality of each proxy quantitatively. This solves the problem that raw success rates are unreliable when proxies have different numbers of connections, and it treats incrementally added new proxies fairly, preventing old proxies from gaining an excessive advantage in random sampling through their accumulated access history. Taken together, these properties make the method effective at efficiently selecting available proxies.
The above discloses only a few specific embodiments of the present invention; the invention is not limited to these embodiments, and any variation conceivable by those skilled in the art is intended to fall within its scope.

Claims (4)

1. A network proxy optimization method, comprising the steps of:
S1, acquiring a proxy pool, initializing the connection success rate of every proxy in the pool to 1, and using the connection success rate as the proxy's weight;
S2, performing weighted random sampling over all proxies according to their weights, using the sampled proxy to access a network target and attempting to connect to it; if the connection fails, performing weighted random sampling again, until the connection succeeds or the set number of failures is reached, in which case the connection is skipped; after a successful connection, updating the proxy's connection success rate s and recording its number of connection attempts n;
S3, adjusting the connection success rates of all proxies:
S31, using the Wilson interval method to calculate, at a given confidence level, the lower limit of the confidence interval of each proxy's connection success rate, and taking that lower limit as the adjusted connection success rate;
S32, computing the average connection success rate S and the average number of connection attempts N over all proxies;
S33, smoothing the adjusted connection success rate of every proxy whose number of connection attempts is below the average N;
S34, using the adjusted or smoothed connection success rate as the weight of each proxy;
S4, repeating steps S2-S3 and tracking the connection success rates of all proxies.
2. The network proxy optimization method of claim 1, wherein "all proxies" in step S2 includes both the original proxies present when the proxy pool was acquired and new proxies dynamically added to the pool afterwards, and the connection success rate of a new proxy is initialized to 1.
3. The network proxy optimization method of claim 1, wherein in step S31 the lower limit of the confidence interval is calculated by the Wilson interval method as:
$$ s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}} \tag{1} $$
In formula (1), s is the connection success rate of a given proxy, n is its number of connection attempts, z_{1-α/2} is the z statistic corresponding to the chosen confidence level, and α is the significance level (the confidence level is 1-α).
4. The network proxy optimization method of claim 1, wherein the smoothing in step S33 is performed by Bayesian averaging, and the Bayesian average of the connection success rate is:
$$ s'' = \frac{N S + n s'}{N + n} \tag{2} $$
In formula (2), s' is the adjusted connection success rate of a given proxy, S is the average connection success rate of all proxies, n is the proxy's number of connection attempts, N is the average number of connection attempts, and s'' denotes the smoothed connection success rate.
CN202010275111.9A 2020-04-09 2020-04-09 Optimization method of network proxy Active CN111488507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010275111.9A CN111488507B (en) 2020-04-09 2020-04-09 Optimization method of network proxy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010275111.9A CN111488507B (en) 2020-04-09 2020-04-09 Optimization method of network proxy

Publications (2)

Publication Number Publication Date
CN111488507A (en) 2020-08-04
CN111488507B (en) 2023-05-23

Family

ID=71812676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010275111.9A Active CN111488507B (en) 2020-04-09 2020-04-09 Optimization method of network proxy

Country Status (1)

Country Link
CN (1) CN111488507B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310012A (en) * 2013-07-02 2013-09-18 北京航空航天大学 Distributed web crawler system
CN107832355A (en) * 2017-10-23 2018-03-23 北京金堤科技有限公司 The method and device that a kind of agency of crawlers obtains
CN108595492A (en) * 2018-03-15 2018-09-28 腾讯科技(深圳)有限公司 Method for pushing and device, storage medium, the electronic device of content
CN110147271A (en) * 2019-05-15 2019-08-20 重庆八戒传媒有限公司 Promote the method, apparatus and computer readable storage medium of crawler agent quality

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
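RAYMOND S.T. LEE, et al.: "Agent-Based Web Content Engagement Time (WCET) Analyzer on e-Publication System", 《2009 NINTH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS》 *
徐林龙 (Xu Linlong), et al.: "一种基于威尔逊区间的商品好评率排名算法" (A ranking algorithm for product favorable-review rates based on the Wilson interval), 《计算机技术与发展》 (Computer Technology and Development) *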

Also Published As

Publication number Publication date
CN111488507B (en) 2023-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant