CN111488507A - Network agent optimization method

Info

Publication number: CN111488507A
Application number: CN202010275111.9A
Authority: CN
Inventors: 孙利军
Original assignee: Xi'an Film & Television Data Evaluation Center Co ltd
Current assignee: Xi'an Film & Television Data Evaluation Center Co ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-08-04
Anticipated expiration: 2040-04-09
Also published as: CN111488507B

Abstract

The invention relates to the technical field of computers and discloses a network proxy optimization method comprising the following steps: acquiring a proxy pool, initializing the connection success rate of every proxy in the pool to 1, and taking the connection success rate as the proxy's weight; calculating, by the Wilson interval method, the lower limit of the confidence interval of each proxy's connection success rate at a given confidence level, and taking that lower limit as the adjusted connection success rate; computing the average connection success rate S and the average number of connection attempts N over all proxies; smoothing the adjusted connection success rate of any proxy whose number of connection attempts is below the average N; using the adjusted or smoothed connection success rate as the weight of each proxy; and repeating steps S2-S3 while collecting statistics on the connection success rate of all proxies. The method is effective at selecting available proxies efficiently.

Description

Network agent optimization method

Technical Field

The invention relates to the technical field of computers, in particular to a network agent optimization method.

Background

When a web crawler is used to capture large volumes of public internet resources, using a web proxy service is a common way to improve capture efficiency and extend the crawler's lifetime. Proxy services are generally provided by independent third parties and come in two main modes: random access and proxy pool. The former provides a high-quality proxy with a short life cycle, and a new proxy is acquired each time one is used. If the latter, proxy pool mode, is chosen for reasons of cost and security, a pool composed of a large number of proxies is obtained in the initial state, and a proxy is selected from the pool by a specific algorithm at the time of use. Common selection algorithms include round-robin (traversing the pool to find an available proxy), random (selection by pseudo-random number), weighted random (assigning each proxy a different weight and then selecting at random), and so on. These algorithms become inefficient when proxy connections are unstable or other conditions make proxies unavailable, and they cannot differentiate and fairly treat new proxies that are dynamically added to the pool.
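The three baseline strategies named above can be sketched in Python as follows. This is only an illustration: the proxy addresses and the function names are placeholders that do not come from the patent.

```python
import itertools
import random

# Placeholder pool; the addresses are hypothetical.
proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

# Round-robin: walk the pool in a fixed order, wrapping around at the end.
_round_robin = itertools.cycle(proxies)

def pick_round_robin():
    return next(_round_robin)

# Random: every proxy is equally likely to be chosen.
def pick_random(rng):
    return rng.choice(proxies)

# Weighted random: selection probability is proportional to each weight.
def pick_weighted(weights, rng):
    return rng.choices(proxies, weights=weights, k=1)[0]
```

None of the three uses connection history, which is the gap the method described below is meant to address.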

Disclosure of Invention

The invention provides a network agent optimization method, which can efficiently select available agents.

The invention provides a network agent optimization method, which comprises the following steps:

S1, acquiring a proxy pool, initializing the connection success rate of every proxy in the pool to 1, and taking the connection success rate as the proxy's weight;

S2, performing weighted random sampling over all proxies according to their weights and using the sampled proxy to attempt a connection to the network target; if the connection fails, performing weighted random sampling again, until either a connection succeeds or a set number of failures is reached, in which case the target is skipped; after a successful connection, updating the proxy's connection success rate s and recording its number of connection attempts n;

S3, adjusting the connection success rate of all proxies;

S31, calculating, by the Wilson interval method, the lower limit of the confidence interval of each proxy's connection success rate at a given confidence level, and taking that lower limit as the adjusted connection success rate;

S32, computing the average connection success rate S and the average number of connection attempts N over all proxies;

S33, for each proxy whose number of connection attempts is below the average N, smoothing the adjusted connection success rate;

S34, using the adjusted or smoothed connection success rate as the weight of each proxy;

and S4, repeating steps S2-S3 and collecting statistics on the connection success rate of all proxies.

In step S2, "all proxies" includes both the original proxies present in the pool when it was acquired and new proxies dynamically added to the pool later; the connection success rate of a new proxy is initialized to 1.
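The sampling-and-retry loop of step S2 might look like the following Python sketch. The `stats` layout, the `try_connect` callback, and the default `max_failures` are hypothetical names introduced for illustration, and counting failed attempts toward n is an assumption, since the text only says that the attempt count is recorded.

```python
import random

def sample_and_connect(stats, weights, try_connect, max_failures=5, rng=random):
    """Weighted random sampling with retry, in the spirit of step S2.

    stats       : dict proxy -> [successes, attempts]  (hypothetical layout)
    weights     : dict proxy -> current weight
    try_connect : callable(proxy) -> bool, supplied by the caller
    Returns the proxy that connected, or None once max_failures is reached.
    """
    proxies = list(weights)
    for _ in range(max_failures):
        # Draw one proxy with probability proportional to its weight.
        proxy = rng.choices(proxies, weights=[weights[p] for p in proxies], k=1)[0]
        stats[proxy][1] += 1      # assumption: every attempt counts toward n
        if try_connect(proxy):
            stats[proxy][0] += 1  # record the success
            return proxy
    return None                   # give up on this target
```

Because all weights start at 1, the first draws are uniform; the weights only diverge once step S3 has been applied.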

In step S31, the lower limit of the confidence interval is calculated by the Wilson interval method as:

$$
s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}} \tag{1}
$$

In formula (1), s is the connection success rate of a given proxy, n is its number of connection attempts, s' is the resulting lower limit, and $z_{1-\alpha/2}$ is the z statistic corresponding to the chosen confidence level $1-\alpha$.
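As a rough check of step S31, the sketch below computes the standard Wilson score interval in Python. It assumes the standard form of the interval, which reproduces the 0.670 and 0.888 figures given in the detailed description for 40 successes in 50 attempts at the 95% level. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(s, n, confidence=0.95):
    """Return (lower, upper) Wilson bounds for success rate s over n attempts."""
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{1-alpha/2}, about 1.96 at 95%
    centre = s + z * z / (2 * n)
    spread = z * sqrt(s * (1 - s) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

lower, upper = wilson_interval(40 / 50, 50)
print(round(lower, 3), round(upper, 3))
```

Only the lower bound is used as the adjusted success rate; it penalizes proxies with few attempts, because the interval widens as n shrinks.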

In step S33, the smoothing is performed by Bayesian averaging, and the Bayesian average of the connection success rate is:

$$
\tilde{s} = \frac{N S + n s'}{N + n} \tag{2}
$$

In formula (2), $\tilde{s}$ is the smoothed connection success rate, s' is the adjusted connection success rate of a given proxy, S is the average connection success rate of all proxies, n is the proxy's number of connection attempts, and N is the average number of connection attempts.
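The smoothing of step S33 can be sketched in Python as below. It assumes the standard Bayesian-average form (N·S + n·s') / (N + n), which reproduces the 0.626 figure in the detailed description. The function and parameter names are hypothetical.

```python
def bayesian_average(s_adj, n, pool_mean_rate, pool_mean_attempts):
    """Pull the adjusted rate s_adj toward the pool mean, weighted by attempt counts."""
    return (pool_mean_attempts * pool_mean_rate + n * s_adj) / (pool_mean_attempts + n)

# Figures from the detailed description: s' = 0.670, n = 30, S = 0.6, N = 50.
score = bayesian_average(0.670, 30, 0.6, 50)
print(round(score, 3))  # 0.626
```

The fewer attempts a proxy has relative to N, the closer its score sits to the pool average S, which is how low-sample proxies avoid both inflated and deflated scores.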

Compared with the prior art, the invention has the beneficial effects that:

The invention adjusts the connection success rates of different proxies based on confidence and smooths them, so that proxies are compared uniformly under a credible and fair scoring standard; weighted random sampling performed on this basis then obtains available proxies efficiently.

Drawings

Fig. 1 is a schematic flowchart of the network proxy optimization method according to the present invention.

Detailed Description

An embodiment of the present invention will be described in detail below with reference to fig. 1, but it should be understood that the scope of the present invention is not limited by the embodiment.

By introducing the Wilson interval and the Bayesian average from statistical theory, the invention provides a method for scoring proxy quality. The method adjusts the connection success rates of different proxies based on confidence and smooths them, so that new and old proxies are compared uniformly under a credible and fair scoring standard. Weighted random sampling is then performed on this basis, so that available proxies are obtained efficiently.

The method comprises the following implementation steps:

1. At time t₀, acquire a proxy pool and initialize the connection success rate of all proxies to 1;

2. Perform weighted random sampling based on the proxies' connection success rates. Because every proxy's success rate is currently 1, the first sampling is a draw from a uniform distribution. Access the network target through the sampled proxy; if the connection fails, sample again; after a successful connection, update the proxy's connection success rate and record its number of connections;

3. Until time t₁, repeat step 2. This is the "warm-up phase", whose purpose is to collect each proxy's connection success rate within the time window t_Δ, where:

t_Δ = t₁ - t₀

4. Starting from time t₁, perform the following steps for all proxies to adjust the connection success rate:

(1) Let s be the connection success rate of a given proxy, n its number of connection attempts, and $z_{1-\alpha/2}$ the z statistic corresponding to the chosen confidence level. The lower limit of the confidence interval under the Wilson interval method is then:

$$
s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}}
$$

(2) Using the Wilson interval method, calculate the lower limit of the confidence interval of every proxy's connection success rate at a 95% confidence level. For example, if a proxy makes 50 connection attempts within the time window t_Δ, of which 40 succeed, its connection success rate is 0.80; the upper and lower limits of the 95% Wilson confidence interval are 0.888 and 0.670 respectively, and the lower limit, 0.670, is taken as the adjusted success rate.

(3) Compute the average connection success rate S and the average number of connection attempts N over all proxies.

(4) Let s' be the adjusted success rate of a given proxy and n its number of connection attempts. With S and N as computed in (3), the Bayesian average of the connection success rate is:

$$
\tilde{s} = \frac{N S + n s'}{N + n}
$$

(5) For proxies whose number of connection attempts is below the average, Bayesian-average smoothing is applied to the adjusted success rate. For example, if a proxy's adjusted success rate after 30 connection attempts is 0.670, the average connection success rate S of all proxies in the pool is 0.6, and the average number of connection attempts N is 50, then the final success rate obtained by smoothing, that is, the quality score, is 0.626.

5. Step 4 yields a connection success rate that has been confidence-adjusted and smoothed. Use this success rate as the weight of each proxy and select a proxy by weighted random sampling. Access the network target through the selected proxy; if the connection fails, sample again; after a successful connection, update the number of connections and update the connection success rates of all proxies according to step 4. A proxy with a higher connection success rate is assigned a higher weight and is more likely to be selected for the next access to the target.
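Taken together, step 4 turns raw per-proxy counts into the weights used in step 5. The Python sketch below is one possible reading, not the patent's reference implementation. It assumes that S is averaged over the adjusted rates, that a proxy with no attempts keeps its initial rate of 1 (the Wilson bound is undefined for n = 0), and that z = 1.96 stands in for the 95% level. All names are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # z statistic for a 95% confidence level

def wilson_lower(s, n, z=Z_95):
    """Lower Wilson bound for success rate s over n attempts."""
    root = sqrt(s * (1 - s) / n + z * z / (4 * n * n))
    return (s + z * z / (2 * n) - z * root) / (1 + z * z / n)

def adjust_weights(stats):
    """stats maps proxy -> (successes, attempts); returns proxy -> weight."""
    # Step 4(2): confidence adjustment. Assumption: a proxy with no
    # attempts keeps the initial rate of 1.
    adjusted = {p: wilson_lower(ok / n, n) if n else 1.0
                for p, (ok, n) in stats.items()}
    # Step 4(3): pool averages (assumed to be taken over adjusted rates).
    S = sum(adjusted.values()) / len(adjusted)
    N = sum(n for _, n in stats.values()) / len(stats)
    # Steps 4(4)-(5): smooth only proxies with fewer attempts than average.
    weights = {}
    for p, (_, n) in stats.items():
        s_adj = adjusted[p]
        weights[p] = (N * S + n * s_adj) / (N + n) if n < N else s_adj
    return weights

weights = adjust_weights({"old": (700, 1000), "new": (5, 5)})
```

In this example the long-running proxy ends up with a weight of about 0.67 and the newcomer with about 0.62, even though the newcomer's raw success rate is 1.0 after only five attempts.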

As a comparison among existing proxies, suppose the pool contains only two proxies, a and b, with connection success rates of 99% and 1% respectively. If weights are assigned from the raw success rates, proxy a is selected 99% of the time. If, after adjustment by this method, the success rate of proxy a falls to 80% and that of proxy b rises to 20%, proxy a is selected 80% of the time; proxy b is still unlikely to be selected, but its chance of selection is higher than before.
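The selection percentages in this comparison follow from normalizing the weights, as the short Python sketch below shows. The helper name is hypothetical, and the 80%/20% values are the illustrative figures from the paragraph above rather than computed results.

```python
def selection_probabilities(weights):
    """Normalize weights into the probabilities used by weighted random sampling."""
    total = sum(weights.values())
    return {proxy: w / total for proxy, w in weights.items()}

# Raw success rates versus the adjusted rates quoted in the text.
before = selection_probabilities({"a": 0.99, "b": 0.01})
after = selection_probabilities({"a": 0.80, "b": 0.20})
```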

A dynamically added new proxy is obtained from outside the proxy pool and placed into the pool once obtained. Its connection success rate is initialized to 1, and it then takes part in step 2 and the subsequent steps, so that its success rate is adjusted and smoothed together with all other proxies in the pool.

The invention treats incrementally added new proxies fairly, prevents old proxies from gaining an excessive advantage in random sampling through accumulated historical access counts, and is effective at selecting available proxies efficiently.

The invention belongs to the field of computers. It adjusts the observed connection success rate by more reliable statistical means, so that the statistically adjusted success rate becomes a better indicator, and it belongs to the class of techniques that use data analysis and statistical theory to rank network proxies by quality and select among them. By introducing the Wilson interval and the Bayesian average from statistical theory, the invention provides a method for scoring proxy quality based on each proxy's history of successful and failed connections within a past time window. The method achieves the following aims:

1. Improving the credibility of availability scores for existing proxies.

2. Improving the fairness of availability scores for incrementally added proxies.

3. Efficiently selecting available proxies from the proxy pool.

After the proxy pool completes the warm-up phase, confidence adjustment and smoothing are used to measure the connection quality of each proxy quantitatively. This solves the problem that raw success rates are unreliable when proxies have different numbers of connections, and it allows incrementally added new proxies to be treated fairly, so that old proxies do not gain an excessive advantage in random sampling through accumulated historical access counts. Combining these advantages, the method is effective at selecting available proxies efficiently.

The above discloses only a few specific embodiments of the present invention. The present invention is not limited to these embodiments, and any variation that a person skilled in the art can conceive of is intended to fall within the scope of the present invention.

Claims

1. A network proxy optimization method, comprising the following steps:

S3, adjusting the connection success rate of all proxies;

and S4, repeating steps S2-S3 and collecting statistics on the connection success rate of all proxies.

2. The network proxy optimization method of claim 1, wherein "all proxies" in step S2 includes both the original proxies present in the proxy pool when it was acquired and new proxies dynamically added to the pool later, and the connection success rate of a new proxy is initialized to 1.

3. The network proxy optimization method of claim 1, wherein in step S31 the lower limit of the confidence interval is calculated by the Wilson interval method as:

$$
s' = \frac{s + \dfrac{z_{1-\alpha/2}^{2}}{2n} - z_{1-\alpha/2}\sqrt{\dfrac{s(1-s)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}} \tag{1}
$$

In formula (1), s is the connection success rate of a given proxy, n is its number of connection attempts, s' is the resulting lower limit, and $z_{1-\alpha/2}$ is the z statistic corresponding to the chosen confidence level $1-\alpha$.

4. The network proxy optimization method of claim 1, wherein the smoothing in step S33 is performed by Bayesian averaging, and the Bayesian average of the connection success rate is: