CN112287200A

CN112287200A - Multi-target-oriented social public safety risk data acquisition method

Info

Publication number: CN112287200A
Application number: CN202011312018.7A
Authority: CN
Inventors: 王慧娟; 王晓峰; 印晓天
Original assignee: First Research Institute of Ministry of Public Security
Current assignee: First Research Institute of Ministry of Public Security
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2021-01-29
Anticipated expiration: 2040-11-20
Also published as: CN112287200B

Abstract

The invention discloses a multi-target-oriented social public safety risk data acquisition method. Multiple crawling targets are first determined and an extraction rule is formulated for each target; all subtasks are then ordered by permutation mapping to form multiple short-time sequences, and the subtasks are executed in sequence using a sliding window of adaptive size until crawling is finished. The invention thus provides a crawler method based on permutation mapping and an adaptive rate, which improves crawler efficiency as far as possible while keeping the crawler safe.

Description

Multi-target-oriented social public safety risk data acquisition method

Technical Field

The invention relates to the technical field of data processing, and in particular to a multi-target-oriented social public safety risk data acquisition method.

Background

With the advent of the big data era, the volume of data on the internet has grown rapidly, and much of it relates to social public safety risks. Such data are large in volume, high in dimensionality and drawn from a wide range of sources, so traditional collection methods aimed at ordinary internet information can hardly meet current requirements. How to collect social public safety risk data quickly has therefore become important.

Social public safety risk data are of great significance to the emergency response and risk prevention and control work of the state and the government. Their main sources are social networking sites, mainstream news sites, government announcement sites and public opinion information publishing sites. At present, web crawlers are used with crawling rules designed for each data source and target: the crawler selectively accesses network resources and their related links, then analyzes and filters them to extract the valid information, which is stored permanently for subsequent query and use.

Each search engine applies certain protection measures to its servers, namely a limit on the number of times a user may access it within a given period. To avoid violating these protection rules, an effective approach is to limit the number of accesses per unit time, but this reduces crawler efficiency; with today's huge data volumes it costs a great deal of time, and the data easily lose their timeliness. Most traditional crawler methods consider only crawling efficiency, yet most websites now adopt strong anti-crawling mechanisms, which raises the problem of crawler safety.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a multi-target-oriented social public safety risk data acquisition method that improves crawler efficiency as far as possible while keeping the crawler safe.

In order to achieve the purpose, the invention adopts the following technical scheme:

a multi-target-oriented social public safety risk data acquisition method comprises the following steps:

S1, the first Q pieces of information of each target website are to be crawled, wherein each website is a crawling target and each piece of information is a subtask; the target websites to be crawled are determined, and their number is denoted S; the subtask matrix is defined as:

$$A = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{S1} \\ a_{12} & a_{22} & \cdots & a_{S2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1Q} & a_{2Q} & \cdots & a_{SQ} \end{bmatrix}$$

wherein a_xy represents the y-th subtask of the x-th target website, x = 1, 2, ..., S, y = 1, 2, ..., Q;

a corresponding extraction mode is formulated for each target website, and the extraction mode matrix is defined as:

F＝[f₁ f₂ f₃ f₄...f_S]；

wherein f_x (x = 1, 2, ..., S) represents the extraction mode of the x-th target website;

S2, a permutation is defined:

$$\pi = \begin{pmatrix} 1 & 2 & \cdots & S \\ k_1 & k_2 & \cdots & k_S \end{pmatrix}$$

and each row in the subtask matrix A is mapped and combined to form a new task queue S_a：

S_a＝[A₁π,A₂π,…,A_Qπ]

＝[s₁,s₂,…,s_L]

wherein k₁,k₂,…,k_S is a rearrangement of the sequence numbers 1,2,…,S; A_w represents the w-th row of the subtask matrix A, w = 1, 2, ..., Q; A_wπ is the subtask frame structure; s₁,s₂,…,s_L are the subtasks reordered by the mapping, and L = S × Q;

S3, crawling starts: for a given subtask s_i, i = 1, 2, ..., L, a request is sent at time t_i and the server response is received at time t_i+Δt_i, so the period Δt_i is the response waiting time; this waiting time is used to send further requests, which forms a request window; the request window slides from left to right over the task queue of step S2, and a subtask is allowed to execute only while it lies within the request window interval, so that the execution intervals of all tasks are:

wherein t_si is the time interval occupied by subtask s_i, and τ is the time the system needs to switch tasks; the total time consumption is:

t_sum＝(L-1)τ+max(Δt₁-(L-1)τ,Δt₂-(L-2)τ,…,Δt_L)；

S4, after crawling is finished, the page content of each target website is filtered using the corresponding extraction mode in F specified in step S1, and the valid information is stored in a database or exported as an Excel spreadsheet.

Further, in step S3, the request window size window(t) starts from 0 and grows exponentially with base a; after reaching a preset threshold th₁, it switches to linear growth with coefficient k. If at time t = t_m the access of a subtask is rejected, window(t) is reset to 0, the threshold th_n is set to 1/2 of the previous threshold th_{n-1}, and exponential growth begins again. If at time t = t_m an access timeout occurs, both window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.
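The sliding request window of step S3 can be illustrated with a minimal sketch. It assumes a fixed window size and uses Python's asyncio, neither of which is specified by the invention; `crawl_with_window` and the `fetch` callable are hypothetical names introduced only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def crawl_with_window(
    queue: Sequence[T],
    fetch: Callable[[T], Awaitable[R]],
    window: int,
) -> List[R]:
    """Run every subtask in queue order with at most `window` requests in flight."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # One semaphore slot per position in the request window (assumed fixed size).
    slots = asyncio.Semaphore(window)

    async def run(task: T) -> R:
        async with slots:             # wait until the window has slid far enough
            return await fetch(task)  # the slot is freed once the response arrives

    # gather() returns the results in task-queue order.
    return list(await asyncio.gather(*(run(task) for task in queue)))
```

While one request is waiting for its response, the remaining slots let later subtasks in the queue send their requests, which is the effect the request window is meant to achieve.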

The invention has the beneficial effects that: the method provides a crawler method based on the permutation mapping and the self-adaptive rate, and the efficiency of the crawler can be improved as much as possible under the condition of ensuring the safety of the crawler.

Drawings

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a structure of a request window according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the change in window size under speed-control-based exception handling according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, and it should be noted that the present embodiment is based on the technical solution, and the detailed implementation and the specific operation process are provided, but the protection scope of the present invention is not limited to the present embodiment.

This embodiment provides a multi-target-oriented social public safety risk data acquisition method. In this embodiment, the first 200 pieces of information on the main page of each major mainstream news website need to be crawled, wherein each website is a crawling target and each piece of information is a subtask. As shown in FIG. 1, the method comprises the following steps:

S1, the target websites to be crawled are determined first. The mainstream news websites selected in this embodiment comprise four targets, namely People's Daily Online, Xinhua News Agency, China Radio International and China Daily. The subtask matrix is defined as:

$$A = \begin{bmatrix} a_{d1} & a_{s1} & a_{t1} & a_{e1} \\ a_{d2} & a_{s2} & a_{t2} & a_{e2} \\ \vdots & \vdots & \vdots & \vdots \\ a_{d200} & a_{s200} & a_{t200} & a_{e200} \end{bmatrix}$$

wherein a_dj denotes the j-th subtask of People's Daily Online, a_sj denotes the j-th subtask of Xinhua News Agency, a_tj denotes the j-th subtask of China Radio International, and a_ej denotes the j-th subtask of China Daily; j is a sequence number, j = 1, 2, ..., 200.

A corresponding extraction mode, such as tag positioning or range searching, is formulated and customized for each target website. The extraction mode matrix is defined as:

F＝[f_a f_b f_c f_d]

wherein f_a is the extraction mode for People's Daily Online, f_b is the extraction mode for Xinhua News Agency, f_c is the extraction mode for China Radio International, and f_d is the extraction mode for China Daily.
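The two extraction modes named above can be sketched as follows. This is only an illustration using the Python standard library: the tag name `h3`, the comment markers, and the helper names `extract_by_tag`, `extract_by_range` and `F` are assumptions made for the example, not rules taken from the embodiment.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Dict, List


class TagTextCollector(HTMLParser):
    """Collects the text inside every occurrence of one tag (tag positioning)."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.items: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def extract_by_tag(html: str, tag: str) -> List[str]:
    """Tag positioning: return the non-empty text of each <tag> element."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items if item.strip()]


def extract_by_range(html: str, start: str, end: str) -> List[str]:
    """Range searching: return the text between each start/end marker pair."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return [m.strip() for m in re.findall(pattern, html, flags=re.DOTALL)]


# Hypothetical extraction mode matrix: one callable per target website.
F: Dict[str, Callable[[str], List[str]]] = {
    "f_a": lambda html: extract_by_tag(html, "h3"),
    "f_b": lambda html: extract_by_range(html, "<!--title-->", "<!--/title-->"),
}
```

Keeping one callable per website mirrors the row vector F, so a page from the x-th target is always filtered by its own rule.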

S2, a permutation is defined:

$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 \\ k_1 & k_2 & k_3 & k_4 \end{pmatrix}$$

In the formula, k₁、k₂、k₃、k₄ is a rearrangement of the sequence numbers 1, 2, 3, 4. Each row in the subtask matrix A is mapped and combined to form a new task queue S_a：

S_a＝[A₁π,A₂π,…,A₂₀₀π]

＝[s₁,s₂,…,s₈₀₀]

wherein A_w denotes the w-th row of the subtask matrix A, w = 1, 2, ..., 200; A_wπ is the subtask frame structure; and s₁,s₂,…,s₈₀₀ are the subtasks reordered by the mapping. The permutation π randomly maps each row of subtasks into a frame structure and ensures that the subtasks of a single target are distributed uniformly in the time domain.
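As a rough illustration of step S2, the sketch below builds the task queue from the rows of A. It assumes that one permutation is applied to every row, that it is written 0-based, and that position j of a frame takes element k_j of the row; the function name `build_task_queue` and the site labels are hypothetical.

```python
from typing import List, Sequence, Tuple

Subtask = Tuple[str, int]  # (site label, item index y)


def build_task_queue(sites: Sequence[str], q: int, perm: Sequence[int]) -> List[Subtask]:
    """Concatenate the permuted rows A_1*pi, ..., A_Q*pi into one queue."""
    if sorted(perm) != list(range(len(sites))):
        raise ValueError("perm must be a rearrangement of 0..S-1")
    queue: List[Subtask] = []
    for w in range(1, q + 1):
        row = [(site, w) for site in sites]  # row A_w: w-th subtask of every site
        queue.extend(row[k] for k in perm)   # frame A_w*pi
    return queue


# Four targets, 200 subtasks each, as in the embodiment; the permutation is arbitrary.
queue = build_task_queue(["d", "s", "t", "e"], q=200, perm=[2, 0, 3, 1])
```

Under these assumptions each site appears exactly once per frame, so two consecutive requests to the same site are always separated by requests to the other three sites.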

S3, a distributed crawler mode is adopted in this embodiment: for a given subtask s_i, a request is sent at time t_i and the server response is received at time t_i+Δt_i. The period Δt_i is therefore the response waiting time, and it can be used to send further requests, which forms a request window, as shown in FIG. 2. The request window slides from left to right over the task queue of step S2, and a subtask is allowed to execute only while it lies within the request window interval, so that the execution intervals of all tasks are:

wherein t_si is the time interval occupied by subtask s_i, τ is the time the system needs to switch tasks, and τ is far smaller than Δt_si. The total time consumption is:

t_sum＝799τ+max(Δt₁-799τ,Δt₂-798τ,…,Δt₈₀₀)。
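The total-time expression can be checked numerically. The sketch below assumes that subtask s_i is dispatched at time (i-1)τ and finishes at (i-1)τ+Δt_i, which is an interpretation and not stated explicitly in the text; under that assumption the expression above equals the latest finishing time. The function names are hypothetical.

```python
import math
from typing import Sequence


def total_time_formula(delta: Sequence[float], tau: float) -> float:
    """(n-1)*tau + max(dt_1-(n-1)*tau, dt_2-(n-2)*tau, ..., dt_n), n = len(delta)."""
    n = len(delta)
    return (n - 1) * tau + max(d - (n - i) * tau for i, d in enumerate(delta, start=1))


def total_time_direct(delta: Sequence[float], tau: float) -> float:
    """Latest finishing time when s_i starts at (i-1)*tau (assumed dispatch rule)."""
    return max((i - 1) * tau + d for i, d in enumerate(delta, start=1))


def total_time_serial(delta: Sequence[float]) -> float:
    """Baseline: each request waits for the previous response before starting."""
    return sum(delta)


waits = [0.5, 0.2, 0.9]  # illustrative response waiting times in seconds
assert math.isclose(total_time_formula(waits, 0.01), total_time_direct(waits, 0.01))
```

With n = 800 the first function reproduces the 799τ form used in this embodiment; because τ is far smaller than the waiting times, the result stays close to the longest single wait instead of the sum of all waits.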

As shown in FIG. 1, if a task fails during execution, it may be re-executed; if it still fails after T attempts, it is not attempted again and execution continues with the next task.
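The retry rule can be sketched in a few lines. The helper name `run_with_retries` and the choice to signal a failed attempt with an exception are assumptions for illustration; the embodiment only fixes the limit of T attempts.

```python
from typing import Callable, Optional, TypeVar

R = TypeVar("R")


def run_with_retries(fetch: Callable[[], R], max_attempts: int) -> Optional[R]:
    """Try one subtask up to max_attempts (T) times; return None if every attempt fails."""
    for _ in range(max_attempts):
        try:
            return fetch()
        except Exception:
            continue  # failed attempt: re-execute the same subtask
    return None       # give up so the caller can move on to the next subtask
```

Returning `None` instead of raising lets the caller skip the abandoned subtask without interrupting the rest of the queue.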

During task execution, a subtask request may be rejected by the server, or a subtask may time out. This embodiment therefore provides an exception handling mode based on speed control, which keeps the whole job in a safe and stable state by dynamically adjusting the size of the request window. As shown in FIG. 3, the process can be abstracted as the following function:

The request window size window(t) starts from 0 and grows exponentially with base a; after reaching the preset threshold th₁, it switches to linear growth with coefficient k. If at time t = t_m (shown as time t₂ in the above formula) the access of a subtask is rejected, window(t) is reset to 0, the threshold th_n (th₂ in the above formula) is set to 1/2 of the previous threshold th_{n-1} (th₁ in the above formula), and exponential growth begins again. If at time t = t_m (shown as time t₄ in the above formula) an access timeout occurs, both window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.
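The window adjustment rules can be sketched as a small state machine. Several details here are assumptions rather than part of the embodiment: time is treated as discrete steps, growth from 0 is modeled as 0, 1, a, a², ..., the window is capped at the threshold when it switches to linear growth, and the class name `RequestWindow` and its default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    a: float = 2.0            # base of the exponential growth phase
    k: float = 1.0            # increment per step in the linear growth phase
    threshold: float = 16.0   # current threshold th_n
    size: float = 0.0         # current window size window(t)
    linear: bool = False      # False: exponential phase, True: linear phase

    def on_success(self) -> None:
        """Grow the window after a step without rejection or timeout."""
        if self.linear:
            self.size += self.k
            return
        grown = 1.0 if self.size == 0 else self.size * self.a
        if grown >= self.threshold:
            grown = self.threshold  # assumed cap at the threshold
            self.linear = True      # switch to linear growth
        self.size = grown

    def on_rejected(self) -> None:
        """Access rejected: reset the window, halve the threshold, restart exponential growth."""
        self.size = 0.0
        self.threshold /= 2
        self.linear = False

    def on_timeout(self) -> None:
        """Access timeout: halve both window and threshold, keep growing linearly."""
        self.size /= 2
        self.threshold /= 2
        self.linear = True
```

A rejection is treated as the stronger warning and sends the window back to zero, whereas a timeout only halves it; in both cases the lower threshold makes the next approach to the server's limit more cautious.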

S4, the page content of each target website is filtered using the corresponding extraction mode in F specified in step S1, and the valid information is saved in a database or exported as an Excel spreadsheet.
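A minimal sketch of the filtering and storage step is given below. It assumes SQLite as the database and a table named `items`, and it treats blank or duplicate entries as invalid; none of these choices come from the embodiment, and `store_results` is a hypothetical helper.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List, Tuple


def store_results(
    pages: Iterable[Tuple[str, str]],
    extractors: Dict[str, Callable[[str], List[str]]],
    conn: sqlite3.Connection,
) -> int:
    """Apply each site's extraction mode to its pages and store the non-empty results.

    `pages` yields (site, html) pairs; the return value is the number of new rows.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "site TEXT, content TEXT, UNIQUE(site, content))"
    )
    before = conn.total_changes
    for site, html in pages:
        for content in extractors[site](html):
            text = content.strip()
            if not text:
                continue  # drop blank entries
            # Duplicates are skipped by the UNIQUE constraint.
            conn.execute(
                "INSERT OR IGNORE INTO items (site, content) VALUES (?, ?)",
                (site, text),
            )
    conn.commit()
    return conn.total_changes - before
```

The same rows could instead be written to a spreadsheet; the database variant is shown because it needs only the standard library.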

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

1. A multi-target-oriented social public safety risk data acquisition method is characterized by comprising the following steps:

F＝[f₁ f₂ f₃ f₄...f_S]；

s2, defining permutations

S_a＝[A₁π,A₂π,…,A_Qπ]

＝[s₁,s₂,…,s_L]

t_sum＝(L-1)τ+max(Δt₁-(L-1)τ,Δt₂-(L-2)τ,…,Δt_L)；

2. The method according to claim 1, wherein in step S3, the request window size window(t) starts from 0 and grows exponentially with base a; after reaching a preset threshold th₁, it switches to linear growth with coefficient k; if at time t = t_m the access of a subtask is rejected, window(t) is reset to 0, the threshold th_n is set to 1/2 of the previous threshold th_{n-1}, and exponential growth begins again; if at time t = t_m an access timeout occurs, both window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.