CN112287200A - Multi-target-oriented social public safety risk data acquisition method - Google Patents

Multi-target-oriented social public safety risk data acquisition method

Info

Publication number
CN112287200A
Authority
CN
China
Prior art keywords
subtask
target
crawling
window
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011312018.7A
Other languages
Chinese (zh)
Other versions
CN112287200B (en)
Inventor
王慧娟
王晓峰
印晓天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Research Institute of Ministry of Public Security
Original Assignee
First Research Institute of Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Research Institute of Ministry of Public Security
Priority to CN202011312018.7A
Publication of CN112287200A
Application granted
Publication of CN112287200B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a multi-target-oriented social public safety risk data acquisition method. The method first determines multiple crawling targets and formulates an extraction rule for each target, then orders all subtasks by permutation mapping to form multiple short time sequences, and finally executes the subtasks in sequence using a sliding window of adaptive size until crawling is finished. By combining permutation mapping with an adaptive request rate, the method improves crawler efficiency as far as possible while ensuring the safety of the crawler.

Description

Multi-target-oriented social public safety risk data acquisition method
Technical Field
The invention relates to the technical field of data processing, and in particular to a multi-target-oriented social public safety risk data acquisition method.
Background
With the advent of the big data era, the volume of data on the internet has grown rapidly, and much of it relates to social public safety risks. Such data is large in volume, high in dimensionality and drawn from a wide range of sources, so traditional collection methods designed for ordinary internet information struggle to meet current requirements. Collecting social public safety risk data quickly has therefore become important.
Social public safety risk data is important for emergency response and for risk prevention and control by the state and the government. Its main sources are social networking sites, mainstream news sites, government announcement sites and public opinion information publishing sites. At present, a web crawler is used with data crawling rules designed for each data source and target: it selectively accesses network resources and their related links, then parses and filters them to extract the effective information, which is stored persistently for later query and use.
Search engines take protective measures for their servers, namely limiting the number of times a user may access them within a given period. To avoid violating these protection protocols, an effective approach is to limit the number of accesses per unit time, but this reduces crawler efficiency; with today's huge data volumes it costs a great deal of time, and the data easily loses its timeliness. Most traditional crawler methods consider only crawling efficiency, but most websites now adopt strong anti-crawling mechanisms, which raises the problem of crawler safety.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a multi-target-oriented social public safety risk data acquisition method that improves crawler efficiency as far as possible while ensuring the safety of the crawler.
To achieve this purpose, the invention adopts the following technical solution:
A multi-target-oriented social public safety risk data acquisition method comprises the following steps:
S1. The first Q pieces of information of each target website are to be crawled, where each website is a crawling target and each piece of information is a subtask. Determine the target websites to be crawled, and let their number be S. Define the subtask matrix as:
$$A=\begin{bmatrix} a_{11} & a_{21} & \cdots & a_{S1} \\ a_{12} & a_{22} & \cdots & a_{S2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1Q} & a_{2Q} & \cdots & a_{SQ} \end{bmatrix}$$
where $a_{xy}$ denotes the y-th subtask of the x-th target website, $x=1,2,\ldots,S$, $y=1,2,\ldots,Q$;
formulate a corresponding extraction mode for each target website, the extraction-mode matrix being defined as:
$$F=[f_1\ f_2\ f_3\ f_4\ \ldots\ f_S]$$
where $f_x$, $x=1,2,\ldots,S$, denotes the extraction mode of the x-th target website;
S2. Define the permutation
$$\pi=\begin{pmatrix} 1 & 2 & \cdots & S \\ k_1 & k_2 & \cdots & k_S \end{pmatrix}$$
and map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a=[A_1\pi,\ A_2\pi,\ \ldots,\ A_Q\pi]=[s_1,\ s_2,\ \ldots,\ s_L]$$
where $k_1,k_2,\ldots,k_S$ is a rearrangement of the sequence numbers $1,2,\ldots,S$; $A_w$ denotes the w-th row of the subtask matrix A, $w=1,2,\ldots,Q$; $A_w\pi$ is the subtask frame structure; $s_1,s_2,\ldots,s_L$ are the subtasks reordered by the mapping; and $L=S\times Q$;
S3. Start crawling. For a given subtask $s_i$, $i=1,2,\ldots,L$, a request is sent at time $t_i$ and the server response is received at time $t_i+\Delta t_i$; the period $\Delta t_i$ is the response waiting time, and it is used to send further requests, forming a request window. The request window slides from left to right over the task queue of step S2, and a subtask is allowed to execute only while it lies within the request window interval, so the execution intervals of all tasks are:
Figure BDA0002790113550000032
where $t_{s_i}$ is the time interval occupied by subtask $s_i$ and $\tau$ is the time the system needs to switch tasks. The total time consumption is:
$$t_{sum}=799\tau+\max(\Delta t_1-799\tau,\ \Delta t_2-798\tau,\ \ldots,\ \Delta t_L);$$
S4. After crawling is finished, filter the page content of each target website using the corresponding extraction mode in F specified in step S1, and store the effective information in a database or export it as an Excel sheet.
Further, in step S3, the request window size window(t) starts from 0 and grows exponentially with base a until it reaches a predetermined threshold $th_1$, after which it switches to linear growth with coefficient k. If at time $t=t_m$ the access of a subtask is rejected, the sliding window size window(t) is set to 0, the threshold $th_n$ is set to 1/2 of the previous threshold $th_{n-1}$, and exponential growth is re-entered. If at time $t=t_m$ an access timeout occurs, the request window size window(t) and the threshold are both reduced to 1/2 of their values, and linear growth continues.
The beneficial effects of the invention are as follows: the invention provides a crawler method based on permutation mapping and an adaptive rate, which improves crawler efficiency as far as possible while ensuring the safety of the crawler.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the request window structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the window size change under speed-control-based exception handling according to an embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical solution and provides a detailed implementation and specific operating process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides a multi-target-oriented social public safety risk data acquisition method. In this embodiment, the first 200 pieces of information on the main page of each major mainstream news website are to be crawled, where each website is a crawling target and each piece of information is a subtask. As shown in FIG. 1, the method comprises the following steps:
S1. First, determine the target websites to be crawled. The mainstream news websites selected in this embodiment are four crawling targets: People's Daily Online, Xinhua News Agency, China Radio International and China Daily. Define the subtask matrix as:
$$A=\begin{bmatrix} a_{d1} & a_{s1} & a_{t1} & a_{e1} \\ a_{d2} & a_{s2} & a_{t2} & a_{e2} \\ \vdots & \vdots & \vdots & \vdots \\ a_{d200} & a_{s200} & a_{t200} & a_{e200} \end{bmatrix}$$
where $a_{dj}$ denotes the j-th subtask of People's Daily Online, $a_{sj}$ the j-th subtask of Xinhua News Agency, $a_{tj}$ the j-th subtask of China Radio International, and $a_{ej}$ the j-th subtask of China Daily; j is a sequence number, $j=1,2,\ldots,200$.
Formulate a corresponding extraction mode for each target website, such as tag positioning or range searching, customized to the site. Define the extraction-mode matrix as:
$$F=[f_a\ f_b\ f_c\ f_d]$$
where $f_a$ is the extraction mode for People's Daily Online, $f_b$ for Xinhua News Agency, $f_c$ for China Radio International, and $f_d$ for China Daily.
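The data structures of step S1 can be sketched in Python as follows. This is only an illustrative sketch: the site keys `d`, `s`, `t`, `e` mirror the subscripts used above, while `build_subtask_matrix` and `extract_by_tag` are hypothetical names, and the placeholder extraction rule stands in for a real site-specific rule.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical short site keys mirroring the subscripts d, s, t, e in the text.
SITES: List[str] = ["d", "s", "t", "e"]
Q = 200  # number of items crawled per site in this embodiment

Subtask = Tuple[str, int]  # (site key, item index j)


def build_subtask_matrix(sites: List[str], q: int) -> List[List[Subtask]]:
    """Return A as q rows; row j holds the j-th subtask of every site."""
    return [[(site, j) for site in sites] for j in range(1, q + 1)]


def extract_by_tag(html: str) -> str:
    """Placeholder extraction rule; a real rule would locate tags or ranges."""
    return html.strip()


# F maps each target site to its extraction rule (all identical here).
F: Dict[str, Callable[[str], str]] = {site: extract_by_tag for site in SITES}

A = build_subtask_matrix(SITES, Q)
```

Each row of `A` then corresponds to one frame of four subtasks, one per target site.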
S2. Define the permutation
$$\pi=\begin{pmatrix} 1 & 2 & 3 & 4 \\ k_1 & k_2 & k_3 & k_4 \end{pmatrix}$$
where $k_1$, $k_2$, $k_3$, $k_4$ is a rearrangement of the sequence numbers 1, 2, 3, 4. Map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a=[A_1\pi,\ A_2\pi,\ \ldots,\ A_{200}\pi]=[s_1,\ s_2,\ \ldots,\ s_{800}]$$
where $A_w$ denotes the w-th row of the subtask matrix A, $w=1,2,\ldots,200$; $A_w\pi$ is the subtask frame structure; and $s_1,s_2,\ldots,s_{800}$ are the subtasks reordered by the mapping. The permutation $\pi$ randomly maps each row of subtasks into a frame structure, ensuring that the subtasks of any single target are distributed uniformly in the time domain.
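A minimal Python sketch of the permutation mapping in step S2 is given below. The function names are hypothetical. Because the text describes the mapping as random, the sketch assumes that a fresh permutation is drawn for each frame when no fixed permutation is supplied; a single fixed permutation can be passed instead.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_permutation(row: Sequence[T], perm: Sequence[int]) -> List[T]:
    """Reorder one row: position i of the frame receives row[perm[i] - 1]."""
    return [row[k - 1] for k in perm]


def build_queue(matrix: Sequence[Sequence[T]],
                perm: Optional[Sequence[int]] = None,
                rng: Optional[random.Random] = None) -> List[T]:
    """Concatenate the permuted rows (frames) into one task queue."""
    rng = rng or random.Random()
    queue: List[T] = []
    for row in matrix:
        # Assumption: with no fixed permutation, draw a new one per frame.
        p = list(perm) if perm is not None else rng.sample(
            range(1, len(row) + 1), len(row))
        queue.extend(apply_permutation(row, p))
    return queue


# Three frames of four subtasks, reordered with the fixed permutation (3 1 4 2).
matrix = [[(site, j) for site in "dste"] for j in range(1, 4)]
queue = build_queue(matrix, perm=[3, 1, 4, 2])
```

Because every frame contains exactly one subtask per site, two requests to the same site are always separated by requests to other sites, whichever permutation is drawn.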
S3. This embodiment adopts a distributed crawler mode: for a given subtask $s_i$, a request is sent at time $t_i$ and the server response is received at time $t_i+\Delta t_i$. The period $\Delta t_i$ is therefore response waiting time, and it can be used to send further requests, forming a request window, as shown in FIG. 2. The request window slides from left to right over the task queue of step S2, and a subtask is allowed to execute only while it lies within the request window interval, so the execution intervals of all tasks are:
Figure BDA0002790113550000061
where $t_{s_i}$ is the time interval occupied by subtask $s_i$, and $\tau$ is the time the system needs to switch tasks, with $\tau$ much smaller than $\Delta t_{s_i}$. The total time consumption is:
$$t_{sum}=799\tau+\max(\Delta t_1-799\tau,\ \Delta t_2-798\tau,\ \ldots,\ \Delta t_{800}).$$
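The total-time expression can be checked numerically. The sketch below assumes that subtask $s_i$ is dispatched at time $(i-1)\tau$ and completes $\Delta t_i$ later, and that the window never blocks a request. This is one reading that is consistent with the expression above, not a statement of the interval formula itself. Both function names are hypothetical.

```python
from typing import Sequence


def total_time(delta_t: Sequence[float], tau: float) -> float:
    """Total time in the form used in the text, for n subtasks.

    (n - 1) * tau + max_i(delta_t_i - (n - i) * tau), with i starting at 1.
    """
    n = len(delta_t)
    return (n - 1) * tau + max(
        dt - (n - i) * tau for i, dt in enumerate(delta_t, start=1))


def total_time_direct(delta_t: Sequence[float], tau: float) -> float:
    """Equivalent form: the latest finishing instant over all subtasks."""
    return max((i - 1) * tau + dt for i, dt in enumerate(delta_t, start=1))
```

With 800 subtasks the first function reduces to the 799τ form, and the two functions agree because $(i-1)\tau+\Delta t_i = (n-1)\tau + (\Delta t_i-(n-i)\tau)$ for every i.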
As shown in FIG. 1, if a task fails during execution, it may be re-executed; if it still fails after T attempts, it is not attempted again and execution continues with the next task.
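The retry rule can be sketched as a small helper. The name `run_with_retries` and the choice to treat any raised exception as a failed attempt are assumptions made for illustration.

```python
from typing import Callable, Optional, TypeVar

R = TypeVar("R")


def run_with_retries(task: Callable[[], R], max_attempts: int) -> Optional[R]:
    """Run task up to max_attempts times; return None if every attempt fails."""
    for _ in range(max_attempts):
        try:
            return task()
        except Exception:
            continue  # failed attempt; retry until the budget T is spent
    return None  # give up so the caller can move on to the next subtask
```

Returning `None` rather than raising lets the scheduler skip the failed subtask and continue with the rest of the queue.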
During task execution, a subtask request may be rejected by the server, a subtask may time out, and so on. This embodiment therefore provides an exception handling mode based on speed control, which keeps the whole process safe and stable by dynamically adjusting the size of the request window. As shown in FIG. 3, the process can be abstracted as the following function:
Figure BDA0002790113550000071
The request window size window(t) starts from 0 and grows exponentially with base a until it reaches a predetermined threshold $th_1$, after which it switches to linear growth with coefficient k. Suppose that at time $t=t_m$ (shown as $t_2$ in the above formula) the access of a subtask is rejected: the sliding window size window(t) is set to 0, the threshold $th_n$ ($th_2$ in the above formula) is set to 1/2 of the previous threshold $th_{n-1}$ ($th_1$ in the above formula), and exponential growth is re-entered. Suppose that at time $t=t_m$ (shown as $t_4$ in the above formula) an access timeout occurs: the request window size window(t) and the threshold are both reduced to 1/2 of their values, and linear growth continues.
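The window rules can be sketched as a small discrete-step state machine. This is a sketch under stated assumptions, not the function from the formula: the class name, the default values of a, k and the threshold, the step that opens a window of 0 to size 1 before multiplying by a, and the clamping of the window to the threshold when the exponential phase ends are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveWindow:
    a: float = 2.0           # base of the exponential phase (assumed value)
    k: float = 1.0           # increment per step in the linear phase (assumed)
    threshold: float = 16.0  # initial threshold th_1 (assumed value)
    size: float = 0.0        # current request window size
    linear: bool = False     # True once the window is in the linear phase

    def on_success(self) -> float:
        """Grow the window after a step without rejection or timeout."""
        if self.linear:
            self.size += self.k
        else:
            # Assumption: a window of 0 first opens to 1, then multiplies by a.
            self.size = 1.0 if self.size == 0 else self.size * self.a
            if self.size >= self.threshold:
                self.size = self.threshold
                self.linear = True
        return self.size

    def on_rejected(self) -> float:
        """Rejection: window to 0, threshold halved, exponential phase again."""
        self.threshold /= 2
        self.size = 0.0
        self.linear = False
        return self.size

    def on_timeout(self) -> float:
        """Timeout: window and threshold both halved, linear growth continues."""
        self.size /= 2
        self.threshold /= 2
        self.linear = True
        return self.size
```

The behaviour resembles slow start followed by congestion avoidance in transport protocols: the harsher signal (rejection) restarts from zero, while the milder one (timeout) only halves the current rate.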
S4. Filter the page content of each target website using the corresponding extraction mode in F specified in step S1, and save the effective information in a database or export it as an Excel sheet.
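Step S4 can be illustrated with the Python standard library. The sketch assumes a tag-positioning rule that collects the text of every occurrence of one tag and stores it in SQLite; the class, function and table names are hypothetical, and an actual extraction rule would be specific to each site.

```python
import sqlite3
from html.parser import HTMLParser
from typing import List


class TagTextExtractor(HTMLParser):
    """Collect the text inside every occurrence of one tag (tag positioning)."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.items: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.items.append("")  # start a new item for this element

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def extract(html: str, tag: str) -> List[str]:
    """Return the non-empty, stripped text of each matching element."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.items if text.strip()]


def save(conn: sqlite3.Connection, site: str, items: List[str]) -> None:
    """Persist the extracted items for one site."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (site TEXT, content TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(site, item) for item in items])
    conn.commit()
```

A per-site rule from F would replace `extract` here, while `save` stays the same for all sites.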
Those skilled in the art can make various corresponding changes and modifications based on the above technical solutions and concepts, and all such changes and modifications shall fall within the protection scope of the present invention.

Claims (2)

1. A multi-target-oriented social public safety risk data acquisition method is characterized by comprising the following steps:
S1. The first Q pieces of information of each target website are to be crawled, where each website is a crawling target and each piece of information is a subtask; determine the target websites to be crawled, and let their number be S; define the subtask matrix as:
$$A=\begin{bmatrix} a_{11} & a_{21} & \cdots & a_{S1} \\ a_{12} & a_{22} & \cdots & a_{S2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1Q} & a_{2Q} & \cdots & a_{SQ} \end{bmatrix}$$
where $a_{xy}$ denotes the y-th subtask of the x-th target website, $x=1,2,\ldots,S$, $y=1,2,\ldots,Q$;
formulate a corresponding extraction mode for each target website, the extraction-mode matrix being defined as:
$$F=[f_1\ f_2\ f_3\ f_4\ \ldots\ f_S]$$
where $f_x$, $x=1,2,\ldots,S$, denotes the extraction mode of the x-th target website;
S2. Define the permutation
$$\pi=\begin{pmatrix} 1 & 2 & \cdots & S \\ k_1 & k_2 & \cdots & k_S \end{pmatrix}$$
and map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a=[A_1\pi,\ A_2\pi,\ \ldots,\ A_Q\pi]=[s_1,\ s_2,\ \ldots,\ s_L]$$
where $k_1,k_2,\ldots,k_S$ is a rearrangement of the sequence numbers $1,2,\ldots,S$; $A_w$ denotes the w-th row of the subtask matrix A, $w=1,2,\ldots,Q$; $A_w\pi$ is the subtask frame structure; $s_1,s_2,\ldots,s_L$ are the subtasks reordered by the mapping; and $L=S\times Q$;
S3. Start crawling: for a given subtask $s_i$, $i=1,2,\ldots,L$, a request is sent at time $t_i$ and the server response is received at time $t_i+\Delta t_i$; the period $\Delta t_i$ is the response waiting time, and it is used to send further requests, forming a request window; the request window slides from left to right over the task queue of step S2, and a subtask is allowed to execute only while it lies within the request window interval, so the execution intervals of all tasks are:
Figure FDA0002790113540000021
where $t_{s_i}$ is the time interval occupied by subtask $s_i$ and $\tau$ is the time the system needs to switch tasks; the total time consumption is:
$$t_{sum}=799\tau+\max(\Delta t_1-799\tau,\ \Delta t_2-798\tau,\ \ldots,\ \Delta t_L);$$
S4. After crawling is finished, filter the page content of each target website using the corresponding extraction mode in F specified in step S1, and store the effective information in a database or export it as an Excel sheet.
2. The method according to claim 1, wherein in step S3, the request window size window(t) starts from 0 and grows exponentially with base a until it reaches a predetermined threshold $th_1$, after which it switches to linear growth with coefficient k; if at time $t=t_m$ the access of a subtask is rejected, the sliding window size window(t) is set to 0, the threshold $th_n$ is set to 1/2 of the previous threshold $th_{n-1}$, and exponential growth is re-entered; if at time $t=t_m$ an access timeout occurs, the request window size window(t) and the threshold are both reduced to 1/2 of their values, and linear growth continues.
CN202011312018.7A 2020-11-20 2020-11-20 Multi-objective-oriented social public security risk data acquisition method Active CN112287200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011312018.7A CN112287200B (en) 2020-11-20 2020-11-20 Multi-objective-oriented social public security risk data acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011312018.7A CN112287200B (en) 2020-11-20 2020-11-20 Multi-objective-oriented social public security risk data acquisition method

Publications (2)

Publication Number Publication Date
CN112287200A true CN112287200A (en) 2021-01-29
CN112287200B CN112287200B (en) 2023-12-01

Family

ID=74398427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011312018.7A Active CN112287200B (en) 2020-11-20 2020-11-20 Multi-objective-oriented social public security risk data acquisition method

Country Status (1)

Country Link
CN (1) CN112287200B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115297037A (en) * 2021-04-19 2022-11-04 中国移动通信集团安徽有限公司 Dial testing method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code

Also Published As

Publication number Publication date
CN112287200B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
US7487145B1 (en) Method and system for autocompletion using ranked results
US7499940B1 (en) Method and system for URL autocompletion using ranked results
US5892919A (en) Spell checking universal resource locator (URL) by comparing the URL against a cache containing entries relating incorrect URLs submitted by users to corresponding correct URLs
US8046347B2 (en) Method and apparatus for reconstructing a search query
CN106776768B (en) A kind of URL grasping means of distributed reptile engine and system
US9148329B1 (en) Resource constraints for request processing
US20110145287A1 (en) Predictive Resource Identification and Phased Delivery of Structured Documents
US20100281005A1 (en) Asynchronous Database Index Maintenance
US7818686B2 (en) System and method for accelerated web page navigation using keyboard accelerators in a data processing system
US9058392B1 (en) Client state result de-duping
US9928178B1 (en) Memory-efficient management of computer network resources
US20140172828A1 (en) Personalized search library based on continual concept correlation
US20080059507A1 (en) Changing number of machines running distributed hyperlink database
JP2008186157A (en) Webpage re-collection system
CN112287200A (en) Multi-target-oriented social public safety risk data acquisition method
US20180337930A1 (en) Method and apparatus for providing website authentication data for search engine
CN113678122A (en) Caching of potential search results
CN103581349A (en) Domain name resolution method and device
Zhong et al. A web crawler system design based on distributed technology
Bao et al. Metapath-guided credit allocation for identifying representative works
Li et al. FlashSchema: achieving high quality XML schemas with powerful inference algorithms and large-scale schema data
Makwana et al. An efficient technique for web log preprocessing using Microsoft Excel
Liu et al. An automaton-based index scheme supporting twig queries for on-demand XML data broadcast
CN108763583A (en) A kind of microblog hot topic extracting method and system based on keyword search
CN108268517A (en) The management method and system of label in database

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant