CN112287200A - Multi-target-oriented social public safety risk data acquisition method
- Publication number: CN112287200A
- Application number: CN202011312018.7A
- Authority: CN (China)
- Prior art keywords: subtask, target, crawling, window, time
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-target-oriented social public safety risk data acquisition method. The method first determines multiple crawling targets and formulates an extraction rule for each target, then orders all subtasks by permutation mapping to form multiple short time sequences, and executes the subtasks in sequence using a sliding window of adaptive size until crawling is finished. The method provides a crawler approach based on permutation mapping and an adaptive rate, and aims to make the crawler as efficient as possible while keeping it safe.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a multi-target-oriented social public safety risk data acquisition method.
Background
With the advent of the big data era, the volume of data on the internet has grown rapidly, and it includes a large amount of data related to social public safety risks. This data is large in volume, high in dimensionality and drawn from a wide range of sources, so traditional collection methods designed for ordinary internet information can hardly meet current requirements. How to collect social public safety risk data quickly has therefore become important.
Social public safety risk data is of great significance for emergency response and for risk prevention and control by the state and the government. Its main sources are social network sites, mainstream news sites, government announcement sites and public opinion information publishing sites. At present, for different data sources and targets, a web crawler is used with purpose-designed crawling rules to obtain the target internet resources. The crawler selectively accesses network resources and their related links, then analyzes and filters them to extract the effective information, which is stored permanently for subsequent query and use.
Search engines take protective measures for their servers, namely limiting the number of times a user may access them within a given period. To avoid violating these protection rules, an effective approach is to limit the number of accesses per unit time, but this reduces crawler efficiency; with today's huge data volumes it costs a great deal of time, and the data easily loses its timeliness. Most traditional crawler methods consider only crawling efficiency, yet most websites now adopt strong anti-crawling mechanisms, which raises the problem of crawler safety.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a multi-target-oriented social public safety risk data acquisition method that makes the crawler as efficient as possible while keeping it safe.
To achieve this purpose, the invention adopts the following technical solution:
A multi-target-oriented social public safety risk data acquisition method comprises the following steps:
S1, the first Q pieces of information of each target website are to be crawled, where each website is a crawling target and each piece of information is a subtask; determine the target websites to be crawled and let their number be S; define the subtask matrix as:
$$A = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{S1} \\ a_{12} & a_{22} & \cdots & a_{S2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1Q} & a_{2Q} & \cdots & a_{SQ} \end{bmatrix}$$
where $a_{xy}$ denotes the y-th subtask of the x-th target website, x = 1, 2, ..., S, y = 1, 2, ..., Q;
formulate a corresponding extraction mode for each target website, the matrix of extraction modes being defined as:
$$F = [f_1\ f_2\ f_3\ f_4\ \cdots\ f_S];$$
where $f_x$, x = 1, 2, ..., S, denotes the extraction mode of the x-th target website;
S2, define the permutation
$$\pi = \begin{pmatrix} 1 & 2 & \cdots & S \\ k_1 & k_2 & \cdots & k_S \end{pmatrix}$$
and map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a = [A_1\pi, A_2\pi, \ldots, A_Q\pi] = [s_1, s_2, \ldots, s_L]$$
where $k_1, k_2, \ldots, k_S$ is a rearrangement of the sequence numbers 1, 2, ..., S; $A_w$ denotes the w-th row of the subtask matrix A, w = 1, 2, ..., Q; $A_w\pi$ is the subtask frame structure; $s_1, s_2, \ldots, s_L$ are the subtasks reordered by the mapping; and L = S × Q;
S3, start crawling: for each subtask $s_i$, i = 1, 2, ..., L, a request is sent at time $t_i$ and the server response is received at time $t_i + \Delta t_i$, so the period $\Delta t_i$ is the response waiting time; further requests are sent during $\Delta t_i$, forming a request window; the request window slides from left to right over the task queue of step S2, and a subtask may be executed only while it lies within the request window, so the execution intervals of the tasks are:
$$t_{s_i} = [(i-1)\tau,\ (i-1)\tau + \Delta t_i], \quad i = 1, 2, \ldots, L$$
where $t_{s_i}$ is the time interval occupied by subtask $s_i$ and τ is the time the system needs to switch tasks; the total time consumption is:
$$t_{sum} = (L-1)\tau + \max(\Delta t_1 - (L-1)\tau,\ \Delta t_2 - (L-2)\tau,\ \ldots,\ \Delta t_L);$$
S4, after crawling is finished, filter the page content of each target website using the corresponding extraction mode in F defined in step S1, and store the effective information in a database or output it as an Excel spreadsheet.
Further, in step S3, the request window size window(t) grows exponentially with base a, starting from 0, until it reaches a preset threshold $th_1$, after which it grows linearly with coefficient k. If at time $t = t_m$ the access of a subtask is rejected, the sliding window size window(t) is set to 0, the threshold $th_n$ is set to 1/2 of the previous threshold $th_{n-1}$, and exponential growth starts again. If at time $t = t_m$ an access timeout occurs, both the request window size window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.
The beneficial effect of the invention is as follows: the method provides a crawler approach based on permutation mapping and an adaptive rate, which makes the crawler as efficient as possible while keeping it safe.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a structure of a request window according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of how the window size changes under speed-control-based exception handling according to an embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. This embodiment is based on the technical solution above and gives a detailed implementation and specific operating process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides a multi-target-oriented social public safety risk data acquisition method. In this embodiment, the first 200 pieces of information on the main page of each major mainstream news website are to be crawled, where each website is a crawling target and each piece of information is a subtask. As shown in FIG. 1, the method comprises the following steps:
S1, first determine the target websites to be crawled. The mainstream news websites selected in this embodiment are four targets: People's Daily Online, Xinhua News Agency, China Radio International and China Daily. Define the subtask matrix as:
$$A = \begin{bmatrix} a_{d1} & a_{s1} & a_{t1} & a_{e1} \\ a_{d2} & a_{s2} & a_{t2} & a_{e2} \\ \vdots & \vdots & \vdots & \vdots \\ a_{d200} & a_{s200} & a_{t200} & a_{e200} \end{bmatrix}$$
where $a_{dj}$ denotes the j-th subtask of People's Daily Online, $a_{sj}$ the j-th subtask of Xinhua News Agency, $a_{tj}$ the j-th subtask of China Radio International, and $a_{ej}$ the j-th subtask of China Daily; j is a sequence number, j = 1, 2, ..., 200.
A corresponding extraction mode, such as tag positioning or range searching, is formulated for each target website and customized to that site. Define the extraction mode matrix as:
$$F = [f_a\ f_b\ f_c\ f_d]$$
where $f_a$ is the extraction mode for People's Daily Online, $f_b$ the extraction mode for Xinhua News Agency, $f_c$ the extraction mode for China Radio International, and $f_d$ the extraction mode for China Daily.
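As a minimal sketch of step S1, the subtask matrix and the extraction-mode matrix could be represented as follows. The site keys `d`, `s`, `t`, `e` follow the subscripts used above; the `Subtask` type, the `build_subtask_matrix` helper and the extraction-mode labels assigned to each site are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class Subtask(NamedTuple):
    site: str   # target key: d, s, t or e, as in the subscripts above
    index: int  # j, the 1-based position of the item on that site


SITES = ["d", "s", "t", "e"]  # the four targets of this embodiment
Q = 200                       # items crawled per target


def build_subtask_matrix(sites: list[str], q: int) -> list[list[Subtask]]:
    """Return A as q rows; row w holds the w-th subtask of every site."""
    return [[Subtask(site, j) for site in sites] for j in range(1, q + 1)]


A = build_subtask_matrix(SITES, Q)

# F holds one extraction mode per target; these labels are placeholders,
# not the modes actually used for each site.
F = {"d": "tag positioning", "s": "tag positioning",
     "t": "range searching", "e": "range searching"}
```

Storing A row by row matches the later use of $A_w$ as "the w-th row", so each row is already one frame's worth of subtasks.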
S2, define the permutation
$$\pi = \begin{pmatrix} 1 & 2 & 3 & 4 \\ k_1 & k_2 & k_3 & k_4 \end{pmatrix}$$
where $k_1, k_2, k_3, k_4$ is a rearrangement of the sequence numbers 1, 2, 3, 4, and map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a = [A_1\pi, A_2\pi, \ldots, A_{200}\pi] = [s_1, s_2, \ldots, s_{800}]$$
where $A_w$ denotes the w-th row of the subtask matrix A, w = 1, 2, ..., 200; $A_w\pi$ is the subtask frame structure; and $s_1, s_2, \ldots, s_{800}$ are the subtasks reordered by the mapping. The permutation π randomly maps each row of subtasks into a frame structure and ensures that the subtasks of any single target are evenly distributed in the time domain.
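The permutation mapping of step S2 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: it applies one randomly drawn permutation to every row, uses 0-based column indices, and represents each subtask as a `(site, w)` pair; the helper name `build_task_queue` is hypothetical.

```python
import random


def build_task_queue(rows, perm):
    """Apply the same permutation to every row of A and concatenate."""
    queue = []
    for row in rows:
        # Position i of the frame takes column perm[i] (0-based) of the row.
        queue.extend(row[k] for k in perm)
    return queue


S, Q = 4, 200
# Row w of A: the w-th subtask of each of the S targets, as (site, w) pairs.
A = [[(x, w) for x in range(1, S + 1)] for w in range(1, Q + 1)]

rng = random.Random(0)          # fixed seed, only for reproducibility
perm = rng.sample(range(S), S)  # a random rearrangement k_1..k_S (0-based)
Sa = build_task_queue(A, perm)
```

Because every frame of S consecutive entries contains exactly one subtask per target, two requests to the same site are always S positions apart in the queue.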
S3, this embodiment adopts a distributed crawler mode: for each subtask $s_i$, a request is sent at time $t_i$ and the server response is received at time $t_i + \Delta t_i$. The period $\Delta t_i$ is therefore the response waiting time, and it can be used to send further requests, forming a request window, as shown in FIG. 2. The request window slides from left to right over the task queue of step S2, and a subtask may be executed only while it lies within the request window, so the execution intervals of the tasks are:
$$t_{s_i} = [(i-1)\tau,\ (i-1)\tau + \Delta t_i], \quad i = 1, 2, \ldots, 800$$
where $t_{s_i}$ is the time interval occupied by subtask $s_i$ and τ is the time the system needs to switch tasks, with τ far smaller than $\Delta t_i$. The total time consumption is:
$$t_{sum} = 799\tau + \max(\Delta t_1 - 799\tau,\ \Delta t_2 - 798\tau,\ \ldots,\ \Delta t_{800}).$$
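The total-time expression can be checked numerically. The sketch below assumes that request i is issued at time (i-1)τ and that the window is large enough to keep every request in flight, which is what the form of the $t_{sum}$ expression implies; the function names and the sample values are made up for illustration.

```python
def total_time(delta_t, tau):
    """Finish time of the last response when request i starts at (i-1)*tau."""
    return max(i * tau + dt for i, dt in enumerate(delta_t))


def total_time_closed_form(delta_t, tau):
    """Same quantity, written in the form of the t_sum expression."""
    n = len(delta_t)
    return (n - 1) * tau + max(dt - (n - 1 - i) * tau
                               for i, dt in enumerate(delta_t))


def sequential_time(delta_t, tau):
    """Baseline: wait for each response before switching to the next task."""
    return sum(delta_t) + (len(delta_t) - 1) * tau


delta_t = [0.5, 0.8, 0.3, 1.2]  # made-up response times, in seconds
tau = 0.01                      # made-up task-switch time, in seconds

overlapped = total_time(delta_t, tau)
baseline = sequential_time(delta_t, tau)
```

With these sample values the overlapped schedule finishes when the slowest late request returns, while the sequential baseline pays for every response time in full.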
As shown in FIG. 1, if a task fails during execution, it may be re-executed; if it still fails after T attempts, it is not attempted again and execution continues with the next task.
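The retry rule can be sketched as a small helper. The name `run_with_retries`, the use of exceptions to signal failure, and the `flaky` example task are assumptions for illustration; the source only states that a task is abandoned after T failed attempts.

```python
def run_with_retries(task, max_attempts):
    """Call task() until it succeeds or max_attempts is used up.

    Returns (True, result) on success and (False, None) once every
    attempt has failed, so the caller can move on to the next subtask.
    """
    for _ in range(max_attempts):
        try:
            return True, task()
        except Exception:
            continue  # a failed attempt; try again if attempts remain
    return False, None


attempts = []


def flaky():
    """Hypothetical subtask that fails twice and then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated failure")
    return "page content"


T = 3
ok, result = run_with_retries(flaky, T)
```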
During task execution, a subtask request may be rejected by the server, a subtask may time out, and so on. This embodiment therefore provides an exception handling mode based on speed control, which keeps the whole job in a safe and stable state by dynamically adjusting the size of the request window. As shown in FIG. 3, the process can be abstracted as the following function:
The request window size window(t) grows exponentially with base a, starting from 0, until it reaches a preset threshold $th_1$, after which it grows linearly with coefficient k. Suppose that at time $t = t_m$ (time $t_2$ in the above formula) the access of a subtask is rejected: the sliding window size window(t) is set to 0, the threshold $th_n$ ($th_2$ in the above formula) is set to 1/2 of the previous threshold $th_{n-1}$ ($th_1$ in the above formula), and exponential growth starts again. Suppose that at time $t = t_m$ (time $t_4$ in the above formula) an access timeout occurs: both the request window size window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.
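The window adjustment rule can be sketched as a discrete-step model. Several details here are assumptions rather than part of the source: time advances in unit steps, the first exponential step goes from 0 to 1, the window is capped at the threshold when it crosses it, and the class and method names are hypothetical.

```python
class RequestWindow:
    """Discrete-step model of the request window size window(t)."""

    def __init__(self, a: float, k: float, threshold: float):
        self.a = a                  # base of the exponential phase
        self.k = k                  # coefficient of the linear phase
        self.threshold = threshold  # current threshold th_n
        self.size = 0.0             # current window size
        self.linear = False         # True once growth has become linear

    def grow(self) -> None:
        """Advance one step in which no error occurred."""
        if self.linear:
            self.size += self.k
        else:
            # Assumption: the first exponential step goes from 0 to 1.
            self.size = 1.0 if self.size == 0 else self.size * self.a
            if self.size >= self.threshold:
                self.size = self.threshold
                self.linear = True

    def on_rejected(self) -> None:
        """Server refused a request: restart from 0 with half the threshold."""
        self.size = 0.0
        self.threshold /= 2
        self.linear = False

    def on_timeout(self) -> None:
        """A request timed out: halve size and threshold, stay linear."""
        self.size /= 2
        self.threshold /= 2
        self.linear = True


window = RequestWindow(a=2.0, k=1.0, threshold=8.0)
history = []
for _ in range(6):
    window.grow()
    history.append(window.size)
```

A rejection is treated as the stronger signal, so it resets the window completely, whereas a timeout only halves it and skips the exponential phase.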
S4, filter the page content of each target website using the corresponding extraction mode in F defined in step S1, and save the effective information in a database or output it as an Excel spreadsheet.
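Step S4 can be sketched with the Python standard library. This is an illustration under assumptions: "tag positioning" is modeled as collecting the text of one HTML tag, the database is an in-memory SQLite database, and the sample page, the `items` table and the helper names are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.items: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def extract(html: str, tag: str) -> list[str]:
    """Return the non-empty, stripped texts of every <tag> element."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.items if text.strip()]


def save(conn: sqlite3.Connection, site: str, titles: list[str]) -> None:
    """Store the extracted items for one target website."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (site TEXT, title TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(site, title) for title in titles])
    conn.commit()


# Made-up page content standing in for a crawled news page.
page = ("<ul><li><h3> Flood warning issued </h3></li>"
        "<li><h3>Road closed</h3><p>advertisement</p></li></ul>")
titles = extract(page, "h3")
conn = sqlite3.connect(":memory:")
save(conn, "d", titles)
```

Writing an Excel file would need a third-party library, so the sketch covers only the database branch of step S4.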
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.
Claims (2)
1. A multi-target-oriented social public safety risk data acquisition method, characterized by comprising the following steps:
S1, the first Q pieces of information of each target website are to be crawled, where each website is a crawling target and each piece of information is a subtask; determine the target websites to be crawled and let their number be S; define the subtask matrix as:
$$A = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{S1} \\ a_{12} & a_{22} & \cdots & a_{S2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1Q} & a_{2Q} & \cdots & a_{SQ} \end{bmatrix}$$
where $a_{xy}$ denotes the y-th subtask of the x-th target website, x = 1, 2, ..., S, y = 1, 2, ..., Q;
formulate a corresponding extraction mode for each target website, the matrix of extraction modes being defined as:
$$F = [f_1\ f_2\ f_3\ f_4\ \cdots\ f_S];$$
where $f_x$, x = 1, 2, ..., S, denotes the extraction mode of the x-th target website;
S2, define the permutation
$$\pi = \begin{pmatrix} 1 & 2 & \cdots & S \\ k_1 & k_2 & \cdots & k_S \end{pmatrix}$$
and map and combine each row of the subtask matrix A to form a new task queue $S_a$:
$$S_a = [A_1\pi, A_2\pi, \ldots, A_Q\pi] = [s_1, s_2, \ldots, s_L]$$
where $k_1, k_2, \ldots, k_S$ is a rearrangement of the sequence numbers 1, 2, ..., S; $A_w$ denotes the w-th row of the subtask matrix A, w = 1, 2, ..., Q; $A_w\pi$ is the subtask frame structure; $s_1, s_2, \ldots, s_L$ are the subtasks reordered by the mapping; and L = S × Q;
S3, start crawling: for each subtask $s_i$, i = 1, 2, ..., L, a request is sent at time $t_i$ and the server response is received at time $t_i + \Delta t_i$, so the period $\Delta t_i$ is the response waiting time; further requests are sent during $\Delta t_i$, forming a request window; the request window slides from left to right over the task queue of step S2, and a subtask may be executed only while it lies within the request window, so the execution intervals of the tasks are:
$$t_{s_i} = [(i-1)\tau,\ (i-1)\tau + \Delta t_i], \quad i = 1, 2, \ldots, L$$
where $t_{s_i}$ is the time interval occupied by subtask $s_i$ and τ is the time the system needs to switch tasks; the total time consumption is:
$$t_{sum} = (L-1)\tau + \max(\Delta t_1 - (L-1)\tau,\ \Delta t_2 - (L-2)\tau,\ \ldots,\ \Delta t_L);$$
S4, after crawling is finished, filter the page content of each target website using the corresponding extraction mode in F defined in step S1, and store the effective information in a database or output it as an Excel spreadsheet.
2. The method according to claim 1, wherein in step S3 the request window size window(t) grows exponentially with base a, starting from 0, until it reaches a preset threshold $th_1$, after which it grows linearly with coefficient k; if at time $t = t_m$ the access of a subtask is rejected, the sliding window size window(t) is set to 0, the threshold $th_n$ is set to 1/2 of the previous threshold $th_{n-1}$, and exponential growth starts again; if at time $t = t_m$ an access timeout occurs, both the request window size window(t) and the threshold are reduced to 1/2 of their values, and linear growth continues.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011312018.7A CN112287200B (en) | 2020-11-20 | 2020-11-20 | Multi-objective-oriented social public security risk data acquisition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011312018.7A CN112287200B (en) | 2020-11-20 | 2020-11-20 | Multi-objective-oriented social public security risk data acquisition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112287200A (en) | 2021-01-29 |
CN112287200B (en) | 2023-12-01 |
Family
ID=74398427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011312018.7A (Active; CN112287200B (en)) | Multi-objective-oriented social public security risk data acquisition method | 2020-11-20 | 2020-11-20 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287200B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115297037A (en) * | 2021-04-19 | 2022-11-04 | 中国移动通信集团安徽有限公司 | Dial testing method, device, equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108664559A (en) * | 2018-03-30 | 2018-10-16 | 中山大学 | A kind of automatic crawling method of website and webpage source code |
Also Published As
Publication number | Publication date |
---|---|
CN112287200B (en) | 2023-12-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||