CN110941788A - Cloud environment distributed Web page extraction and analysis system and method for edge computing - Google Patents
- Publication number
- CN110941788A (application CN201911301759.2A)
- Authority
- CN
- China
- Prior art keywords
- crawling
- computing
- page
- task
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of network communication and provides a cloud environment distributed Web page extraction and analysis system and method for edge computing. A central management node schedules computing nodes to complete work tasks according to historical crawling efficiency, which improves the allocation of resources. Crawling strategy analysis, crawling depth analysis and URL de-duplication effectively increase the crawling speed of a website while ensuring that vulnerability scanning results remain correct. The distributed design facilitates horizontal scaling of scanning capacity and reasonable use of computing resources, and edge computing brings faster transmission and response, which addresses the waste of node computing resources in traditional cloud computing systems, improves the capacity ratio and significantly improves resource utilization.
Description
Technical Field
The invention relates to the technical field of network communication, in particular to a cloud environment distributed Web page extraction and analysis system and method for edge computing.
Background
Cloud computing technology is mainly used to support the daily automated office work of the relevant departments and their staff. It supports social development, strengthens intelligent management, integrates urban planning, and improves the efficiency with which work is executed and handled. The cloud computing software service model integrates software and hardware resources, lowers client requirements and unifies the maintenance platform. Applied to the construction of e-government platforms, it allows data resources to be shared to the greatest extent, saves construction and operating costs, increases the load capacity of the platform and reduces maintenance difficulty. Cloud computing is also characterized by openness, distributed computing and storage, the absence of boundaries, virtualization, and the separation of data ownership from data management rights, and it brings new security risks such as data loss and leakage, vulnerabilities in shared technology, and insecure application program interfaces.
As cloud computing environments mature and come into use, and as the cloud platforms continue to expand, websites in the cloud environment inevitably contain vulnerabilities. Hackers often exploit these vulnerabilities to steal important government information, which threatens information security and damages the image and public credibility of the country. Prior-art Web vulnerability scanning methods essentially use a scanning tool or a hardware device to scan the crawled website for vulnerabilities. For a typical government office system (more than 20,000 page files on average) the processing speed is too low, and hours or even days generally pass between the start of scanning and the end of analysis, since scanning any website presupposes that all pages under it have been fetched. Because Web page crawling is inefficient, the subsequent vulnerability scanning takes a large amount of time. The problem that urgently needs to be solved is therefore how to increase the crawling speed of a website while ensuring that the vulnerability scanning results remain correct.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a cloud environment distributed Web page extraction and analysis system and method for edge computing.
In order to achieve the purpose, the invention provides the following technical scheme:
A cloud environment distributed Web page extraction and analysis system for edge computing adopts distributed deployment and comprises:
the task monitoring unit, which monitors the scanning tasks submitted by the user and puts new scanning tasks into a message queue;
the central management node, which distributes work tasks to the computing nodes, collects the result data produced by the computing nodes, summarizes and analyzes the result data, and then performs data persistence processing;
the plurality of computing nodes, which complete the work tasks distributed by the central management node, carry out the computation those tasks require, and send the computing results to the central management node;
the crawling strategy analysis module, which analyzes the page content, uses the number of valid URLs and the number of invalid URLs in the page as factors to calculate the depth value of the analyzed page and the depth values of the pages it links to, and dynamically determines the crawling depth;
the crawling depth analysis module, which identifies and analyzes the content contained in a Web page and determines, according to the branch path depth empirical value, whether to end deep crawling of the identified page;
and the page crawling de-duplication module, which improves the de-duplication efficiency of the system by constructing a URL comparison binary tree: crawled links are inserted into the URL comparison binary tree, and for a new crawling task the parameter values in the URL are removed first and the URL is then compared against the tree to distinguish duplicate, similar, loop and other relations among URLs.
Further, the crawling strategy analysis module analyzes the content of each crawled page; when the depth value of a hyperlink contained in the page is smaller than the depth value of the current page, it stops crawling the current page and returns to the upper-level page to process other pages.
Further, when the content of the analyzed page consists mainly of multimedia and documents, the crawling depth analysis module automatically stops crawling and analyzing the hyperlinks in the page, returns to the upper-level page to continue crawling other pages, and records the branch path depth empirical value of the analyzed page.
Further, in the page crawling de-duplication module, a URL comparison binary tree is established by dividing the website URL into access positions, each access position being a node of the binary tree, so that the access paths existing in the website are built into the binary tree; each newly discovered page is compared with the URL comparison binary tree and is crawled only when it is judged not to have been crawled.
Further, the central management node also comprises a resource allocation module, which determines the amount of computing-node resources to allocate according to the historical crawling efficiency and the real-time crawling efficiency; after the crawling of a website is completed, the recorded crawling efficiency of the website is updated from the efficiency of this crawl and the historical crawling efficiency, providing empirical guidance for the next crawl.
A cloud environment distributed Web page extraction and analysis method for edge computing comprises the following steps:
Step 1, a user submits a scanning task through a Web page, and the scanning task is stored directly in a database.
Step 2, the task monitoring unit monitors the scanning tasks submitted by the user; when it finds a new scanning task, it puts the new task into a message queue; the central management node reads unexecuted tasks from the message queue and, according to the current running state of each computing node, selects idle computing nodes and distributes the tasks to them.
Step 3, each computing node reports to the central management node on the basis of real-time crawling efficiency and historical experience, requesting the central management node to increase or decrease the number of computing nodes. When the real-time crawling delay reported by the computing nodes lies in an interval close to or better than the historical empirical value, the central management node slowly increases the number of computing nodes until the crawling delay reported by the computing nodes is lower than the historical empirical value, at which point it stops increasing the number of computing nodes; when the real-time crawling delay reported by the computing nodes is lower than a threshold of the historical empirical value, the number of computing nodes allocated to the website is reduced.
Step 4, each computing node returns its computing result to the central management node after executing the task, and the central management node collects and analyzes the computing results of the plurality of computing nodes and performs data persistence processing.
Furthermore, a cache mechanism is adopted when the central management node schedules work tasks: several tasks are distributed to an idle computing node at one time, which reduces the data-reading burden of task scheduling.
In conclusion, the invention has the following beneficial effects:
According to the invention, the whole system is built in a distributed + edge computing mode. The distributed mode facilitates horizontal scaling of scanning capacity and reasonable use of computing resources; edge computing brings faster transmission and response, addresses the waste of node computing resources in traditional cloud computing systems, improves the capacity ratio and significantly improves resource utilization. The multi-level storage technology reduces the time spent accessing the DNS server. The existing breadth-first and depth-first crawling strategies are optimized: the problem that page extraction cannot backtrack when the scanning depth is judged automatically is solved, as is the problem that a manually set scanning depth either fails to capture all pages of the website or prevents the crawling task from finishing within a given time. The de-duplication technique based on the comparison binary tree greatly improves URL de-duplication efficiency. The amount of resources that can be allocated to computing nodes when crawling a given website is determined from two factors, the historical crawling efficiency and the real-time crawling efficiency, which improves the allocation of resources.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic view of the construction method of the present invention;
FIG. 3 is a schematic diagram of a cloud computing architecture according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIGS. 1 to 3, a cloud environment distributed Web page extraction and analysis system for edge computing adopts distributed deployment and includes: a task monitoring unit, a central management node, a plurality of computing nodes, a crawling strategy analysis module, a crawling depth analysis module and a page crawling de-duplication module.
The task monitoring unit monitors the scanning tasks submitted by the user and puts new scanning tasks into the message queue.
The central management node distributes work tasks to the computing nodes, collects the result data produced by the computing nodes, summarizes and analyzes the result data, and then performs data persistence processing.
The plurality of computing nodes complete the work tasks distributed by the central management node, carry out the computation those tasks require, and send the computing results to the central management node.
The whole system is built in a distributed + edge computing mode. The distributed mode facilitates horizontal scaling of the scanning capability. Edge computing brings faster transmission and response, addresses the waste of node computing resources in traditional cloud computing systems, and forms an operating mode of one central management node plus computing nodes, which improves the capacity ratio and significantly improves resource utilization. The overall distributed + edge computing architecture solves the problem that the computing capability of a single server cannot meet the requirements of complex computation: it moves from the traditional many-to-one computing mode to a many-to-many computing mode and greatly increases the computing capability of the overall system.
The crawling strategy analysis module analyzes the page content. The number of valid URLs and the number of invalid URLs in the page are used as factors to calculate the depth value of the analyzed page and the depth values of the pages it links to, and the crawling depth is determined dynamically. In practice, the number of valid URLs falls as the crawling depth increases, while an excessive crawling depth produces more invalid URLs and wastes computing resources and time, since analyzing an invalid URL page has no practical value. Each time the content of a crawled page is analyzed, if the depth value of a hyperlink contained in the page is smaller than the depth value of the current page, crawling of the current page stops and the crawler returns to the upper-level page to process other pages. This optimizes the existing breadth-first and depth-first crawling strategies: the problem that page extraction cannot backtrack when the scanning depth is judged automatically is solved, as is the problem that a manually set scanning depth either fails to capture all pages of the website or prevents the crawling task from finishing within a given time. Introducing the page depth value addresses two problems in website crawling, namely getting trapped because the crawling depth is too deep and failing to obtain the useful information of the website because the crawling depth is too shallow, and it greatly improves the overall crawling efficiency of the website.
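The backtracking rule described above can be illustrated with a minimal Python sketch. The source states only that the counts of valid and invalid URLs enter the depth value as factors; the ratio used here, and the names `PageStats`, `depth_value` and `should_descend`, are hypothetical choices made for illustration and are not the formula of the invention.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """URL counts gathered while analysing one page (hypothetical structure)."""
    valid_urls: int
    invalid_urls: int


def depth_value(stats: PageStats) -> float:
    """Assumed scoring rule: share of valid URLs among all URLs on the page.

    The source only says both counts act as factors; this ratio is an
    illustrative assumption.
    """
    total = stats.valid_urls + stats.invalid_urls
    return stats.valid_urls / total if total else 0.0


def should_descend(current: PageStats, linked: PageStats) -> bool:
    """Follow a linked page only if its depth value is not smaller than the
    depth value of the current page; otherwise the crawler backs up to the
    upper-level page."""
    return depth_value(linked) >= depth_value(current)
```

With this assumed ratio, a linked page that consists mostly of invalid URLs scores lower than its parent and is not followed, which corresponds to returning to the upper-level page.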
The crawling depth analysis module identifies and analyzes the content contained in a Web page and determines, according to the branch path depth empirical value, whether to end deep crawling of the identified page. In actual use it was found that websites of the same type mostly have the same or a similar overall structure, internal structure and content. When the content of the analyzed page consists mainly of multimedia and documents, crawling and analysis of the hyperlinks in the page stops automatically, the crawler returns to the upper-level page to continue crawling other pages, and the branch path depth empirical value of the analyzed page is recorded. This empirical value guides the crawling depth of subsequent branch pages and websites: an expected depth for the website is calculated from the branch crawling depths completed so far, and the crawling depth of subsequent pages is determined by combining this expected value with the identification and analysis of page content. The crawling depth analysis module effectively accelerates the overall crawling speed of the website.
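As a rough sketch of this module, the Python code below stops a branch when most links on a page point to multimedia or document files and keeps a running expected depth from the branches finished so far. The suffix list, the 0.5 share threshold, the use of a plain mean as the expected depth and the rule of stopping once the current depth exceeds that mean are all assumptions introduced for illustration; the source does not specify them.

```python
from statistics import mean
from typing import List, Optional

# Assumed list of multimedia and document suffixes (illustrative only).
MEDIA_DOC_SUFFIXES = (".jpg", ".png", ".gif", ".mp3", ".mp4",
                      ".pdf", ".doc", ".docx", ".xls")


def is_media_or_document(url: str) -> bool:
    """Classify a link by the file suffix of its path."""
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return path.endswith(MEDIA_DOC_SUFFIXES)


def mostly_media(links: List[str], share_threshold: float = 0.5) -> bool:
    """True when more than share_threshold of the links are media/documents."""
    if not links:
        return False
    hits = sum(1 for link in links if is_media_or_document(link))
    return hits / len(links) > share_threshold


class BranchDepthTracker:
    """Records the depth at which branches ended (the empirical values)."""

    def __init__(self) -> None:
        self.finished_depths: List[int] = []

    def expected_depth(self) -> Optional[float]:
        """Assumed expectation: the mean of the recorded branch depths."""
        return mean(self.finished_depths) if self.finished_depths else None

    def should_stop(self, depth: int, links: List[str]) -> bool:
        """Combine content analysis with the expected depth."""
        if mostly_media(links):
            self.finished_depths.append(depth)  # record the empirical value
            return True
        expected = self.expected_depth()
        return expected is not None and depth > expected
```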
The page crawling de-duplication module improves the de-duplication efficiency of the system by constructing a URL comparison binary tree. Crawled links are inserted into the URL comparison binary tree; for a new crawling task, the parameter values in the URL are removed first, and the URL is then compared against the tree to distinguish duplicate, similar, loop and other relations among URLs. Invalid URLs are filtered out, which reduces the time spent on repeated crawling and improves the overall efficiency of the scanner. The URL comparison binary tree is established by dividing the website URL into access positions, each access position being a node of the binary tree, so that the access paths existing in the website are built into the tree. Each newly discovered page is compared with the URL comparison binary tree and is crawled only when it is judged not to have been crawled. The de-duplication technique based on the comparison binary tree greatly improves URL de-duplication efficiency.
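The idea of splitting a URL into access positions and comparing it after removing parameter values can be sketched in Python as follows. Note that this sketch substitutes a path-segment tree built from nested dictionaries (a trie) for the binary tree named in the source, because the source does not describe how the binary tree is laid out; `access_positions` and `UrlTree` are hypothetical names, and loop detection is not covered.

```python
from typing import Tuple
from urllib.parse import parse_qsl, urlsplit

_END = None  # marker key: a crawled URL ends at this node


def access_positions(url: str) -> Tuple[str, ...]:
    """Split a URL into host, path segments and sorted parameter names.

    Parameter values are dropped, so URLs that differ only in their
    parameter values map to the same tuple.
    """
    parts = urlsplit(url)
    positions = [parts.netloc.lower()]
    positions += [seg for seg in parts.path.split("/") if seg]
    names = sorted({name for name, _ in
                    parse_qsl(parts.query, keep_blank_values=True)})
    if names:
        positions.append("?" + "&".join(names))
    return tuple(positions)


class UrlTree:
    """Tree of access positions; each position is one node."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add_if_new(self, url: str) -> bool:
        """Insert the URL and return True only if it was not seen before."""
        node = self.root
        for position in access_positions(url):
            node = node.setdefault(position, {})
        if _END in node:
            return False
        node[_END] = True
        return True
```

A crawler using this sketch would call `add_if_new` for every discovered link and fetch the page only when the call returns True.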
The central management node further comprises a resource allocation module, which determines the amount of computing-node resources to allocate according to the historical crawling efficiency and the real-time crawling efficiency. In a distributed computing environment, websites differ in server efficiency, service quality, network bandwidth, network environment, network security environment and other factors, so when an inappropriate number of computing nodes is allocated to a website, either the computing nodes cannot perform well or the website cannot be crawled within the specified time. The system therefore records the historical crawling efficiency of each website, adjusts it according to the real-time crawling efficiency of the website, and reduces or increases the number of computing nodes crawling that website. Because the computing nodes are affected by network bandwidth, and in order to reduce the impact on the server whose pages are being extracted, the central management node dynamically adjusts the number of computing nodes by combining the real-time crawling delay reported by each computing node with historical experience. When the real-time crawling delay reported by the computing nodes lies in an interval close to or better than the historical empirical value, the central management node slowly increases the number of computing nodes until the crawling delay reported by the computing nodes is lower than the historical empirical value, at which point it stops increasing the number of computing nodes; when the real-time crawling delay reported by the computing nodes is lower than a threshold of the historical empirical value, the number of computing nodes extracting the pages of the website is reduced. This also reduces the impact on the website and safeguards its service quality. After the crawling of a website is completed, the recorded crawling efficiency of the website is updated from the efficiency of this crawl and the historical crawling efficiency, providing empirical guidance for the next crawl.
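The adjustment rule can be sketched in Python under one stated interpretation: the phrase "lower than the historical empirical value" is read here as "worse than", that is, a longer delay. That reading, the tolerance band, the reduction factor, the node limits and the weighted update of the historical value are all assumptions made for illustration; the source gives no numeric parameters.

```python
def adjust_node_count(current_nodes: int,
                      realtime_delay: float,
                      historical_delay: float,
                      tolerance: float = 0.1,
                      reduce_factor: float = 1.5,
                      min_nodes: int = 1,
                      max_nodes: int = 32) -> int:
    """Return the node count for the next scheduling round.

    Assumed interpretation: delay close to or better than history adds
    one node (slow increase); delay far worse than history removes one.
    """
    if realtime_delay > historical_delay * reduce_factor:
        # Delay is beyond the threshold: the site is under pressure.
        return max(min_nodes, current_nodes - 1)
    if realtime_delay <= historical_delay * (1 + tolerance):
        # Delay is close to or better than the historical value.
        return min(max_nodes, current_nodes + 1)
    # Delay is worse than history but within the threshold: hold steady.
    return current_nodes


def update_history(historical_delay: float,
                   observed_delay: float,
                   weight: float = 0.3) -> float:
    """Blend the finished crawl into the historical value (assumed
    exponential weighting) to guide the next crawl of the same site."""
    return (1 - weight) * historical_delay + weight * observed_delay
```

Changing the count by one node per round is one way to express the "slowly increases" behaviour; a real scheduler might use a different step.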
During DNS resolution, the system uses a local multi-level storage technology, which reduces the frequency of DNS query requests sent to the DNS server, reduces the time spent waiting on network delay, and improves the network access efficiency of the system. In the prior art, a query message is sent to a DNS server every time a domain name is accessed, before the IP address of the website is obtained, and resolution is performed by the domain name server, which wastes query time unnecessarily. The system instead uses the local multi-level storage technology as a DNS cache: once the mapping between a domain name and its IP address has been stored, the next access to that domain name is resolved from local storage rather than from the DNS server, which increases page access speed. A local system has two forms of DNS cache, the browser cache and the system cache. The TTL value of a DNS record in the cache is the maximum validity period of the cached entry, generally about 60 seconds; a record that exceeds this period without being hit is cleared, and the next access to the page must go to the DNS server over the network again. Because the interval between finishing a website and visiting it again often exceeds the TTL of the DNS record, the DNS server has to be queried over the network again when the website is revisited. By analyzing how DNS resolution is used in practice, the invention improves local DNS caching and stores the domain names that the system needs to resolve in the local hard disk and in memory for a long time.
When a domain name needs to be resolved, the DNS cache in memory is queried first, and on a hit the record is read directly from memory. If the record is not in memory, the DNS records stored on the hard disk are loaded. If the record is not found on the hard disk either, the DNS service is accessed over the network. If the record is found on the hard disk, the records that follow it on the hard disk are loaded at the same time, in accordance with the sequential and local nature of page acquisition. The multi-level storage technology thereby avoids the time wasted on accessing the DNS server.
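The lookup order described above (memory, then hard disk with read-ahead, then the network) can be sketched in Python as follows. The JSON file format, the `prefetch` count and the class name `MultiLevelDnsCache` are assumptions for illustration; the resolver is passed in as a callable, for example `socket.gethostbyname`, so that the sketch makes no network calls by itself. Record expiry is deliberately left out.

```python
import json
import os
from typing import Callable, Dict


class MultiLevelDnsCache:
    """Memory cache in front of a disk store in front of a resolver."""

    def __init__(self, disk_path: str,
                 resolver: Callable[[str], str],
                 prefetch: int = 2) -> None:
        self.memory: Dict[str, str] = {}
        self.disk_path = disk_path
        self.resolver = resolver      # e.g. socket.gethostbyname
        self.prefetch = prefetch      # records to read ahead from disk

    def _load_disk(self) -> Dict[str, str]:
        if not os.path.exists(self.disk_path):
            return {}
        with open(self.disk_path, encoding="utf-8") as fh:
            return json.load(fh)

    def _save_disk(self, records: Dict[str, str]) -> None:
        with open(self.disk_path, "w", encoding="utf-8") as fh:
            json.dump(records, fh)

    def lookup(self, host: str) -> str:
        # Level 1: memory.
        if host in self.memory:
            return self.memory[host]
        # Level 2: hard disk; on a hit, also load the records stored
        # right after it (sequential locality of page acquisition).
        records = self._load_disk()
        if host in records:
            names = list(records)
            start = names.index(host)
            for name in names[start:start + 1 + self.prefetch]:
                self.memory[name] = records[name]
            return self.memory[host]
        # Level 3: the DNS service over the network.
        address = self.resolver(host)
        records[host] = address
        self._save_disk(records)
        self.memory[host] = address
        return address
```

A long-running crawler would presumably also need a way to refresh stale records, since addresses can change; how the invention handles that is not stated in the source.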
The overall application architecture of the system is designed according to the idea of "one control and a four-layer architecture":
One control: the security management and control center provides domain name entry, crawling parameter setting, the adding, deleting, modifying, querying, starting and stopping of tasks and other function maintenance, as well as account management, authority management and the like.
A four-layer architecture: the system comprises an account authentication management layer, a resource access channel, a functional component layer and a resource access layer. The account authentication management layer implements account authentication management; the resource access channel implements the management of the cluster by the security control center; the functional component layer implements page extraction and task distribution; and the resource access layer serves as the data channel during task execution.
The system can also be applied to other fields:
cloud computing: the scanning service is provided in a SaaS mode, is used for large-scale scanning, and has good elasticity and expansibility.
Rich libraries: task execution is implemented in Python, which has a rich set of libraries, including libraries that support scientific computing and artificial intelligence; typical examples are NumPy, SciPy, Matplotlib, the Enthought libraries and pandas. Using existing libraries greatly shortens programming time.
Expansibility: the expandability and the portability are strong.
The embodiment of the invention also provides a cloud environment distributed Web page extraction and analysis method for edge computing, which comprises the following steps:
Step 1, a user submits a scanning task through a Web page, and the scanning task is stored directly in a database.
Step 2, the task monitoring unit dynamically checks the database. The task monitoring unit is a functional module that monitors, in real time, the tasks added by the client; it can be implemented in many ways, including active pushing by the front end of the system or monitoring of database changes by the back end. When a new scanning task is found to have been written into the database, the task monitoring unit puts the new task into a message queue; the central management node reads unexecuted tasks from the message queue and, according to the current running state of each computing node, selects idle computing nodes and distributes the tasks to them.
Step 3, each computing node reports to the central management node on the basis of real-time crawling efficiency and historical experience, requesting the central management node to increase or decrease the number of computing nodes scheduled; the central management node dynamically adjusts the number of computing nodes by combining the real-time crawling delay of each computing node with historical experience. When the real-time crawling delay reported by the computing nodes lies in an interval close to or better than the historical empirical value, the central management node slowly increases the number of computing nodes until the crawling delay reported by the computing nodes is lower than the historical empirical value, at which point it stops increasing the number of computing nodes. When the real-time crawling delay reported by the computing nodes is lower than a threshold of the historical empirical value, the number of computing nodes extracting the pages of the website is reduced.
Step 4, each computing node returns its computing result to the central management node after executing the task; the central management node collects and analyzes the computing results of the plurality of computing nodes, performs data persistence processing, and stores the extracted and analyzed results in a database.
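Of the implementation options mentioned in Step 2, the back-end variant that watches the database can be sketched with Python's standard `sqlite3` and `queue` modules. The table name `scan_task`, its columns and the status values are hypothetical, and an in-process `queue.Queue` stands in for whatever message queue a real deployment would use.

```python
import queue
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) task table used by this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scan_task ("
        "id INTEGER PRIMARY KEY, "
        "domain TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'new')"
    )


def poll_new_tasks(conn: sqlite3.Connection,
                   task_queue: "queue.Queue") -> int:
    """Move every task still marked 'new' into the message queue.

    Returns the number of tasks that were queued in this polling round.
    """
    rows = conn.execute(
        "SELECT id, domain FROM scan_task WHERE status = 'new' ORDER BY id"
    ).fetchall()
    for task_id, domain in rows:
        task_queue.put({"id": task_id, "domain": domain})
        conn.execute(
            "UPDATE scan_task SET status = 'queued' WHERE id = ?",
            (task_id,),
        )
    conn.commit()
    return len(rows)
```

The task monitoring unit would call `poll_new_tasks` periodically; marking rows as queued keeps a task from being enqueued twice.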
Task scheduling involves reading and distributing tasks, and each of these operations consumes system resources and time. A cache mechanism can therefore be adopted in the task scheduling unit so that several tasks are distributed to an idle computing node at one time, which reduces the data-reading burden of task scheduling.
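A minimal Python sketch of such batched distribution is given below. The batch size and the function names `take_batch` and `dispatch` are assumptions for illustration, and the standard `queue.Queue` again stands in for the message queue.

```python
import queue
from typing import Dict, List


def take_batch(task_queue: "queue.Queue", batch_size: int) -> List:
    """Read up to batch_size tasks from the queue in one pass."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(task_queue.get_nowait())
        except queue.Empty:
            break
    return batch


def dispatch(task_queue: "queue.Queue",
             idle_nodes: List[str],
             batch_size: int = 5) -> Dict[str, List]:
    """Hand each idle node a whole batch instead of one task at a time."""
    assignments: Dict[str, List] = {}
    for node in idle_nodes:
        batch = take_batch(task_queue, batch_size)
        if not batch:
            break  # queue is drained; remaining nodes stay idle
        assignments[node] = batch
    return assignments
```

Each node then works through its local batch without returning to the scheduler after every task.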
In FIG. 3, the front-end page of the cloud computing environment is mainly used for interaction with the user and for presenting result reports. The storage holds page acquisition tasks, intermediate states and scanning results, which supports content display and task driving at the front end. The message queue is a mechanism for distributing tasks among machines; it provides redundancy and caching for scanning tasks. Because task execution sometimes fails during data processing, the message queue is needed for redundant processing: task data is stored persistently, which eliminates the risk of message loss. Task scheduling reasonably schedules and distributes the tasks in the message queue and must have strong fault tolerance while doing so. The task executor executes a specific scanning task and reports the scanning state and scanning result to the storage.
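The fault-tolerant handling of failed tasks can be illustrated with the Python sketch below, in which a task that raises an exception is put back on the queue until an attempt limit is reached. The limit of three attempts, the task dictionary layout and the function name `run_with_requeue` are assumptions for illustration; the in-memory `queue.Queue` used here does not persist task data, so the persistence described above would have to come from the real message queue.

```python
import queue
from typing import Callable, List, Tuple


def run_with_requeue(task_queue: "queue.Queue",
                     execute: Callable[[str], str],
                     max_attempts: int = 3) -> Tuple[List[str], List[dict]]:
    """Drain the queue, re-queuing failed tasks up to max_attempts.

    Returns the collected results and the tasks that failed for good.
    """
    results: List[str] = []
    failed: List[dict] = []
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break
        try:
            results.append(execute(task["payload"]))
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] < max_attempts:
                task_queue.put(task)   # keep the task for another try
            else:
                failed.append(task)    # give up and report the failure
    return results, failed
```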
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (7)
1. An edge-computing cloud environment distributed Web page extraction and analysis system, characterized in that it adopts distributed deployment and comprises:
a task monitoring unit, used for monitoring the scanning tasks submitted by the user and putting new scanning tasks into a message queue;
a central management node, used for distributing work tasks to each computing node, collecting the result data produced by the computing nodes, summarizing and analyzing the result data, and then performing data persistence processing;
a plurality of computing nodes, used for completing the work tasks distributed by the central management node, computing on the execution of the work tasks to form computing results, and sending the computing results to the central management node;
a crawling strategy analysis module, used for analyzing page content, taking the number of valid URLs and the number of invalid URLs in the page as factors, calculating the depth value of the analyzed page and the depth values of the pages it contains, and dynamically determining the crawling depth;
a crawling depth analysis module, used for identifying and analyzing the content contained in a Web page and determining, according to the branch path depth empirical value, whether to end deep crawling of the identified page;
and a page crawling deduplication module, used for improving the deduplication efficiency of the system by means of a deduplication technique that constructs a URL comparison binary tree: crawled links are inserted into the URL comparison binary tree, and for a new crawling task the parameter values in the URL are first removed and the URL is then compared with the URL comparison binary tree, so as to distinguish repeated, similar and looping relationships among URLs.
2. The edge-computing cloud environment distributed Web page extraction and analysis system of claim 1, wherein: the crawling strategy analysis module analyzes the content of each crawled page, stops crawling the current page when the depth value of the hyperlinks contained in the page is smaller than the depth value of the current page, and returns to the upper-level page to process other pages.
3. The edge-computing cloud environment distributed Web page extraction and analysis system of claim 1, wherein: when the content of the analyzed page is mainly multimedia and documents, the crawling depth analysis module automatically stops crawling and analyzing the hyperlinks in the page, returns to the upper-level page to continue crawling other pages, and records the branch path depth empirical value of the analyzed page.
4. The edge-computing cloud environment distributed Web page extraction and analysis system of claim 1, wherein: in the page crawling deduplication module, the URL comparison binary tree is built by dividing the website URL into its access positions, each access position being a node of the binary tree, so that the access paths existing in the website are constructed into the binary tree; each newly crawled page is compared with the URL comparison binary tree, and the page is crawled only when it is judged not to have been crawled.
5. The edge-computing cloud environment distributed Web page extraction and analysis system of claim 1, wherein: the central management node further comprises a resource allocation module, which determines the amount of computing node resources to allocate according to the historical crawling efficiency and the real-time crawling efficiency; after the crawling of a website is completed, the recorded crawling efficiency of the website is updated according to the current crawling efficiency and the historical crawling efficiency, providing empirical guidance for the next crawl.
6. An edge-computing cloud environment distributed Web page extraction and analysis method, characterized by comprising the following steps:
step 1, a user submits a scanning task through a Web page, and the scanning task is stored directly in a database;
step 2, the task monitoring unit monitors the scanning tasks submitted by the user; when it finds a new scanning task, it puts the new task into a message queue; the central management node reads unexecuted tasks from the message queue and, according to the current running condition of each computing node, selects an idle computing node to which the task is distributed;
step 3, each computing node feeds back to the central management node according to the real-time crawling efficiency and historical experience, requesting the central management node to increase or decrease the number of computing nodes; when the real-time crawling delay fed back by each computing node is close to or better than the interval of the historical empirical value, the central management node slowly increases the number of computing nodes until the crawling delay fed back by each computing node is lower than the historical empirical value, at which point it stops increasing the number of computing nodes; when the real-time crawling delay fed back by each computing node is lower than a threshold of the historical empirical value, the number of computing nodes allocated to the website is reduced;
and step 4, after executing its task, each computing node sends its computing result back to the central management node, and the central management node collects and analyzes the computing results from the plurality of computing nodes and performs data persistence processing.
7. The edge-computing cloud environment distributed Web page extraction and analysis method of claim 6, wherein: a cache mechanism is adopted when the central management node schedules work tasks, so that a plurality of tasks are distributed to an idle computing node unit at one time, which reduces the data-reading burden of task scheduling.
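The URL deduplication recited in claims 1 and 4 can be illustrated with a minimal sketch. It makes several assumptions: it uses nested dictionaries (a prefix tree keyed by access position) rather than a strict binary tree, the names `url_key` and `UrlTree` are hypothetical, and it only detects repeated and parameter-similar URLs, not loops. Parameter values are dropped before comparison, so two URLs that differ only in parameter values map to the same tree path.

```python
from urllib.parse import parse_qsl, urlsplit

_END = object()  # marks a tree node at which a crawled URL ends


def url_key(url):
    """Split a URL into host, path segments and sorted parameter names (values removed)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    key = [parts.netloc.lower()] + segments
    if names:
        key.append("?" + "&".join(names))
    return key


class UrlTree:
    """Prefix tree of crawled URLs; each access position in the URL is one node."""

    def __init__(self):
        self._root = {}

    def add_if_new(self, url):
        """Insert the URL; return True if it was not seen before and should be crawled."""
        node = self._root
        for part in url_key(url):
            node = node.setdefault(part, {})
        if node.get(_END):
            return False
        node[_END] = True
        return True
```

For example, after `http://example.com/a/b?id=1` has been inserted, `http://example.com/a/b?id=2` is reported as already seen, while `http://example.com/a/c` is reported as new.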
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911301759.2A CN110941788A (en) | 2019-12-17 | 2019-12-17 | Cloud environment distributed Web page extraction and analysis system and method for edge computing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110941788A (en) | 2020-03-31 |
Family
ID=69911876
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911301759.2A Pending CN110941788A (en) | 2019-12-17 | 2019-12-17 | Cloud environment distributed Web page extraction and analysis system and method for edge computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110941788A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112632566A (en) * | 2021-03-05 | 2021-04-09 | 腾讯科技(深圳)有限公司 | Vulnerability scanning method and device, storage medium and electronic equipment |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120102019A1 (en) * | 2010-10-25 | 2012-04-26 | Korea Advanced Institute Of Science And Technology | Method and apparatus for crawling webpages |
CN102930059A (en) * | 2012-11-26 | 2013-02-13 | 电子科技大学 | Method for designing focused crawler |
CN103092698A (en) * | 2012-12-24 | 2013-05-08 | 中国科学院深圳先进技术研究院 | System and method of cloud computing application automatic deployment |
CN103414718A (en) * | 2013-08-16 | 2013-11-27 | 蓝盾信息安全技术股份有限公司 | Distributed Web vulnerability scanning method |
CN104408182A (en) * | 2014-12-15 | 2015-03-11 | 北京国双科技有限公司 | Method and device for processing web crawler data on distributed system |
CN106874487A (en) * | 2017-02-21 | 2017-06-20 | 国信优易数据有限公司 | Distributed crawler management system and method thereof |
CN106919570A (en) * | 2015-12-24 | 2017-07-04 | 国家新闻出版广电总局广播科学研究院 | Page link deduplication scanning method and device for network-oriented new media |
CN107026871A (en) * | 2017-05-15 | 2017-08-08 | 安徽大学 | Web vulnerability scanning method based on cloud computing |
CN108063759A (en) * | 2017-12-05 | 2018-05-22 | 西安交大捷普网络科技有限公司 | Web vulnerability scanning methods |
CN109284430A (en) * | 2018-09-07 | 2019-01-29 | 杭州艾塔科技有限公司 | Visual topic Web page content crawling system and method based on distributed architecture |
WO2019153603A1 (en) * | 2018-02-06 | 2019-08-15 | 平安科技(深圳)有限公司 | Web page crawling configuration method, application server and computer readable storage medium |
CN110572448A (en) * | 2019-08-30 | 2019-12-13 | 烽火通信科技股份有限公司 | distributed edge cloud system architecture |
Non-Patent Citations (1)
Title |
---|
Liu Zheng; Zhang Guoyin: "基于云计算的Web漏洞检测分析系统" (Web vulnerability detection and analysis system based on cloud computing), no. 10 * |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109033123B (en) | Big data-based query method and device, computer equipment and storage medium | |
CN106874487B (en) | Distributed crawler management system and method thereof | |
CN112527489B (en) | Task scheduling method, device, equipment and computer readable storage medium | |
CN102521712B (en) | Process instance data processing method and device | |
CN100538646C (en) | Method and apparatus for executing SQL script files in a distributed system | |
CN105893542B (en) | Cold data file redistribution method and system in a cloud storage system | |
CN104144142A (en) | Web vulnerability discovery method and system | |
CN109885642B (en) | Hierarchical storage method and device for full-text retrieval | |
CN106951179A (en) | Data migration method and device | |
CN106202459A (en) | Relational database storage performance optimization method and system in a virtualized environment | |
CN109460345A (en) | Calculation method and system for real-time data | |
CN108228322A (en) | Distributed link tracking and analysis method, server, and global scheduler | |
CN111966283A (en) | Client multi-level caching method and system based on enterprise-level supercomputing scenario | |
RU2005130257A (en) | SYSTEMS AND METHODS OF PREVENTING INTRODUCTION FOR NETWORK SERVERS | |
CN108614847A (en) | Data caching method and system | |
CN113626151B (en) | Container cloud log collection resource control method and system | |
CN110941788A (en) | Cloud environment distributed Web page extraction and analysis system and method for edge computing | |
Yu et al. | Sasm: Improving spark performance with adaptive skew mitigation | |
CN116974994A (en) | High-efficiency file collaboration system based on clusters | |
CN111078975A (en) | Multi-node incremental data acquisition system and acquisition method | |
CN115993932A (en) | Data processing method, device, storage medium and electronic equipment | |
CN110134615A (en) | Method and device for an application program to acquire log data | |
CN111290855B (en) | GPU card management method, system and storage medium for multiple GPU servers in distributed environment | |
CN114020446A (en) | Cross-multi-engine routing processing method, device, equipment and storage medium | |
CN111339388B (en) | Information crawling system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240419 |