CN108664559A - A kind of automatic crawling method of website and webpage source code - Google Patents

Info

Publication number
CN108664559A
Authority
CN
China
Prior art keywords
website
webpage
source code
web
crawl
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810297883.5A
Other languages
Chinese (zh)
Inventor
杨智
陈锭敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a method for automatically crawling the source code of website pages. Pages are crawled within a website determined in advance, so the crawled pages are concentrated and share obvious common characteristics, which makes it easier to write the crawler program. Because the crawl is confined to a specific website, the target information is also concentrated, and the required information can be obtained completely and quickly. When crawling page source code, the crawler disguises its requests as page requests sent by a browser, which prevents the website from identifying them as coming from machine code. A waiting period is set so that, when the website or network behaves abnormally and does not respond to the crawler, the code does not fail and stop; it can therefore run automatically for a long time. A proxy IP address database is added so that, when the crawler's IP is blocked and the website refuses access, the program automatically switches to another IP and continues crawling page source code.

Description

A kind of automatic crawling method of website and webpage source code
Technical field
The present invention relates to the technical field of web crawlers, and more particularly to a method for automatically crawling the source code of website pages.
Background art
With the rapid development of Internet technology, the amount of information on the network is growing explosively, which makes it increasingly difficult to find the data we need. Statistically analyzing such diverse, real-time data to uncover the valuable information behind it is therefore highly significant. Against this background, big data technology has developed rapidly in recent years and is applied ever more widely across industries. To analyze information using massive data, obtaining data from the network and storing it becomes especially important.
At present, when people look for data, they mostly search with a search engine and then browse the resulting websites directly. This approach is simple and convenient, but when the data volume is huge and the data must be stored, it takes a great deal of time and often cannot yield the information we need from the data.
When the data volume is large, the technical solution currently used is to crawl web page source code with a web crawler and then extract the required data from it. A web crawler is a program that automatically fetches web pages; the main types are general-purpose web crawlers and focused web crawlers. For example, Chinese invention patent application No. CN201410021505.6 discloses a data acquisition method comprising the following steps: analyzing the initial data and performing word segmentation to obtain keywords; searching with the keywords to obtain pages of a website; crawling information from those pages by matching regular expressions against the source files of the pages to obtain matching results; comparing against the keywords or a tag library and placing the information in the matching results into the corresponding fields of the corresponding video attribute library; performing data analysis, computing on the data according to the weight of the website's pages and the amount of repeated information; and editing, confirming, and processing the automatically filled content. As another example, Chinese invention patent application No. CN201310198598.5 discloses a web page crawling method and system, the method comprising: training on sample web pages to obtain data extraction conditions; crawling web pages; parsing each crawled page into a DOM tree; and analyzing the DOM tree according to the data extraction conditions to extract the required data.
In the existing solutions above, searching with a search engine cannot yield the target data quickly, because the amount of web data is enormous. When the target data is itself very large, searching with a search engine and then downloading the results one by one consumes a great deal of time, and checking by eye inevitably misses some data. Extracting data from source code crawled by a web crawler can indeed obtain target data quickly and accurately, but the existing solutions do not further explain how the page source code is crawled. In practice, web servers deploy anti-crawler measures and may directly refuse requests that are obviously initiated by automated programs. The most common anti-crawler checks are whether a request was sent by a browser and whether the request frequency is too high, that is, whether a large number of requests arrive within a short time. These measures often prevent us from crawling a website's page source code successfully.
Therefore, how to enable crawler code to crawl web page source code at high frequency within a short time and extract valuable information from it is a problem that urgently needs to be solved.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and provide a method for automatically crawling website page source code that avoids being identified by the website as automated machine code, can run automatically for a long time, and can continue crawling page source code even after the crawler's IP has been blocked and the website refuses access.
To achieve the above object, the technical solution provided by the present invention is as follows.
The method includes the following steps:
S1. Determine the website that contains the target information, and analyze the website to determine which pages contain the target information and the common features peculiar to those pages;
S2. Load the URLs of the start page and of the pages to be crawled;
S3. Load the common features of the pages to be crawled;
S4. Crawl the source code of the start page and check whether it contains URLs of pages containing the target information; if none exist, end the crawler program; if they exist, crawl the source code of the pages containing the target information one by one until all such pages have been crawled;
S5. Save the crawled target source code to a specified folder at a specified location so that the target information can be extracted later.
Further, when page source code is crawled in step S4, the crawler program disguises itself as a desktop browser when sending page requests to the web server, so that the website does not identify it as machine code and refuse access.
Further, when the crawler program sends a page request to the web server, the request headers of the outgoing message are modified with reference to the header parameters that a browser uses when sending a request to the website, so that the crawler's request looks more like a browser visit. The parameters of the page request message that the browser sends to the website are looked up with the browser's developer tools.
Further, each time the crawler program, disguised as a desktop browser, sends a page request to the web server, it monitors whether the request receives a response. If the first request receives no response, the program enters a waiting period and sends the request again afterwards. If five consecutive requests receive no response, the proxy IP is replaced and requests continue to be sent.
Further, the proxy IPs are crawled from proxy IP websites by a crawler, and each crawled proxy IP is written to a database after testing confirms that it is usable. When the crawler program's IP needs to be replaced, a usable IP is taken from the database and written into the crawler program's IP parameter.
Compared with the prior art, the principle and advantages of this solution are as follows:
First, the website containing the target information is determined and analyzed to identify the pages where the target information is located and the common features peculiar to those pages. Next, the URLs of the start page and of the pages to be crawled, together with the common features of those pages, are loaded. The source code of the start page is then crawled and checked for URLs of pages containing the target information: if there are none, the crawler program ends; if there are, the source code of those pages is crawled one by one until all pages containing the target information have been crawled. Finally, the crawled target source code is saved to a specified folder at a specified location so that the target information can be extracted later.
Specifically, crawling within a predetermined website means the crawled pages are concentrated and share fairly obvious common characteristics, which makes it easier to write the crawler program. Because the target information is likewise concentrated, the required information can be obtained completely and quickly.
When page source code is crawled, the crawler program disguises itself as a desktop browser when sending page requests to the web server, so that the website does not identify it as machine code and refuse access.
A waiting period is set so that the code does not fail and stop when the website or network behaves abnormally and does not respond to the crawler program; the code can therefore run automatically for a long time.
A crawler collects IPs from proxy IP websites and tests whether each one is usable: usable IPs are written to the database and unusable ones are discarded. When the crawler program's IP needs to be replaced, a usable IP is taken from the database and written into the crawler program's IP parameter. In this way, when the computer's own IP is blocked, the proxy IP makes the website believe that a different computer is accessing it, so it continues to respond to the crawler's page requests, and the crawler program can keep running at high frequency for a long time.
Description of the drawings
Fig. 1 is a flow chart of the method for automatically crawling website page source code according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment:
The method for automatically crawling website page source code described in this embodiment includes the following steps:
S1. Determine the website that contains the target information, and analyze the website to determine which pages contain the target information and the common features peculiar to those pages;
S2. Load the URLs of the start page and of the pages to be crawled. (A URL, or uniform resource locator, is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which indicates where the file is located and how a browser should handle it. Each page of a website is a file stored at a defined location, and that location is expressed by its URL. When a browser visits a page, it must send the website a request for that page; the request includes the URL, which specifies the exact page to be accessed. Likewise, a crawler program must obtain a page's URL before it can send a request to the website and crawl the page's source code.)
S3. Load the common features of the pages to be crawled;
S4. Crawl the source code of the start page and check whether it contains URLs of pages containing the target information; if none exist, end the crawler program; if they exist, crawl the source code of the pages containing the target information one by one until all such pages have been crawled;
Specifically, when page source code is crawled, the crawler program disguises itself as a desktop browser when sending page requests to the web server, so that the website does not identify it as machine code and refuse access.
This is explained as follows. When a browser is used to view a page on a website, the computer first establishes a reliable connection with the website's server, known as a TCP connection. Once the connection is established, the browser on the computer and the server communicate in order to access the pages on the website. So that any browser can communicate with any web server, a unified communication protocol is defined, called the HTTP protocol. Under this protocol, to view a page the browser sends a page request to the web server that hosts it; if the server responds, it sends the page's source code to the computer, and that source code, opened on the computer, is the page we see. The page request takes the form of a message sent by the browser to the web server. The message carries several important parameters that tell the server which browser model is in use, which page is wanted, and what the computer's IP address is. If Python code simply sends a page request to the web server, the request message uses the default request headers. For example, the default User-Agent value that Python uses is Python-urllib/3.4, whereas the User-Agent value sent by a browser is: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.3. The two differ markedly, so a website can easily tell whether it is being accessed by machine code or by a person operating a browser. To keep the automated program from being refused, the crawler code modifies the request headers of the messages it sends to the web server, using as a reference the header parameters that the browser on the same computer sends to the website, so that the crawler's requests look more like browser visits. These parameters can be found by opening the browser on the computer, pressing F12 to open its developer tools, and inspecting the page request message sent to the website.
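The header replacement described above can be sketched with Python's standard urllib module. This is a minimal illustration rather than the patent's own code: the User-Agent string is the one quoted in the description, while the Accept and Accept-Language values and the names BROWSER_HEADERS, build_request and fetch_source are assumptions introduced for the example.

```python
import urllib.request

# Header values would normally be copied from the browser's developer tools (F12).
# The User-Agent is the string quoted in the description; the other two headers
# are illustrative assumptions.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.3"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers instead of urllib's defaults."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch_source(url, timeout=10):
    """Download the page source as bytes (this performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Keeping the headers in one dictionary means they can be refreshed from the developer tools without touching the request logic.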
Step S4 also includes a waiting period. While the crawler program is running normally, it monitors whether each page request sent to the web server receives a response. If a request receives no response, the program waits for a certain time and then sends the request again. If that request receives a response, the program continues with the next step; if there is still no response, it waits for a longer time before sending the request once more. If five consecutive requests all receive no response, the IP is replaced and requests continue to be sent.
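The waiting and retry behaviour can be sketched as follows. The description fixes only the limit of five consecutive failures and says that the wait grows; the linear growth, the two-second base, the cap on proxy changes, and the fetch and switch_proxy callables are all assumptions made for this illustration.

```python
import time


def fetch_with_retry(url, fetch, switch_proxy, max_failures=5,
                     base_wait=2.0, max_switches=3, sleep=time.sleep):
    """Request url until a response arrives.

    fetch        -- hypothetical callable returning the page source; it is
                    expected to raise OSError when no response is received
                    (urllib.error.URLError and socket timeouts are OSErrors)
    switch_proxy -- hypothetical callable that installs a different proxy IP
    The wait grows after every failed attempt. After max_failures consecutive
    failures the proxy is replaced and the failure counter starts again.
    """
    failures = 0
    switches = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            failures += 1
        if failures < max_failures:
            sleep(base_wait * failures)   # wait longer after each failure
            continue
        if switches >= max_switches:
            # Assumed safeguard so the loop cannot run forever.
            raise RuntimeError(f"no response from {url} after {switches} proxy changes")
        switch_proxy()
        switches += 1
        failures = 0
```

Passing sleep as a parameter keeps the function testable without real delays.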
As described above, when a browser sends a page request to a web server it sends a message containing several parameters, one of which is the computer's IP address; this is how the web server knows who sent the request and where to send the page. A crawler program requests pages at a very high frequency, and one of the most common anti-crawler measures is to check the request frequency: if a large number of requests arrive within a short time, the server blocks the IP for a period, whether or not a person is behind them. This creates a dilemma: crawl too fast and the IP is blocked; crawl too slowly and the job takes too long. Proxy IPs resolve this: when the current IP is blocked, the value of the IP parameter in the request message is changed. This embodiment builds a database that supplies stable, usable proxy IPs to the crawler. Its main functions are: obtaining proxy IPs from proxy websites; testing each proxy IP for usability by accessing a stable website, such as Baidu or Tencent, and checking the status code; writing usable IPs to the database; cleaning the database by removing IPs that are no longer usable; and returning a usable proxy on request.
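The proxy database functions listed above (test, write, clean, fetch one) might look like the sketch below. The description does not name a database, so SQLite, the proxies table, and every function name here are assumptions; collecting candidate addresses from proxy websites is left out. Note also that with urllib the proxy is configured on the opener rather than written into the request message itself.

```python
import http.client
import sqlite3
import urllib.request


def open_pool(path=":memory:"):
    """Open (or create) the proxy database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS proxies (address TEXT PRIMARY KEY)")
    return conn


def proxy_works(address, test_url="https://www.baidu.com", timeout=5):
    """Return True if a stable site answers with status 200 through the proxy.

    address is expected in the form "http://host:port". This makes a real
    network request.
    """
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False


def add_if_usable(conn, address, check=proxy_works):
    """Test a candidate proxy and write it to the database only if it works."""
    if not check(address):
        return False
    conn.execute("INSERT OR IGNORE INTO proxies (address) VALUES (?)", (address,))
    conn.commit()
    return True


def clean_pool(conn, check=proxy_works):
    """Re-test every stored proxy and delete those that no longer work."""
    for (address,) in conn.execute("SELECT address FROM proxies").fetchall():
        if not check(address):
            conn.execute("DELETE FROM proxies WHERE address = ?", (address,))
    conn.commit()


def get_proxy(conn):
    """Return one stored proxy address chosen at random, or None if the pool is empty."""
    row = conn.execute("SELECT address FROM proxies ORDER BY RANDOM() LIMIT 1").fetchone()
    return row[0] if row else None
```

The check argument lets a stub replace the network test, which is useful when trying the pool logic offline.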
S5. Save the crawled target source code to a specified folder at a specified location so that the target information can be extracted later.
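Steps S2 to S5 can be tied together in a short sketch. It assumes that the "common feature" of the target pages is expressed as a regular expression over their URLs, which is only one possible reading of the description; the names LinkCollector, find_target_urls and crawl_site, the output file naming, and the injected fetch callable (which returns a page's source as text) are likewise hypothetical.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_target_urls(start_url, start_html, url_pattern):
    """Return absolute URLs linked from the start page that match the common feature."""
    collector = LinkCollector()
    collector.feed(start_html)
    targets = []
    for href in collector.links:
        absolute = urljoin(start_url, href)
        if re.search(url_pattern, absolute) and absolute not in targets:
            targets.append(absolute)
    return targets


def crawl_site(start_url, url_pattern, fetch, out_dir):
    """S4 and S5: crawl the start page, then each target page, and save the sources."""
    start_html = fetch(start_url)
    targets = find_target_urls(start_url, start_html, url_pattern)
    if not targets:
        return []                      # no target pages: end the crawl
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(targets):
        path = folder / f"page_{index:04d}.html"
        path.write_text(fetch(url), encoding="utf-8")
        saved.append(path)
    return saved
```

Because fetch is passed in, the same loop can run against a browser-like request function with retries, or against canned pages for a dry run.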
This embodiment first determines the target website where the target information is located, then determines which pages of that website contain the target information and what distinctive common characteristics those pages share, and finally uses those characteristics to crawl the page source code. In the existing technical solutions, by contrast, data is obtained by first determining keywords, then searching for pages containing the keywords, crawling those pages, and extracting data from them. The obvious difference lies in how the pages are determined. The prior art finds specific pages by keyword search, and because the amount of data on the network is enormous, a keyword search often returns thousands or tens of thousands of pages. The crawler therefore has a huge number of pages to crawl, and crawling takes a very long time. Moreover, because information on the network is so varied, the data obtained by keyword search is often not what people need, which adds to the burden. This embodiment, which has already fixed the website containing the target information and crawls that website's pages directly, solves this problem.
In addition, this embodiment disguises the crawler program as a desktop browser when sending page requests to the web server, so that the website does not identify it as machine code and refuse access. A waiting period is also set so that the code does not fail and stop when the website or network behaves abnormally and does not respond to the crawler program, allowing it to run automatically for a long time. Finally, a proxy IP address database is added so that, when the crawler's IP is blocked and the website refuses access, the program automatically switches IP and continues crawling page source code.
The embodiments described above are only preferred embodiments of the present invention and do not limit its scope of implementation; any change made according to the shape and principle of the present invention shall therefore fall within the scope of protection of the present invention.

Claims (5)

1. A method for automatically crawling website page source code, characterized in that it includes the following steps:
S1. Determine the website that contains the target information, and analyze the website to determine which pages contain the target information and the common features peculiar to those pages;
S2. Load the URLs of the start page and of the pages to be crawled;
S3. Load the common features of the pages to be crawled;
S4. Crawl the source code of the start page and check whether it contains URLs of pages containing the target information; if none exist, end the crawler program; if they exist, crawl the source code of the pages containing the target information one by one until all such pages have been crawled;
S5. Save the crawled target source code to a specified folder at a specified location so that the target information can be extracted later.
2. The method for automatically crawling website page source code according to claim 1, characterized in that, when page source code is crawled in step S4, the crawler program disguises itself as a desktop browser when sending page requests to the web server, so that the website does not identify it as machine code and refuse access.
3. The method for automatically crawling website page source code according to claim 2, characterized in that, when the crawler program sends a page request to the web server, the request headers of the outgoing message are modified with reference to the header parameters that a browser uses when sending a request to the website, so that the crawler's request looks more like a browser visit; the parameters of the page request message that the browser sends to the website are looked up with developer tools.
4. The method for automatically crawling website page source code according to claim 2, characterized in that, each time the crawler program, disguised as a desktop browser, sends a page request to the web server, it monitors whether the request receives a response; if the first request receives no response, the program enters a waiting period and sends the request again afterwards; if five consecutive requests receive no response, the proxy IP is replaced and requests continue to be sent.
5. The method for automatically crawling website page source code according to claim 4, characterized in that the proxy IPs are crawled from proxy IP websites by a crawler, and each crawled proxy IP is written to a database after testing confirms that it is usable; when the crawler program's IP needs to be replaced, a usable IP is taken from the database and written into the crawler program's IP parameter.
CN201810297883.5A 2018-03-30 2018-03-30 A kind of automatic crawling method of website and webpage source code Pending CN108664559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810297883.5A CN108664559A (en) 2018-03-30 2018-03-30 A kind of automatic crawling method of website and webpage source code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810297883.5A CN108664559A (en) 2018-03-30 2018-03-30 A kind of automatic crawling method of website and webpage source code

Publications (1)

Publication Number Publication Date
CN108664559A true CN108664559A (en) 2018-10-16

Family

ID=63783133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810297883.5A Pending CN108664559A (en) 2018-03-30 2018-03-30 A kind of automatic crawling method of website and webpage source code

Country Status (1)

Country Link
CN (1) CN108664559A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948026A (en) * 2019-03-28 2019-06-28 深信服科技股份有限公司 A kind of web data crawling method, device, equipment and medium
CN110929257A (en) * 2019-10-30 2020-03-27 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage
CN111310012A (en) * 2020-01-21 2020-06-19 国网安徽省电力有限公司滁州供电公司 Automatic monitoring and early warning method for enterprise information loss behavior
CN111538883A (en) * 2020-03-25 2020-08-14 北京市科学技术情报研究所 Data crawling method, system and equipment
CN112163138A (en) * 2020-09-11 2021-01-01 西南大学 Method for realizing data visualization of online shopping platform based on web crawler
CN112287200A (en) * 2020-11-20 2021-01-29 公安部第一研究所 Multi-target-oriented social public safety risk data acquisition method
CN112528118A (en) * 2020-12-17 2021-03-19 国家计算机网络与信息安全管理中心 Data acquisition method, system and device based on multi-channel proxy
CN112765366A (en) * 2021-01-24 2021-05-07 中国电子科技集团公司第十五研究所 APT (android Package) organization portrait construction method based on knowledge map
CN112800309A (en) * 2021-01-30 2021-05-14 上海应用技术大学 Crawler system based on HTTP proxy and implementation method thereof
CN113282893A (en) * 2021-04-27 2021-08-20 南方电网数字电网研究院有限公司 Source code reinforcing method and device, computer equipment and storage medium
CN113505287A (en) * 2021-06-24 2021-10-15 微梦创科网络科技(中国)有限公司 Website link detection method and system
CN114020987A (en) * 2022-01-06 2022-02-08 北京微步在线科技有限公司 Sample data acquisition method, device, equipment and storage medium based on webpage

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205076A1 (en) * 2001-03-06 2004-10-14 International Business Machines Corporation System and method to automate the management of hypertext link information in a Web site
US20070073758A1 (en) * 2005-09-23 2007-03-29 Redcarpet, Inc. Method and system for identifying targeted data on a web page
US20140189864A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Identifying web pages in malware distribution networks
CN105930385A (en) * 2016-04-13 2016-09-07 珠海迈科智能科技股份有限公司 Data crawling method and system
CN106844522A (en) * 2016-12-29 2017-06-13 北京市天元网络技术股份有限公司 A kind of network data crawling method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑铁男等: "《数字编辑运营实训教程》", 30 September 2017, 知识产权出版社 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948026A (en) * 2019-03-28 2019-06-28 深信服科技股份有限公司 A kind of web data crawling method, device, equipment and medium
CN110929257B (en) * 2019-10-30 2022-02-01 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage
CN110929257A (en) * 2019-10-30 2020-03-27 武汉绿色网络信息服务有限责任公司 Method and device for detecting malicious codes carried in webpage
CN111310012A (en) * 2020-01-21 2020-06-19 国网安徽省电力有限公司滁州供电公司 Automatic monitoring and early warning method for enterprise information loss behavior
CN111538883A (en) * 2020-03-25 2020-08-14 北京市科学技术情报研究所 Data crawling method, system and equipment
CN111538883B (en) * 2020-03-25 2023-11-17 北京市科学技术情报研究所 Data crawling method, system and equipment
CN112163138A (en) * 2020-09-11 2021-01-01 西南大学 Method for realizing data visualization of online shopping platform based on web crawler
CN112287200A (en) * 2020-11-20 2021-01-29 公安部第一研究所 Multi-target-oriented social public safety risk data acquisition method
CN112287200B (en) * 2020-11-20 2023-12-01 公安部第一研究所 Multi-objective-oriented social public security risk data acquisition method
CN112528118A (en) * 2020-12-17 2021-03-19 国家计算机网络与信息安全管理中心 Data acquisition method, system and device based on multi-channel proxy
CN112765366A (en) * 2021-01-24 2021-05-07 中国电子科技集团公司第十五研究所 APT (android Package) organization portrait construction method based on knowledge map
CN112800309A (en) * 2021-01-30 2021-05-14 上海应用技术大学 Crawler system based on HTTP proxy and implementation method thereof
CN113282893A (en) * 2021-04-27 2021-08-20 南方电网数字电网研究院有限公司 Source code reinforcing method and device, computer equipment and storage medium
CN113505287A (en) * 2021-06-24 2021-10-15 微梦创科网络科技(中国)有限公司 Website link detection method and system
CN114020987A (en) * 2022-01-06 2022-02-08 北京微步在线科技有限公司 Sample data acquisition method, device, equipment and storage medium based on webpage

Similar Documents

Publication Publication Date Title
CN108664559A (en) A kind of automatic crawling method of website and webpage source code
CN104125209B (en) Malice website prompt method and router
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
US9614862B2 (en) System and method for webpage analysis
CN112131882A (en) Multi-source heterogeneous network security knowledge graph construction method and device
CN108566399B (en) Phishing website identification method and system
CN101895516B (en) Method and device for positioning cross-site scripting attack source
CN103279710B (en) Method and system for detecting malicious codes of Internet information system
CN105184159A (en) Web page falsification identification method and apparatus
CN101971591A (en) System and method of analyzing web addresses
CN108268635B (en) Method and apparatus for acquiring data
CN111163054B (en) Method and device for detecting malicious behavior of webpage
CN109768992A (en) Webpage malicious scanning processing method and device, terminal device, readable storage medium storing program for executing
US9792370B2 (en) Identifying equivalent links on a page
CN113518077A (en) Malicious web crawler detection method, device, equipment and storage medium
CN106446113A (en) Mobile big data analysis method and device
CN108667770A (en) A kind of loophole test method, server and the system of website
JP4935399B2 (en) Security operation management system, method and program
WO2017063274A1 (en) Method for automatically determining malicious-jumping and malicious-nesting offensive websites
EP3745292A1 (en) Hidden link detection method and apparatus for website
CN110020161B (en) Data processing method, log processing method and terminal
CN108280102B (en) Internet surfing behavior recording method and device and user terminal
CN111125485A (en) Website URL crawling method based on Scapy
CN107566371B (en) WebShell mining method for massive logs
CN108282478A (en) A kind of WEB site safeties detection method, device and computer-readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination