CN107590265A - A web-crawler-based method for identifying the administrative affiliation of a website - Google Patents

A web-crawler-based method for identifying the administrative affiliation of a website

Info

Publication number
CN107590265A
Authority
CN
China
Prior art keywords
url
website
information
key table
content information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710866237.1A
Other languages
Chinese (zh)
Inventor
邱煜铭
范渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-09-22
Filing date
2017-09-22
Publication date
2018-01-16
Application filed by DBAPPSecurity Co Ltd
Priority to CN201710866237.1A
Publication of CN107590265A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a web-crawler-based method for identifying the administrative affiliation of a website. A queue of pending websites is initialized and threads are created. For the URL of each pending website, the top-level domain URL of that URL is obtained, and its homepage is crawled and parsed to obtain content information. The content information is checked for a home location name. If none is found, keywords are extracted from the content information and stored in a first keyword table, keywords from the content of the subordinate URLs are stored in a second keyword table, and the two tables are matched. If the tables contain ICP information, the province in which the current domain name is located is obtained from it and the corresponding city-level information is derived by syntactic analysis. Otherwise, address-related information in the first and second keyword tables is looked up and syntactically analyzed, and the common address with the highest matching degree is taken as the home location of the website. The method uses crawler technology to detect and compare website URLs automatically, which saves manual judgment and processing time, collects data promptly, improves data validity, and gives good detection results.

Description

A web-crawler-based method for identifying the administrative affiliation of a website
Technical field
The present invention relates to the technical field of digital computing or data processing equipment or methods specially adapted for specific functions, and in particular to a web-crawler-based method for identifying the administrative affiliation of a website that improves data validity and detection results.
Background
In today's network environment information is everywhere, and its reliability is uneven. For some institutional websites, especially those that are closely tied to the public and are also profit-making, once a counterfeit website appears, users in most cases have no way to tell accurately what they are dealing with. If the real-world home location of the domain name of such an institutional website can be known, users can judge its authenticity accurately. Internet regulators likewise need the geographic location of a domain name in order to locate the website in scenarios such as network rectification, information collection, institution verification, and other routine inspections. The geographic home location of the institution behind a website is an important piece of information about that website. When it is missing, it becomes harder for regulators to supervise the website and for visiting users to assess it.
At present, the information available for a large number of institutional websites is incomplete, and an institution's home location cannot be learned directly from the site information. The web pages of many institutional websites carry no geographic marking. The usual causes are that the location is hidden deliberately so that the website can operate illegally, or that it is omitted by accident when the website is built. Some institutional websites also have duplicated names, or names whose geographic indicator is shared by several places. In the existing network environment, therefore, the information a website provides is not enough by itself to determine the institution's true home location. For example, the place name "Changsha County" can refer to Changsha County in Changsha, Hunan Province, China, but it can also refer to Changsha County in Vietnam. Likewise, there are 16 towns named "Phoenix Town" in China alone, spread across the country's provinces. From the title of a "Phoenix Town People's Government" website alone, it is impossible to identify which province's Phoenix Town is meant, and therefore impossible to determine which town government the website belongs to.
In the prior art, such problems are generally handled by manually searching web information service platforms that index institution information. However, much of the indexed information is missing or incomplete. Collecting and organizing site information by hand also requires substantial labor and cost and takes a long time, and keeping the data up to date in real time is problematic.
Summary of the invention
To solve the problems in the prior art, the present invention provides an optimized web-crawler-based method for identifying the administrative affiliation of a website. The method is automatic, saves manual judgment and processing time, collects data promptly, improves data validity, and gives good detection results.
The technical solution adopted by the present invention is a web-crawler-based method for identifying the administrative affiliation of a website, the method comprising the following steps:
Step 1: Initialize the queue of pending websites and create threads;
Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL;
Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage;
Step 4: Determine whether the content information contains a home location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step;
Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords from that content information in a second keyword table;
Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province in which the current domain name is located from the ICP information, derive the corresponding city-level information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step;
Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
Preferably, in step 2, if the web page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is carried out.
Preferably, the anti-crawling handling includes switching IP addresses and reducing the request rate.
Preferably, in step 3, the content information of the homepage includes personal names, place names, institution names, postcodes, and fax and telephone numbers.
Preferably, in step 6, the ICP information includes the organization name, the website's ICP license number, the website name, and the website homepage.
The invention provides an optimized web-crawler-based method for identifying the administrative affiliation of a website. A web crawler takes the URL of any pending website, crawls the homepage of its top-level domain URL, and parses it to obtain the content information of the homepage. If the content information contains a home location name, the home location is identified directly. If not, a named entity recognition tool extracts the keywords from the content information and from the subordinate URLs, stores them in the first and second keyword tables, and matches the tables. The province in which the current domain name is located is then obtained from the ICP information, or else the address-related information in the two keyword tables is looked up and syntactically analyzed and the common address with the highest matching degree is matched, which gives the home location of the website. The method uses crawler technology to detect and compare website URLs automatically, which saves manual judgment and processing time, collects data promptly, improves data validity, and gives good detection results.
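The fall-through order described above (direct home location name first, then ICP information, then the best common address) can be sketched as a chain of strategies. This is only an illustrative sketch: the function name `identify_home_location` and the idea of passing each strategy as a callable are assumptions made for the example and are not taken from the patent text.

```python
def identify_home_location(homepage_text, subpage_texts, strategies):
    """Try each (name, strategy) pair in order; return the first hit.

    Each strategy is a callable that receives the list of page texts and
    returns a location string, or None when it cannot decide.
    """
    texts = [homepage_text, *subpage_texts]
    for name, strategy in strategies:
        location = strategy(texts)
        if location:
            # The first strategy that produces an answer wins, mirroring
            # the "if so, return to step 2; if not, next step" wording.
            return name, location
    return None, None
```

Passing the strategies in the order step 4, step 6, step 7 reproduces the decision order of the method, and a site for which every strategy returns None is simply left unresolved.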
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to an embodiment, but the scope of protection of the invention is not limited thereto.
The present invention relates to a web-crawler-based method for identifying the administrative affiliation of a website, the method comprising the following steps.
In the present invention, web page information is crawled by a web crawler. A web crawler, also known as a web spider, web robot, or web chaser, is a program or script that automatically fetches web information according to certain rules; it can crawl a website's content starting from the website's URL.
Step 1: Initialize the queue of pending websites and create threads.
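Step 1 can be illustrated with a thread-safe queue drained by a small pool of worker threads. The sketch below is a minimal example under stated assumptions: the name `run_workers`, the `handle` callback, and the default of four threads are hypothetical, since the patent does not specify how many threads are created or how work is distributed among them.

```python
import queue
import threading


def run_workers(urls, handle, num_threads=4):
    """Process every URL in `urls` with `handle` using worker threads."""
    pending = queue.Queue()  # the "pending website queue" of step 1
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained: this thread is finished
            outcome = handle(url)
            with lock:  # protect the shared result dictionary
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

In a real crawler `handle` would carry out steps 2 to 7 for one website; here it can be any function of a URL.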
Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL;
In step 2, if the web page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is carried out.
The anti-crawling handling includes switching IP addresses and reducing the request rate.
In the present invention, for example, a second-level domain of Shandong University is http://www.view.sdu.edu.cn/, and the top-level domain corresponding to that domain name is sdu.edu.cn; therefore http://www.sdu.edu.cn/ and http://www.view.sdu.edu.cn/ are both put into the queue of URLs to be crawled.
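The Shandong University example can be reproduced with the standard library. The sketch below is an assumption-laden illustration: `MULTI_LABEL_SUFFIXES` is a deliberately tiny, hypothetical list (a production system would need a complete public suffix list), and building the homepage as `http://www.<domain>/` simply follows the example in the text.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of two-label suffixes used in China.
MULTI_LABEL_SUFFIXES = {"edu.cn", "gov.cn", "com.cn", "org.cn", "net.cn"}


def primary_domain(url):
    """Return what the text calls the top-level domain, e.g. sdu.edu.cn."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def urls_to_crawl(url):
    """Return the homepage of the primary domain plus the original URL."""
    home = "http://www." + primary_domain(url) + "/"
    return [home] if home == url else [home, url]
```

For `http://www.view.sdu.edu.cn/` this yields the same two URLs that the description puts into the crawl queue.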
In the present invention, anti-crawling refers to the practice of some websites of restricting access by limiting the number of requests that a given IP address or account may make per unit of time. It can generally be handled by switching IP addresses and reducing the request rate.
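The handling just described (put the URL back, switch IP, slow down) can be sketched without any real network access by injecting the fetch function. Everything named here is hypothetical: `crawl_with_backoff`, the `looks_blocked` predicate, the proxy list, and the retry limit are assumptions for illustration, as the patent does not say how a blocked response is detected or how many retries are allowed.

```python
import time
from collections import deque


def crawl_with_backoff(urls, fetch, proxies, looks_blocked,
                       delay_step=1.0, max_retries=3):
    """Fetch each URL, re-queueing it when the response looks blocked."""
    pending = deque((url, 0) for url in urls)
    proxy_index, delay = 0, 0.0
    pages = {}
    while pending:
        url, attempts = pending.popleft()
        time.sleep(delay)
        body = fetch(url, proxies[proxy_index])
        if looks_blocked(body):
            if attempts + 1 < max_retries:
                pending.append((url, attempts + 1))  # put the URL back
            proxy_index = (proxy_index + 1) % len(proxies)  # switch IP
            delay += delay_step  # reduce the request rate
            continue
        pages[url] = body
    return pages
```

A URL that is still blocked after `max_retries` attempts is dropped from the result rather than retried forever.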
Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage.
In step 3, the content information of the homepage includes personal names, place names, institution names, postcodes, and fax and telephone numbers.
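Of the content fields listed for step 3, postcodes and fax or telephone numbers follow regular patterns and can be picked out with regular expressions, while names generally need a named entity recognizer. The sketch below assumes six-digit postcodes and landline numbers written as an area code, a hyphen, and seven or eight digits; both patterns and the function name `contact_fields` are illustrative assumptions, not part of the patent.

```python
import re

# Six digits not embedded in a longer digit run.
POSTCODE = re.compile(r"(?<!\d)\d{6}(?!\d)")
# Area code starting with 0, a hyphen, then 7 or 8 digits.
PHONE = re.compile(r"(?<!\d)0\d{2,3}-\d{7,8}(?!\d)")


def contact_fields(text):
    """Extract postcode and phone/fax candidates from page text."""
    return {
        "postcodes": POSTCODE.findall(text),
        "phones": PHONE.findall(text),
    }
```

The lookarounds keep the postcode pattern from matching six digits inside a longer telephone number.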
In the present invention, crawling the homepage's content information thoroughly helps improve both the efficiency and the accuracy of extracting the home location of an institutional website.
Step 4: Determine whether the content information contains a home location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step.
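Step 4 can be pictured as a gazetteer lookup. The sketch below adds one assumption that the patent does not state: a place name shared by several locations (the "Phoenix Town" problem from the background section) is treated as undecided, so that the later steps get a chance to resolve it. The gazetteer contents, including the placeholder provinces, and the function name are hypothetical.

```python
# Hypothetical miniature gazetteer: place name -> candidate full locations.
GAZETTEER = {
    "Hangzhou": ["Hangzhou, Zhejiang"],
    "Phoenix Town": ["Phoenix Town, Province A", "Phoenix Town, Province B"],
}


def direct_home_location(text, gazetteer=GAZETTEER):
    """Return the single location named in `text`, or None if unclear."""
    hits = set()
    for name, locations in gazetteer.items():
        if name in text:
            hits.update(locations)
    # Only an unambiguous hit counts as a direct identification.
    return hits.pop() if len(hits) == 1 else None
```

Returning None corresponds to the "if not, proceed to the next step" branch.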
Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords from that content information in a second keyword table.
In the present invention, named entity recognition, also called "proper name recognition", refers to identifying entities with a specific meaning in text, mainly personal names, place names, institution names, and other proper nouns. For example, "Changsha County" is an entity with a specific meaning that denotes a geographic location.
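The two mechanical parts of step 5, collecting second-level URLs from a page and filling a keyword table, can be sketched with the standard library. Two simplifications are assumed: only links on the same host count as second-level URLs, and a dictionary lookup (`extract_keywords`) stands in for a real named entity recognition tool, which the patent does not name. All identifiers are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def second_level_urls(base_url, html, visited):
    """Return same-host links that have not been crawled yet."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    host = urlsplit(base_url).hostname
    seen, result = set(visited), []
    for link in parser.links:
        if urlsplit(link).hostname == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result


def extract_keywords(text, known_entities):
    """Stand-in for a NER tool: keep the known entities found in text."""
    return [entity for entity in known_entities if entity in text]
```

Keywords extracted from the homepage would go into the first keyword table, and keywords from the pages returned by `second_level_urls` into the second.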
Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province in which the current domain name is located from the ICP information, derive the corresponding city-level information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step.
In step 6, the ICP information includes the organization name, the website's ICP license number, the website name, and the website homepage.
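One plausible way to get a province from ICP information is to read the one-character province abbreviation that ICP filing numbers conventionally start with, for example 浙 for Zhejiang. The patent does not say that this is how the province is obtained, so the sketch below is an assumption; the abbreviation table is intentionally partial and the filing number in the usage is made up.

```python
import re

# Partial table of province abbreviations used as ICP number prefixes.
PROVINCE_BY_PREFIX = {
    "京": "Beijing",
    "沪": "Shanghai",
    "浙": "Zhejiang",
    "粤": "Guangdong",
    "湘": "Hunan",
    "鲁": "Shandong",
}

# One CJK character, "ICP", 备 (filing) or 证 (license), digits, then 号.
ICP_PATTERN = re.compile(r"([\u4e00-\u9fff])\s*ICP\s*[备证]\s*(\d+)\s*号")


def province_from_icp(text):
    """Return the province implied by an ICP number in `text`, if any."""
    match = ICP_PATTERN.search(text)
    if match is None:
        return None
    return PROVINCE_BY_PREFIX.get(match.group(1))
```

The city-level information would still have to come from the other keywords, as the step describes.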
In the present invention, syntactic analysis (parsing) refers to analyzing the grammatical function of the words in a sentence. For example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement.
In the present invention, the organization names, address information, and other data already obtained are used to apply named entity recognition again to identify the keywords in the content, and effective geographic location information is extracted through syntactic analysis, thereby completing the true home location of the institutional website.
Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
In the present invention, address-related information includes corresponding keywords such as contact details and organization names.
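The patent does not define "matching degree". As one possible reading, the sketch below scores each address that occurs in both keyword tables by its combined number of occurrences and returns the highest-scoring one. The scoring rule and the function name are assumptions made for illustration only.

```python
from collections import Counter


def best_common_address(first_table, second_table):
    """Return the address present in both tables with the highest count."""
    first, second = Counter(first_table), Counter(second_table)
    common = {
        address: first[address] + second[address]
        for address in first
        if address in second
    }
    if not common:
        return None  # the two tables share no address keyword
    return max(common, key=common.get)
```

Requiring the address to appear in both tables reflects the word "common" in step 7; an address that appears only on the homepage or only on the second-level pages is ignored.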
In the present invention, a web crawler takes the URL of any pending website, crawls the homepage of its top-level domain URL, and parses it to obtain the content information of the homepage. If the content information contains a home location name, the home location is identified directly. If not, a named entity recognition tool extracts the keywords from the content information and from the subordinate URLs, stores them in the first and second keyword tables, and matches the tables. The province in which the current domain name is located is then obtained from the ICP information, or else the address-related information in the two keyword tables is looked up and syntactically analyzed and the common address with the highest matching degree is matched, which gives the home location of the website. The method uses crawler technology to detect and compare website URLs automatically, which saves manual judgment and processing time, collects data promptly, improves data validity, and gives good detection results.

Claims (5)

  1. A web-crawler-based method for identifying the administrative affiliation of a website, characterized in that the method comprises the following steps:
    Step 1: Initialize the queue of pending websites and create threads;
    Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL;
    Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage;
    Step 4: Determine whether the content information contains a home location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step;
    Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords from that content information in a second keyword table;
    Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province in which the current domain name is located from the ICP information, derive the corresponding city-level information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step;
    Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
  2. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 2, if the web page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is carried out.
  3. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 2, characterized in that: the anti-crawling handling includes switching IP addresses and reducing the request rate.
  4. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 3, the content information of the homepage includes personal names, place names, institution names, postcodes, and fax and telephone numbers.
  5. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 6, the ICP information includes the organization name, the website's ICP license number, the website name, and the website homepage.
CN201710866237.1A 2017-09-22 2017-09-22 A web-crawler-based method for identifying the administrative affiliation of a website Pending CN107590265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710866237.1A CN107590265A (en) 2017-09-22 2017-09-22 A web-crawler-based method for identifying the administrative affiliation of a website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710866237.1A CN107590265A (en) 2017-09-22 2017-09-22 A web-crawler-based method for identifying the administrative affiliation of a website

Publications (1)

Publication Number Publication Date
CN107590265A true CN107590265A (en) 2018-01-16

Family

ID=61047544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710866237.1A Pending CN107590265A (en) 2017-09-22 2017-09-22 A web-crawler-based method for identifying the administrative affiliation of a website

Country Status (1)

Country Link
CN (1) CN107590265A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105743921A (en) * 2016-04-08 2016-07-06 安徽电信规划设计有限责任公司 Site information management method for IDC machine room
CN106096040A (en) * 2016-06-29 2016-11-09 中国人民解放军国防科学技术大学 Organization web ownership place method of discrimination based on search engine and device thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张季谦等: "《网页设计与制作》", 31 January 2017 *
杨明刚: "《大数据时代的网络舆情》", 30 June 2017 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108494829A (en) * 2018-02-26 2018-09-04 福建三鑫隆信息技术开发股份有限公司 A kind of method for safety monitoring and system
WO2020024896A1 (en) * 2018-08-03 2020-02-06 上海点融信息科技有限责任公司 Blockchain data search method and device and storage medium
CN112257032A (en) * 2019-10-21 2021-01-22 国家计算机网络与信息安全管理中心 Method and system for determining APP responsibility subject
CN112257032B (en) * 2019-10-21 2023-07-14 国家计算机网络与信息安全管理中心 Method and system for determining APP responsibility main body
CN111104579A (en) * 2019-12-31 2020-05-05 北京神州绿盟信息安全科技股份有限公司 Identification method and device for public network assets and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180116