CN107590265A

CN107590265A - A web crawler-based method for identifying the administrative affiliation of websites

Info

Publication number: CN107590265A
Application number: CN201710866237.1A
Authority: CN
Inventors: 邱煜铭; 范渊
Original assignee: DBAPPSecurity Co Ltd
Current assignee: DBAPPSecurity Co Ltd
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2018-01-16

Abstract

The present invention relates to a web crawler-based method for identifying the administrative affiliation of a website. A queue of pending websites is initialized and threads are created. For the URL of each pending website, the top-level domain URL of that URL is obtained, and its homepage is crawled and parsed to obtain content information, which is checked for the name of an affiliated location. If none is found, keywords extracted from the content information are stored in a first keyword table, keywords from the content of the subordinate URLs are stored in a second keyword table, and the two tables are matched. If the tables contain ICP information, the province of the current domain name is obtained from it and the corresponding prefecture-level city information is derived by syntactic analysis; otherwise, address-related information in the two tables is looked up and syntactically analyzed to find the common address with the highest matching degree, which gives the affiliated location of the website. The method uses crawler technology to detect and compare website URLs automatically, saving manual judgment and processing time; data are collected promptly, which improves data validity, and the detection results are good.

Description

A web crawler-based method for identifying the administrative affiliation of websites

Technical field

The present invention relates to the technical field of digital computing or data processing equipment or methods specially adapted for specific functions, and in particular to a web crawler-based method for identifying the administrative affiliation of websites that improves data validity and detection results.

Background art

In the current network environment, information is overwhelming in volume, and genuine and false information are mixed together. For institutional websites, especially those that are both closely tied to the public and profit-making, once a counterfeit site appears, users in most cases have no way to determine accurately what it is. If the real-world location to which the domain names of such institutional websites belong could be known, users would be better able to judge their authenticity. Likewise, internet regulators carrying out routine work such as network rectification, information collection, and institution verification need to obtain a website's geographic location by way of the geographic location of its domain name. The affiliated geographic location of the institution behind a website is an important piece of information about that website, and its absence makes it considerably harder for internet regulators to supervise websites and for visitors to tell them apart.

At present, the information available online about a large number of institutional websites is incomplete, and an institution's affiliated location cannot be learned from the information plainly shown on the site. The pages of many institutional websites carry no geographic location marking; the usual reasons are that the location is hidden deliberately to enable illegal operation of the site, or omitted by accident when the site was built. Some institutional websites also have duplicate names, or the geographic names they refer to are shared by several places. In the existing network environment, therefore, the information a website provides cannot by itself be relied on to determine the institution's true affiliated location. For example, the place name "Changsha County" can refer not only to Changsha County in Changsha City, Hunan Province, China, but also to Changsha County in Vietnam. As another example, there are 16 towns named Fenghuang Town ("Phoenix Town") within China alone, spread across the country's provinces. From the name of a website such as "Fenghuang Town People's Government Network" alone, it is impossible to identify accurately which province's Fenghuang Town is meant, and hence impossible to determine which town government's website it is.

In the prior art, this problem is typically addressed by manually searching web information service platforms that catalog institution information. However, the institution information in many of the cataloged records is missing or incomplete, and collecting and organizing website information by hand demands considerable labor and cost, takes a relatively long time, and makes it difficult to keep the data maintained and updated in real time.

Summary of the invention

To solve the problems in the prior art, the present invention provides an optimized web crawler-based method for identifying the administrative affiliation of websites. The method performs detection automatically, saves manual judgment and processing time, collects data promptly, improves data validity, and achieves good detection results.

The technical solution adopted by the present invention is a web crawler-based method for identifying the administrative affiliation of websites, the method comprising the following steps:

Step 1: Initialize the queue of pending websites and create threads;

Step 2: Obtain, from the URL of a pending website, the top-level domain URL of that URL;

Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage;

Step 4: Determine whether the content information contains the name of an affiliated location; if so, that name is the affiliated location of the pending website, and the method returns to Step 2; if not, proceed to the next step;

Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the queue of pending websites, crawl the second-level pages of those URLs, parse them to obtain the content information of the second-level pages, and store the keywords from that content information in a second keyword table;

Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, derive the corresponding prefecture-level city information by syntactic analysis, obtain the affiliated location of the website, and return to Step 2; if not, proceed to the next step;

Step 7: Search the first and second keyword tables for address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the affiliated location of the website, and return to Step 2.

Preferably, in Step 2, if the page content returned during crawling contains anti-crawling information, the URL is put back into the queue of pending websites and anti-crawling handling is performed.

Preferably, the anti-crawling handling comprises switching IP addresses and reducing the request rate.

Preferably, in Step 3, the content information of the homepage comprises personal names, place names, institution names, postal codes, and fax and telephone numbers.

Preferably, in Step 6, the ICP information comprises the organization, the website's ICP license number, the website name, and the website homepage.

The invention provides an optimized web crawler-based method for identifying the administrative affiliation of websites. A web crawler takes the URL of any pending website, crawls the homepage of the corresponding top-level domain URL, and parses it to obtain the content information of the homepage. If the content information contains the name of an affiliated location, the location is identified directly. If not, a named entity recognition tool extracts the keywords from the content information and from the pages of the subordinate URLs, stores them in a first keyword table and a second keyword table respectively, and the tables are matched. The province of the current domain name is then obtained from ICP information, or, failing that, address-related information in the two keyword tables is looked up and syntactically analyzed to match the common address with the highest matching degree, which gives the affiliated location of the website. The method uses crawler technology to detect and compare website URLs automatically, saving manual judgment and processing time; data are collected promptly, data validity is improved, and the detection results are good.

Brief description of the drawings

Fig. 1 is the flow chart of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to an embodiment, but the scope of protection of the invention is not limited thereto.

The present invention relates to a web crawler-based method for identifying the administrative affiliation of websites, the method comprising the following steps.

In the present invention, web page information is crawled by a web crawler. A web crawler, also known as a web spider, web robot, or web chaser, is a program or script that automatically fetches information from the web according to certain rules; it can crawl a website's content starting from the website's URL.

Step 1: Initialize the queue of pending websites and create threads.
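Step 1 can be pictured as a shared queue drained by a small pool of worker threads. The following is a minimal sketch under stated assumptions, not the implementation described by the patent: the names `run_workers` and `handle` and the thread count are hypothetical, and `handle` stands in for whatever per-URL processing Steps 2 to 7 perform.

```python
import queue
import threading

def run_workers(urls, handle, num_threads=4):
    """Fill a pending-website queue and drain it with worker threads.

    `handle` is a hypothetical per-URL callback (Steps 2 to 7 would live there).
    """
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()  # protects `results` across threads

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, so this thread exits
            outcome = handle(url)
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A thread-safe queue lets any worker put a URL back (for example after an anti-crawling response) without extra locking.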

In Step 2, if the page content returned during crawling contains anti-crawling information, the URL is put back into the queue of pending websites and anti-crawling handling is performed.

The anti-crawling handling comprises switching IP addresses and reducing the request rate.

In the present invention, for example, a second-level domain of Shandong University is http://www.view.sdu.edu.cn/, and the top-level domain corresponding to that domain name is sdu.edu.cn; therefore, http://www.sdu.edu.cn/ and http://www.view.sdu.edu.cn/ are put into the queue of URLs to be crawled.
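The Shandong University example can be reproduced with the standard library. This is a minimal sketch: the function name `top_level_domain_url`, the tiny `TWO_LABEL_SUFFIXES` set, and the choice to prefix the result with `www.` are assumptions made for illustration, and a real system would need a complete public-suffix list rather than five hard-coded entries.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny list of two-label suffixes (assumption).
TWO_LABEL_SUFFIXES = {"edu.cn", "com.cn", "gov.cn", "org.cn", "net.cn"}

def top_level_domain_url(url):
    """Return (domain, homepage URL) in the sense used by the example above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    # Keep one label in front of the suffix: sdu + edu.cn -> sdu.edu.cn
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    domain = ".".join(labels[-(suffix_len + 1):])
    # Assumption: the homepage is served at the "www." host of that domain.
    return domain, f"{parts.scheme}://www.{domain}/"

domain, homepage = top_level_domain_url("http://www.view.sdu.edu.cn/")
print(domain, homepage)
```

Both the derived homepage and the original URL would then be placed in the queue of URLs to be crawled.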

In the present invention, anti-crawling refers to the practice of some websites of restricting access by limiting the number of requests per unit time from a given IP address or account. Anti-crawling handling can usually be accomplished by switching IP addresses and reducing the request rate.
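The handling described above (put the URL back, switch IP, slow down) might be organized as follows. This is a sketch under assumptions: the patent does not say how an anti-crawling response is detected, so the status codes and phrases below are hypothetical heuristics, and `AntiCrawlPolicy`, the proxy names, and the doubling back-off are illustrative choices only.

```python
import itertools
import queue

# Hypothetical signs of an anti-crawling response; real sites vary widely.
BLOCK_STATUS = {403, 429}
BLOCK_PHRASES = ("访问过于频繁", "too many requests")

def looks_blocked(status, body):
    """Heuristic check for anti-crawling information in a response."""
    lowered = body.lower()
    return status in BLOCK_STATUS or any(p in lowered for p in BLOCK_PHRASES)

class AntiCrawlPolicy:
    """Tracks the current proxy (IP) and the delay between requests."""

    def __init__(self, proxies, delay=1.0, max_delay=60.0):
        self._proxies = itertools.cycle(proxies)
        self.proxy = next(self._proxies)
        self.delay = delay
        self.max_delay = max_delay

    def on_blocked(self, url, pending):
        pending.put(url)                    # put the URL back in the queue
        self.proxy = next(self._proxies)    # switch IP
        self.delay = min(self.delay * 2, self.max_delay)  # reduce request rate
```

A crawler loop would sleep for `policy.delay` seconds before each request and route the request through `policy.proxy`.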

Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage.

In Step 3, the content information of the homepage comprises personal names, place names, institution names, postal codes, and fax and telephone numbers.

In the present invention, crawling the content information of the homepage thoroughly helps improve both the efficiency and the accuracy of extracting the affiliated location of an institutional website.
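The parsing half of Step 3 can be illustrated without any network access. The sketch below assumes the homepage HTML has already been fetched; `TextExtractor`, `page_text`, and `extract_postcodes` are hypothetical names, and the six-digit pattern is a simple heuristic for Chinese postal codes rather than a validated rule. The sample page is made up.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Heuristic: a run of exactly six digits is treated as a postal code.
POSTCODE = re.compile(r"(?<!\d)\d{6}(?!\d)")

def extract_postcodes(text):
    return POSTCODE.findall(text)
```

Personal names, place names, and institution names would come from the named entity recognition step rather than from regular expressions.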

Step 4: Determine whether the content information contains the name of an affiliated location; if so, that name is the affiliated location of the pending website, and the method returns to Step 2; if not, proceed to the next step.
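One way to read Step 4 is as a lookup of the page text against a gazetteer of full administrative names. The patent does not specify how the check is done, so the sketch below is an assumption throughout: `GAZETTEER` holds two illustrative entries only, and `find_affiliation` simply prefers the longest name that occurs in the text.

```python
# Hypothetical two-entry gazetteer mapping a full administrative name to
# (province, city, county); a real table would cover every division.
GAZETTEER = {
    "湖南省长沙市长沙县": ("湖南省", "长沙市", "长沙县"),
    "山东省济南市": ("山东省", "济南市", None),
}

def find_affiliation(text, gazetteer=GAZETTEER):
    """Return the affiliation for the longest gazetteer name found, else None."""
    hits = [name for name in gazetteer if name in text]
    if not hits:
        return None  # Step 4 fails; continue with Step 5
    return gazetteer[max(hits, key=len)]
```

Requiring the full name (province plus city plus county) sidesteps the ambiguity of a bare name such as "长沙县", which is the case the later steps are meant to resolve.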

Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the queue of pending websites, crawl the second-level pages of those URLs, parse them to obtain the content information of the second-level pages, and store the keywords from that content information in a second keyword table.

In the present invention, named entity recognition, also called "proper name recognition", means identifying entities in text that have a specific meaning, mainly personal names, place names, institution names, and other proper nouns; for example, "Changsha County" is an entity with a specific meaning that denotes a geographic location.
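The bookkeeping in Step 5 (collect second-level URLs, skip those already crawled, fill a keyword table) can be sketched as below. The named entity recognition tool itself is not named in the text, so it appears only as a hypothetical callable `extract_entities`; restricting second-level URLs to the same host is also a simplifying assumption.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def secondary_urls(base_url, html, visited):
    """Same-host links found in `html` that have not been crawled yet."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlsplit(base_url).hostname
    seen, result = set(visited), []
    for link in collector.links:
        if urlsplit(link).hostname == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result

def fill_keyword_table(table, text, extract_entities):
    """Add the entities found in `text` to a keyword table (a Counter)."""
    table.update(extract_entities(text))
    return table
```

The first keyword table would be filled from the homepage text and the second from the text of each second-level page, using the same `extract_entities` callable for both.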

Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, derive the corresponding prefecture-level city information by syntactic analysis, obtain the affiliated location of the website, and return to Step 2; if not, proceed to the next step.

In Step 6, the ICP information comprises the organization, the website's ICP license number, the website name, and the website homepage.
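ICP filing numbers conventionally begin with a one-character province abbreviation (for example 京 for Beijing or 浙 for Zhejiang), which is one plausible way to obtain the province in Step 6. The sketch below rests on that reading: `province_from_icp`, the regular expression, and the partial abbreviation table are assumptions, and the number used in the usage note is a made-up placeholder.

```python
import re

# Partial abbreviation table for illustration; a full table has one entry
# per provincial-level division.
PROVINCE_BY_ABBREVIATION = {
    "京": "北京市", "沪": "上海市", "浙": "浙江省",
    "湘": "湖南省", "鲁": "山东省", "粤": "广东省",
}

# One CJK character, "ICP", 备 (filing) or 证 (license), digits, optional 号.
ICP_PATTERN = re.compile(r"([\u4e00-\u9fff])ICP(?:备|证)\s*(\d+)号?")

def province_from_icp(keywords):
    """Return (province, matched ICP text) from the first usable keyword."""
    for keyword in keywords:
        match = ICP_PATTERN.search(keyword)
        if match and match.group(1) in PROVINCE_BY_ABBREVIATION:
            return PROVINCE_BY_ABBREVIATION[match.group(1)], match.group(0)
    return None  # no ICP information; continue with Step 7
```

For instance, `province_from_icp(["联系我们", "浙ICP备00000000号"])` yields the province 浙江省; the prefecture-level city would still have to be derived from the remaining keywords.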

In the present invention, syntactic analysis (parsing) means analyzing the grammatical function of the words in a sentence; for example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement.

In the present invention, using the organization name, address information, and other details obtained, named entity recognition is applied again to identify the keywords in the content, and effective geographic location information is extracted by syntactic analysis, thereby completing the true affiliated location of the institutional website.

Step 7：It is right by searching in the first key table and the second key table whether include the information related to address Address information carries out syntactic analysis, matches matching degree highest common address, obtains the ownership place of website, return to step 2.

In the present invention, the related information in address includes the corresponding keywords such as contact method, organization.
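The text does not define "matching degree", so the following sketch adopts one possible interpretation: a candidate address counts as "common" only if it appears in both keyword tables, and its matching degree is the total number of keywords that mention it. The function name, the candidate list, and the sample keywords are all hypothetical, and simple substring matching stands in for syntactic analysis.

```python
from collections import Counter

def best_common_address(first_table, second_table, candidates):
    """Pick the candidate present in both tables with the most mentions."""
    scores = Counter()
    for candidate in candidates:
        in_first = sum(candidate in keyword for keyword in first_table)
        in_second = sum(candidate in keyword for keyword in second_table)
        if in_first and in_second:  # "common": must occur in both tables
            scores[candidate] = in_first + in_second
    if not scores:
        return None
    return scores.most_common(1)[0][0]

first = ["地址：湖南省长沙市长沙县星沙街道", "电话：0731-00000000"]
second = ["长沙市长沙县人民政府", "湖南省长沙市", "济南市"]
print(best_common_address(first, second, ["长沙市", "济南市"]))
```

Here 济南市 is discarded because it occurs in only one table, which reflects the idea that evidence should agree between the homepage and the second-level pages.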

In the present invention, a web crawler takes the URL of any pending website, crawls the homepage of the corresponding top-level domain URL, and parses it to obtain the content information of the homepage. If the content information contains the name of an affiliated location, the location is identified directly. If not, a named entity recognition tool extracts the keywords from the content information and from the pages of the subordinate URLs, stores them in a first keyword table and a second keyword table respectively, and the tables are matched. The province of the current domain name is then obtained from ICP information, or, failing that, address-related information in the two keyword tables is looked up and syntactically analyzed to match the common address with the highest matching degree, which gives the affiliated location of the website. The method uses crawler technology to detect and compare website URLs automatically, saving manual judgment and processing time; data are collected promptly, which improves data validity, and the detection results are good.

Claims

1. A web crawler-based method for identifying the administrative affiliation of websites, characterized in that the method comprises the following steps:

Step 1: Initialize the queue of pending websites and create threads;

Step 2: Obtain, from the URL of a pending website, the top-level domain URL of that URL;

Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the content information of the homepage;

Step 4: Determine whether the content information contains the name of an affiliated location; if so, that name is the affiliated location of the pending website, and the method returns to Step 2; if not, proceed to the next step;

Step 5: Using a named entity recognition tool, extract the keywords from the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the queue of pending websites, crawl the second-level pages of those URLs, parse them to obtain the content information of the second-level pages, and store the keywords from that content information in a second keyword table;

Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, derive the corresponding prefecture-level city information by syntactic analysis, obtain the affiliated location of the website, and return to Step 2; if not, proceed to the next step;

Step 7: Search the first and second keyword tables for address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the affiliated location of the website, and return to Step 2.

2. The web crawler-based method for identifying the administrative affiliation of websites according to claim 1, characterized in that: in Step 2, if the page content returned during crawling contains anti-crawling information, the URL is put back into the queue of pending websites and anti-crawling handling is performed.

3. The web crawler-based method for identifying the administrative affiliation of websites according to claim 2, characterized in that: the anti-crawling handling comprises switching IP addresses and reducing the request rate.

4. The web crawler-based method for identifying the administrative affiliation of websites according to claim 1, characterized in that: in Step 3, the content information of the homepage comprises personal names, place names, institution names, postal codes, and fax and telephone numbers.

5. The web crawler-based method for identifying the administrative affiliation of websites according to claim 1, characterized in that: in Step 6, the ICP information comprises the organization, the website's ICP license number, the website name, and the website homepage.