CN107590265A - Web-crawler-based method for identifying the administrative affiliation of a website - Google Patents
- Publication number
- CN107590265A (application CN201710866237.1A)
- Authority
- CN
- China
- Prior art keywords
- url
- website
- information
- key table
- content information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention relates to a web-crawler-based method for identifying the administrative affiliation (home location) of a website. A queue of websites to be processed is initialized and threads are created. For the URL of each pending website, the top-level domain URL is obtained, and its homepage is crawled and parsed to obtain content information, which is checked for a home-location name. If none is found, keywords extracted from the content information are stored in a first keyword table, keywords from the subordinate URLs and their content are stored in a second keyword table, and the two tables are matched. If the tables contain ICP information, the province of the current domain name is obtained from it and the prefecture-level city is determined by syntactic analysis. Otherwise, address-related information is looked up in both tables and syntactically analyzed, and the common address with the highest matching degree is taken as the website's home location. The method uses crawler technology to check and compare website URLs automatically, which saves manual judgment and processing time, collects data in a timely manner, improves data validity, and gives good detection results.
Description
Technical field
The invention relates to the technical field of digital computing or data processing equipment, or data processing methods, specially adapted for specific functions, and in particular to a web-crawler-based method for identifying the administrative affiliation of a website that improves data validity and detection results.
Background
In today's network environment information is overwhelming and of uneven reliability. For organization websites, especially those that both closely affect the public and operate for profit, users in most cases cannot accurately tell a genuine site from a counterfeit one. Knowing the real-world home location of such a website's domain name would help users judge its authenticity. For internet regulators, routine inspections such as network rectification, information collection and organization verification likewise require the geographic location of a website, obtained from the geographic location of its domain name. The geographic home location of the organization behind a website is therefore an important piece of information about the site. When it is missing, it becomes considerably harder for regulators to supervise websites and for visiting users to assess them.
At present, the information on many organization websites is incomplete, so the home location of the organization cannot be learned directly from the site. The pages of many organization websites carry no geographic location at all, usually because the location was deliberately hidden to enable illegal operation or was accidentally omitted when the site was built. Some organization websites also have duplicate names or refer to place names shared by several locations. In such cases the information a website provides is not enough to determine the organization's true home location. For example, the place name "Changsha County" can refer to Changsha County in Changsha, Hunan Province, China, but also to Changsha County in Vietnam. Likewise, there are 16 towns named "Phoenix Town" in China alone, spread across the country's provinces. From the site title "Phoenix Town People's Government Website" alone it is impossible to tell which province's Phoenix Town is meant, and therefore which town government the website belongs to.
In the prior art, this problem is typically handled by manually searching web information service platforms that index organization information. However, much of the indexed information is missing or incomplete with respect to the organization, and collecting and organizing site information by hand is costly in labor and time and is hard to keep up to date in real time.
Summary of the invention
To solve the problems in the prior art, the invention provides an improved web-crawler-based method for identifying the administrative affiliation of a website. The method performs detection automatically, saves manual judgment and processing time, collects data in a timely manner, improves data validity, and gives good detection results.
The technical solution adopted by the invention is a web-crawler-based method for identifying the administrative affiliation of a website, comprising the following steps:
Step 1: Initialize the queue of websites to be processed and create threads;
Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL;
Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the homepage's content information;
Step 4: Determine whether the content information contains a home-location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step;
Step 5: Using a named entity recognition tool, extract the keywords in the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords in that content information in a second keyword table;
Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, match the corresponding prefecture-level city information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step;
Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
Preferably, in step 2, if the page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is performed.
Preferably, the anti-crawling handling includes switching IP addresses and reducing the request rate.
Preferably, in step 3, the content information of the homepage includes person names, place names, organization names, postal codes, and fax and telephone numbers.
Preferably, in step 6, the ICP information includes the organization name, the website's ICP license number, the website name and the website homepage.
The invention provides an improved web-crawler-based method for identifying the administrative affiliation of a website. A web crawler takes the URL of any pending website and crawls the homepage of its top-level domain URL. The homepage is parsed to obtain its content information, which is checked for a home-location name. If one is present, the home location is identified directly. If not, a named entity recognition tool extracts the keywords in the content information and in the subordinate URLs, the keywords are stored in the first and second keyword tables, and the tables are matched. The province of the current domain name is then obtained from the ICP information, or address-related information in the two tables is looked up and syntactically analyzed, and the common address with the highest matching degree is matched to obtain the home location of the website. The method uses crawler technology to check and compare website URLs automatically, which saves manual judgment and processing time, collects data in a timely manner, improves data validity, and gives good detection results.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Detailed description
The invention is described in further detail below with reference to an embodiment, but the scope of protection of the invention is not limited thereto.
The invention relates to a web-crawler-based method for identifying the administrative affiliation of a website, comprising the following steps.
In the invention, web page information is crawled by a web crawler. A web crawler, also called a web spider, web robot or web chaser, is a program or script that automatically fetches web information according to certain rules; it can crawl a website's content through the website's URL.
Step 1: Initialize the queue of websites to be processed and create threads.
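Step 1 can be pictured as a shared work queue drained by worker threads. The following is a minimal Python sketch of that idea, not the patent's implementation: the `process_site` callback and the `num_threads` default are hypothetical names introduced here for illustration.

```python
import queue
import threading

def run_workers(urls, process_site, num_threads=4):
    """Drain a queue of pending website URLs with a pool of worker threads.

    `process_site` and `num_threads` are illustrative assumptions; the source
    only states that a queue is initialized and threads are created.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, so this thread exits
            try:
                outcome = process_site(url)
                with lock:  # protect the shared results dict
                    results[url] = outcome
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch a worker exits as soon as the queue is empty, so a crawler that re-queues URLs while running would need a different termination rule.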
Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL.
In step 2, if the page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is performed.
The anti-crawling handling includes switching IP addresses and reducing the request rate.
In the invention, for example, a second-level domain of Shandong University is http://www.view.sdu.edu.cn/ and the top-level domain corresponding to it is sdu.edu.cn; therefore both http://www.sdu.edu.cn/ and http://www.view.sdu.edu.cn/ are put into the queue of URLs to be crawled.
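The Shandong University example can be reproduced with a small helper. What the text calls the top-level domain is derived here as the registrable domain of the host. This is a hedged sketch: the suffix set is a tiny illustrative assumption (a real crawler would consult the full Public Suffix List), and prefixing `www.` to build the homepage URL simply mirrors the example rather than any rule stated in the source.

```python
from urllib.parse import urlsplit

# Assumption: a tiny illustrative suffix set; a real crawler would use the
# full Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"edu.cn", "com.cn", "gov.cn", "org.cn", "net.cn"}

def registrable_domain(url):
    """Return the registrable domain of `url`, e.g. 'sdu.edu.cn'."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

def urls_to_crawl(url):
    """Return the assumed site homepage URL plus the original URL, without duplicates."""
    # Assumption: the homepage lives at http://www.<domain>/, as in the example.
    home = "http://www." + registrable_domain(url) + "/"
    return [home] if home == url else [home, url]
```

Calling `urls_to_crawl("http://www.view.sdu.edu.cn/")` yields the two URLs named in the example, with the homepage first.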
In the invention, anti-crawling refers to websites restricting access by limiting the number of requests per unit time from a given IP address or account. It can generally be handled by switching IP addresses and reducing the request rate.
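The two countermeasures named above, switching IP addresses and lowering the request rate, can be sketched as a small policy object. Everything here is an assumption for illustration: the block markers, the status codes treated as blocks, the proxy list, and the doubling rule for the delay are not taken from the source.

```python
import itertools

# Assumption: illustrative markers of an anti-crawling response; real sites vary.
BLOCK_MARKERS = ("captcha", "access denied", "too many requests")

def looks_blocked(status_code, body):
    """Heuristically decide whether a response is an anti-crawling page."""
    text = body.lower()
    return status_code in (403, 429) or any(m in text for m in BLOCK_MARKERS)

class CrawlPolicy:
    """Tracks which proxy to use and how long to wait between requests."""

    def __init__(self, proxies, base_delay=1.0, max_delay=60.0):
        self._proxies = itertools.cycle(proxies)
        self.proxy = next(self._proxies)
        self.base_delay = base_delay
        self.delay = base_delay
        self.max_delay = max_delay

    def on_blocked(self):
        """Switch to the next proxy and double the delay, up to max_delay."""
        self.proxy = next(self._proxies)
        self.delay = min(self.delay * 2, self.max_delay)

    def on_success(self):
        """Return to the base delay after a normal response."""
        self.delay = self.base_delay
```

A caller would check `looks_blocked` on each response, call `on_blocked()` and put the URL back in the pending queue when it returns true, and sleep for `policy.delay` seconds before the next request.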
Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the homepage's content information.
In step 3, the content information of the homepage includes person names, place names, organization names, postal codes, and fax and telephone numbers.
In the invention, crawling the homepage's content information thoroughly helps improve both the efficiency and the accuracy of extracting an organization website's home location.
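Part of the homepage parsing in step 3 can be illustrated with the standard library alone. The sketch below extracts the visible text and then pulls out postal codes and telephone numbers with regular expressions; person, place and organization names would need a named entity recognition tool and are not covered. The number formats (6-digit postal codes, landline numbers written as area code, hyphen, 7 or 8 digits) are assumptions about mainland China conventions, not details from the source.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Assumptions: 6-digit postal codes and landline numbers written as
# area code, hyphen, 7 or 8 digits.
POSTCODE_RE = re.compile(r"(?<!\d)\d{6}(?!\d)")
PHONE_RE = re.compile(r"(?<!\d)0\d{2,3}-\d{7,8}(?!\d)")

def extract_contact_fields(html):
    """Return the visible text plus any phone numbers and postal codes in it."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    phones = PHONE_RE.findall(text)
    # Remove phone numbers first so their digits are not mistaken for postcodes.
    remainder = PHONE_RE.sub(" ", text)
    return {"text": text, "phones": phones, "postcodes": POSTCODE_RE.findall(remainder)}
```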
Step 4: Determine whether the content information contains a home-location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step.
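One way to picture the check in step 4 is a lookup of the page text against a gazetteer of place names. The sketch below is an assumption-laden illustration: the toy `GAZETTEER`, its placeholder entries for the ambiguous name, and the rule that only a single unambiguous match counts as a direct hit are all introduced here and are not specified by the source.

```python
# Assumption: a toy gazetteer mapping place names to candidate regions; a real
# system would load a full administrative-division list.
GAZETTEER = {
    "Changsha County": ["Hunan Province / Changsha"],
    # Placeholder entries standing in for a name shared by several provinces.
    "Phoenix Town": ["(province A)", "(province B)"],
}

def find_home_location(text, gazetteer=GAZETTEER):
    """Return the single unambiguous region named in `text`, or None."""
    candidates = set()
    for name, regions in gazetteer.items():
        # Only names that map to exactly one region count as a direct hit.
        if name in text and len(regions) == 1:
            candidates.add(regions[0])
    return candidates.pop() if len(candidates) == 1 else None
```

When the function returns `None`, processing would continue with the keyword tables of steps 5 to 7.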
Step 5: Using a named entity recognition tool, extract the keywords in the content information, store them in the first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords in that content information in the second keyword table.
In the invention, named entity recognition, also called "proper name recognition", means identifying entities with a specific meaning in text, mainly person names, place names, organization names and other proper nouns. For example, "Changsha County" is an entity with a specific meaning that denotes a geographic location.
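The link-following part of step 5 can be sketched with the standard library: collect the links on the current page, resolve them against the page URL, and keep only same-site URLs that have not been crawled yet. The function name, the `site_domain` parameter and the same-site rule are assumptions for illustration; the named entity recognition itself is left to an external tool and is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def second_level_urls(page_url, html, site_domain, crawled):
    """Return new same-site URLs linked from `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        url, _fragment = urldefrag(urljoin(page_url, href))
        host = urlsplit(url).hostname or ""
        same_site = host == site_domain or host.endswith("." + site_domain)
        if url.startswith("http") and same_site and url not in crawled and url not in found:
            found.append(url)
    return found
```

The returned URLs would be appended to the pending queue, and the current page URL added to `crawled`.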
Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, match the corresponding prefecture-level city information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step.
In step 6, the ICP information includes the organization name, the website's ICP license number, the website name and the website homepage.
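The source does not say how the province is read from the ICP information. One plausible route, shown here purely as an assumption, relies on the fact that ICP numbers begin with the one-character abbreviation of the issuing province-level region. The prefix map below is deliberately partial, and the function name is hypothetical.

```python
import re

# Assumption: a partial map from the one-character prefix of an ICP number to
# a province-level region; the remaining prefixes are omitted here.
ICP_PREFIX_TO_PROVINCE = {
    "京": "Beijing", "沪": "Shanghai", "粤": "Guangdong",
    "鲁": "Shandong", "湘": "Hunan", "浙": "Zhejiang", "苏": "Jiangsu",
}

# Matches strings shaped like "<prefix>ICP备<digits>号" or "<prefix>ICP证<digits>号".
ICP_RE = re.compile(r"([\u4e00-\u9fff])\s*ICP\s*[备证]\s*(\d{6,8})\s*号")

def province_from_icp(text):
    """Return (icp_number, province) for the first ICP number found, else None.

    `province` is None when the prefix is not in the partial map above.
    """
    match = ICP_RE.search(text)
    if not match:
        return None
    prefix = match.group(1)
    return match.group(0), ICP_PREFIX_TO_PROVINCE.get(prefix)
```

The prefecture-level city would still have to come from the address analysis described in the text, since the prefix identifies only the province-level region.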
In the invention, syntactic analysis (parsing) means analyzing the grammatical function of the words in a sentence. For example, in the sentence "I came late", "I" is the subject, "came" is the predicate and "late" is the complement.
In the invention, the organization name, address information and similar data already obtained are used: named entity recognition is applied again to identify the keywords in the content, and syntactic analysis extracts valid geographic location information, so as to complete the true home location of the organization website.
Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
In the invention, address-related information includes corresponding keywords such as contact details and organization names.
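The selection of the "common address with the highest matching degree" in step 7 can be sketched as a vote across the two keyword tables. The source does not define the matching degree; the sketch below assumes it is the total number of mentions of an address that appears in both tables, and the function name is hypothetical.

```python
from collections import Counter

def best_common_address(first_table, second_table):
    """Pick the address candidate shared by both tables with the most mentions.

    Each table is a list of address strings (repeats allowed). Returns None
    when the tables share no candidate.
    """
    first, second = Counter(first_table), Counter(second_table)
    shared = set(first) & set(second)
    if not shared:
        return None
    # Score = total mentions across both tables; ties are broken
    # alphabetically so the result is deterministic.
    return min(shared, key=lambda a: (-(first[a] + second[a]), a))
```

A fuzzier similarity measure could replace the exact-match count if addresses are written inconsistently across pages.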
In the invention, a web crawler takes the URL of any pending website and crawls the homepage of its top-level domain URL. The homepage is parsed to obtain its content information, which is checked for a home-location name. If one is present, the home location is identified directly. If not, a named entity recognition tool extracts the keywords in the content information and in the subordinate URLs, the keywords are stored in the first and second keyword tables, and the tables are matched. The province of the current domain name is then obtained from the ICP information, or address-related information in the two tables is looked up and syntactically analyzed, and the common address with the highest matching degree is matched to obtain the home location of the website. The method uses crawler technology to check and compare website URLs automatically, which saves manual judgment and processing time, collects data in a timely manner, improves data validity, and gives good detection results.
Claims (5)
- 1. A web-crawler-based method for identifying the administrative affiliation of a website, characterized in that the method comprises the following steps:
  Step 1: Initialize the queue of websites to be processed and create threads;
  Step 2: From the URL of a pending website, obtain the top-level domain URL of that URL;
  Step 3: Crawl the homepage of the top-level domain URL and parse it to obtain the homepage's content information;
  Step 4: Determine whether the content information contains a home-location name; if so, that name is the home location of the pending website, and the method returns to step 2; if not, proceed to the next step;
  Step 5: Using a named entity recognition tool, extract the keywords in the content information, store them in a first keyword table, and mark the current URL as crawled; put the second-level URLs found in the content of the current URL into the pending website queue, crawl the second-level pages of those URLs, parse them to obtain their content information, and store the keywords in that content information in a second keyword table;
  Step 6: Match the first keyword table against the second keyword table; determine whether the two tables contain ICP information; if so, obtain the province of the current domain name from the ICP information, match the corresponding prefecture-level city information by syntactic analysis, obtain the home location of the website, and return to step 2; if not, proceed to the next step;
  Step 7: Look up whether the first and second keyword tables contain address-related information, perform syntactic analysis on the address information, match the common address with the highest matching degree, obtain the home location of the website, and return to step 2.
- 2. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 2, if the page content contains anti-crawling information during crawling, the URL is put back into the pending website queue and anti-crawling handling is performed.
- 3. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 2, characterized in that: the anti-crawling handling includes switching IP addresses and reducing the request rate.
- 4. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 3, the content information of the homepage includes person names, place names, organization names, postal codes, and fax and telephone numbers.
- 5. The web-crawler-based method for identifying the administrative affiliation of a website according to claim 1, characterized in that: in step 6, the ICP information includes the organization name, the website's ICP license number, the website name and the website homepage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710866237.1A CN107590265A (en) | 2017-09-22 | 2017-09-22 | A kind of administrative ownership recognition methods in the website based on web crawlers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710866237.1A CN107590265A (en) | 2017-09-22 | 2017-09-22 | A kind of administrative ownership recognition methods in the website based on web crawlers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107590265A true CN107590265A (en) | 2018-01-16 |
Family
ID=61047544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710866237.1A Pending CN107590265A (en) | 2017-09-22 | 2017-09-22 | A kind of administrative ownership recognition methods in the website based on web crawlers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590265A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105743921A (en) * | 2016-04-08 | 2016-07-06 | 安徽电信规划设计有限责任公司 | Site information management method for IDC machine room |
CN106096040A (en) * | 2016-06-29 | 2016-11-09 | 中国人民解放军国防科学技术大学 | Organization web ownership place method of discrimination based on search engine and device thereof |
Non-Patent Citations (2)
Title |
---|
张季谦等: "《网页设计与制作》", 31 January 2017 * |
杨明刚: "《大数据时代的网络舆情》", 30 June 2017 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108494829A (en) * | 2018-02-26 | 2018-09-04 | 福建三鑫隆信息技术开发股份有限公司 | A kind of method for safety monitoring and system |
WO2020024896A1 (en) * | 2018-08-03 | 2020-02-06 | 上海点融信息科技有限责任公司 | Blockchain data search method and device and storage medium |
CN112257032A (en) * | 2019-10-21 | 2021-01-22 | 国家计算机网络与信息安全管理中心 | Method and system for determining APP responsibility subject |
CN112257032B (en) * | 2019-10-21 | 2023-07-14 | 国家计算机网络与信息安全管理中心 | Method and system for determining APP responsibility main body |
CN111104579A (en) * | 2019-12-31 | 2020-05-05 | 北京神州绿盟信息安全科技股份有限公司 | Identification method and device for public network assets and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104899508B (en) | A kind of multistage detection method for phishing site and system | |
CN103218431B (en) | A kind ofly can identify the system that info web gathers automatically | |
CN106096040B (en) | Organization web ownership place method of discrimination and its device based on search engine | |
CN101534306B (en) | Detecting method and a device for fishing website | |
CN102841920B (en) | Method and device for extracting webpage frame information | |
CN107590265A (en) | A kind of administrative ownership recognition methods in the website based on web crawlers | |
US8682882B2 (en) | System and method for automatically identifying classified websites | |
CN106095979B (en) | URL merging processing method and device | |
CN104516949B (en) | Web data treating method and apparatus, inquiry processing method and question answering system | |
CN106383887A (en) | Environment-friendly news data acquisition and recommendation display method and system | |
JP2012500427A (en) | Providing regional content by matching geographic characteristics | |
CN102483756A (en) | An assistant-adviser using the semantic analysis of community exchanges | |
CN107590236B (en) | Big data acquisition method and system for building construction enterprises | |
CN102073960A (en) | Method for assessing operation effect in website marketing process | |
CN108092963A (en) | Web page identification method, device, computer equipment and storage medium | |
CN105975523A (en) | Hidden hyperlink detection method based on stack | |
US20130179421A1 (en) | System and Method for Collecting URL Information Using Retrieval Service of Social Network Service | |
CN103338260A (en) | Distributed analytical system and analytical method for URL logs in network auditing | |
CN107800686A (en) | A kind of fishing website recognition methods and device | |
CN110020161B (en) | Data processing method, log processing method and terminal | |
CN106446123A (en) | Webpage verification code element identification method | |
Xu et al. | Multi-modal description of public safety events using surveillance and social media | |
WO2015074455A1 (en) | Method and apparatus for computing url pattern of associated webpage | |
Tyner et al. | Tweeting the Laurentian Great Lakes: A community opinion analysis about Great Lakes areas as assessed through mentions on Twitter | |
US20220292253A1 (en) | Automated structured data object creation and location integration into multiple location applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180116 |