CN104065532A

CN104065532A - Unrecorded website search method and system based on multi-channel data access method

Info

Publication number: CN104065532A
Application number: CN201410299875.6A
Authority: CN
Inventors: 王勇; 朱春鸽; 周润林; 丁国栋; 杨书童
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2014-06-26
Filing date: 2014-06-26
Publication date: 2014-09-24
Anticipated expiration: 2034-06-26
Also published as: CN104065532B

Abstract

The invention provides an unrecorded website search method and system based on a multi-channel data access method. The method comprises the following steps: domain names are acquired through the multi-channel data access method, and unrecorded domain names are screened out to form a domain name seed library; DNS resolution is performed on the unrecorded domain names to acquire the corresponding IP addresses; the IP addresses are located to obtain an unrecorded domain name library; and unrecorded website information is obtained through activity verification. Because domain names are acquired through multiple channels, the accuracy and comprehensiveness of the resulting unrecorded website information can be ensured; this result has been verified in an unrecorded website discovery and multi-language website recognition system. The invention also uses a polling mechanism in which all modules run continuously and concurrently, so the resulting unrecorded website information is always up to date.

Description

An unrecorded website search method and system based on a multi-channel data access method

Technical field

The present invention relates to the discovery of unrecorded websites, and in particular to an unrecorded website search method and system based on a multi-channel data access method.

Background technology

The main task of the ICP/IP address/domain name information recording management system is to collect information about domestic ICP websites and IP addresses and to standardize the management of ICPs and IP addresses. It provides a means of rapid location for national Internet network security management and information security monitoring, and a basis for decision-making by the relevant functional departments.

The unrecorded website discovery subsystem is a subsystem of the ICP/IP address/domain name information recording management system.

The main tasks of the unrecorded website discovery subsystem are to check the validity of the certificates of recorded ICP websites and to automatically discover, locate and count new ICP websites that have not been recorded. It supplies basic data to the ICP/IP address/domain name information recording management system and provides the data basis for ICP website rectification work.

Data for unrecorded website discovery can come from several access methods, such as active crawler discovery and domain name log files. A single access method cannot guarantee that the data is comprehensive and accurate, and every access method has its own advantages and shortcomings.

With active crawler discovery, more domain names are crawled starting from a seed library. The advantage is that the crawler exploits the interlinked structure of the Internet to capture many domain names with limited resources. The shortcoming is that neither the completeness nor the timeliness of the captured domain names can be guaranteed: a domain name on an isolated island, one that no crawled page links to, will never be found.

With domain name log access, the data is obtained from the logs of the mainstream domestic name servers. The advantage is that new domain names obtained this way are highly timely, and the isolated-island problem is solved. The shortcoming is that all the major domain name resolution services must be covered and a domain name must actually have been accessed by someone; if the name servers are not fully covered, or a domain name has not been accessed, a large number of domain names will be missed.

Therefore, discovery data on unrecorded websites is obtained through multiple data access channels that complement one another, thereby improving the overall accuracy of the recording work.

Summary of the invention

To overcome the above deficiencies of the prior art, the invention provides an unrecorded website search method and system based on a multi-channel data access method. The Internet is searched automatically through multiple data access channels to find the independent domain names and IP addresses of domestic websites; each domain name is checked against the recording system to determine whether it has been recorded; and unrecorded website information is pushed to the website's direct access service provider, thereby raising the recording rate of ICP websites.

To achieve the above object, the invention adopts the following technical solution:

One aspect of the invention provides an unrecorded website search method based on a multi-channel data access method, characterized in that the method comprises the following steps:

A. acquiring domain names through the multi-channel data access method, and screening out the unrecorded domain names to form a domain name seed library;

B. performing DNS resolution on the unrecorded domain names to obtain the corresponding IP addresses;

C. locating the IP addresses to obtain the unrecorded domain name library;

D. obtaining unrecorded website information through activity verification.

Preferably, in step A, the multi-channel data access method comprises a web crawler access method and a domain name log access method. Acquiring domain names through the web crawler access method comprises the following steps:

A-1-1. selecting on the order of 100,000 accessible domestic websites as seed domain names;

A-1-2. extracting domain names from the web pages fetched by the web crawler;

A-1-3. comparing the extracted domain names with the existing domain name seed library and removing duplicates;

A-1-4. adding the deduplicated domain names to the domain name seed library and entering the next cycle.

Acquiring domain names through the domain name log access method comprises the following steps:

A-2-1. obtaining original domain name log files from the mainstream domestic name servers and aggregating them, the mainstream name servers comprising the provincial name servers and the international gateway name servers;

A-2-2. formatting the original domain name log files and finding the top-level domain (TLD) corresponding to each record in the log files;

A-2-3. comparing the TLDs from step A-2-2 with the existing domain name seed library and removing duplicates;

A-2-4. adding the deduplicated domain names to the domain name seed library and entering the next cycle.

Preferably, step B comprises the following steps:

B-1. extracting the formatted website TLDs from the domain name seed library;

B-2. resolving the TLDs obtained in step B-1 through name servers, the number of name servers being greater than 1;

B-3. taking the intersection of the IP addresses returned by the different name servers to obtain the IP addresses corresponding to the domain name.

Preferably, step C comprises the following steps:

C-1. loading the IP address record information table into memory;

C-2. comparing an IP address with the information in the IP address record information table to locate it, obtaining the operator, province and direct access provider corresponding to the IP address, and excluding domain names whose IP addresses cannot be located;

C-3. repeating step C-2 until all IP addresses have been processed, thereby obtaining the unrecorded domain name library.

Preferably, step D comprises the following steps:

D-1. generating a task file containing the domain name information;

D-2. fetching the web pages corresponding to the domain names using multiple threads;

D-3. judging the activity of each website from the fetch result: if the status code of the returned HTTP message is 200 and a normal web page can be downloaded, the website is judged to be active; otherwise it is judged to be inactive.

Preferably, the method further comprises screening the unrecorded website information obtained in step D. The screening comprises:

(1) block page verification: if a domain name is redirected by the ISP to an interception page, the domain name is removed from the unrecorded website information;

(2) IP intelligent correction: if the resolution of a domain name is automatically corrected to the IP of an ISP node, the domain name is removed from the unrecorded website information;

(3) domain name deduplication: the unrecorded website information is compared with the recorded website information, and any domain name that appears in both is removed from the unrecorded website information.

Another aspect of the invention provides an unrecorded website search system based on a multi-channel data access method, characterized in that the system comprises: a data access module, a DNS resolution module, an IP location module, an activity verification module and an unrecorded website information data generation module;

the data access module acquires domain names and screens out the unrecorded domain names to form a domain name seed library;

the DNS resolution module performs DNS resolution on the unrecorded domain names to obtain the corresponding IP addresses;

the IP location module locates the IP addresses to obtain the unrecorded domain name library;

the activity verification module performs activity verification on the websites;

the unrecorded website information data generation module produces the unrecorded website information.

Preferably, the data access module comprises a web crawler access module and a domain name log access module. The web crawler access module comprises a data download module, a data analysis module and a data deduplication module. The data download module downloads data from WEB servers; the data analysis module analyzes the external links contained in the source code of the data; the data deduplication module removes, from the extracted domain names, those already present in the seed library.

Preferably, the domain name log access module comprises a data formatting module and a data deduplication module; the data formatting module formats the original domain name log files.

Preferably, the system comprises a screening module that screens the unrecorded website information; the screening comprises block page verification, IP intelligent correction and domain name deduplication.

Compared with the prior art, the beneficial effects of the invention are as follows:

(1) the multi-channel data access method ensures that the resulting unrecorded website information is accurate and comprehensive; this result has been verified in an unrecorded website discovery and multi-language website recognition system;

(2) in the domain name resolution process, the same domain name is resolved by multiple name servers and the intersection of the results is taken, which improves the validity and correctness of the resolution result;

(3) in the activity verification process, the dual check of the HTTP status code and the downloaded web page data improves the accuracy of the activity judgment; in addition, adding the activity verification module removes a portion of the unrecorded websites, making the resulting unrecorded website information more accurate and effective;

(4) the data deduplication module used by the invention deduplicates the domain names acquired through the multi-channel data access method before they enter the domain name seed library and does not rely on database deduplication, which greatly reduces the load on the underlying database;

(5) the invention uses a polling mechanism in which all modules run continuously and concurrently, which ensures that the resulting unrecorded website information is always up to date.

Brief description of the drawings

Fig. 1 is a structure diagram of the unrecorded website search system based on the multi-channel data access method according to the invention;

Fig. 2 is a flow chart of active domain name discovery by the crawler in the method of the invention;

Fig. 3 is a flow chart of domain name acquisition from domain name logs in the method of the invention;

Fig. 4 is a flow chart of domain name resolution in the method of the invention;

Fig. 5 is a flow chart of IP location in the method of the invention;

Fig. 6 is a flow chart of activity verification in the method of the invention.

Embodiment

The application scenarios of the unrecorded website discovery method and system based on the multi-channel data access method proposed in this patent include, but are not limited to, the following:

Unrecorded website discovery;

Website information statistics (for example, website operator, affiliated branch center, direct access provider, website activity, website language category);

The invention is described in detail below with reference to the accompanying drawings and specific examples.

Fig. 1 is a structure diagram of the system of the invention. The system comprises: a web crawler access module, a domain name log access module, a DNS resolution module, an IP location module, an activity verification module, a screening module and an unrecorded website information data generation module.

The method of the invention mainly comprises the following steps:

1. acquiring domain names through the multi-channel data access method and performing a preliminary screening against the recorded domain name information provided by the recording system, to form the unrecorded domain name seed library;

2. performing DNS resolution on the domain names from step 1 to obtain the corresponding IP addresses;

3. locating the IP addresses from step 2 to obtain a preliminary library of domestic unrecorded domain names (domain names whose IP addresses cannot be located are excluded);

4. performing activity detection on the domain names from step 3 to obtain the final unrecorded website information.
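The four steps above can be read as one polling round in which each stage consumes the output of the previous one. The following is a minimal sketch of that composition only; the function name `run_round` and its five callable parameters are hypothetical stand-ins for the modules, not names taken from the patent.

```python
def run_round(acquire, recorded, resolve, locate, is_active):
    """Run one round of the four-step method and return unrecorded, active domains.

    All five parameters are illustrative assumptions:
      acquire()      -> iterable of candidate domain names (all channels merged)
      recorded       -> set of domain names already in the recording system
      resolve(d)     -> list of IP address strings for domain d
      locate(ip)     -> location info, or None if the IP cannot be located
      is_active(d)   -> True if the website for domain d is judged active
    """
    # Step 1: preliminary screening against the recorded domain names.
    seeds = {d for d in acquire() if d not in recorded}
    # Step 2: DNS resolution of every seed domain.
    resolved = {d: resolve(d) for d in seeds}
    # Step 3: keep only domains with at least one locatable IP address.
    located = {d: ips for d, ips in resolved.items()
               if any(locate(ip) is not None for ip in ips)}
    # Step 4: activity detection yields the final list.
    return sorted(d for d in located if is_active(d))
```

Under a polling mechanism, a scheduler would simply call `run_round` repeatedly; how often, and whether stages overlap in time, is left open here.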

The specific technical solution is as follows:

One, new domain names are discovered by the multi-channel data access modules (the web crawler access module and the domain name log access module) to form the unrecorded domain name seed library.

Obtaining domain names by web crawler

The basic workflow of the web crawler is shown in Fig. 2:

1. first, on the order of 100,000 accessible domestic websites are selected as seed domain names (URLs);

2. these URLs are placed in the to-be-crawled URL queue;

3. a URL is taken from the to-be-crawled URL queue, its DNS is resolved to obtain the IP address of the host, and the page corresponding to the URL is downloaded and stored in the downloaded-page library. The URL is then placed in the crawled URL queue.

4. the URLs in the crawled URL queue are analyzed, the other link URLs they contain are extracted and placed in the to-be-crawled URL queue, and the next cycle begins.

In the unrecorded website discovery method and system based on the multi-channel data access method, active domain names are first selected from the recorded domain names as seed URLs, and new URLs are extracted from the web pages fetched by the web crawler; next, the extracted URLs are compared with the existing domain name seed library and deduplicated to obtain new URLs; finally, the new URLs are added to the domain name seed library and the next cycle begins.
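The queue-based crawl loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: `crawl`, `fetch`, `extract_links` and `max_pages` are hypothetical names, and page download and link extraction are passed in as functions so that the loop itself stays independent of any network library.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl; returns the set of host names encountered.

    fetch(url) returns the page HTML as a string, or None on failure.
    extract_links(html) returns the raw href values found in the page.
    Both are caller-supplied (assumed) functions.
    """
    to_crawl = deque(seed_urls)   # the to-be-crawled URL queue
    crawled = set()               # the crawled URL queue
    hosts = set()
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)       # recorded even if the download fails
        html = fetch(url)
        if html is None:
            continue
        for href in extract_links(html):
            absolute = urljoin(url, href)   # resolve relative links
            if urlsplit(absolute).scheme in ("http", "https") and absolute not in crawled:
                to_crawl.append(absolute)
    return hosts
```

The `max_pages` bound is an added safety limit so the sketch terminates; the source describes an open-ended cycle.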

This step involves three core modules: data download, data analysis and data deduplication.

Data download takes the active domain names among the recorded domain names as its basis and downloads data from WEB servers on the Internet over the HTTP protocol to obtain web page data; its main purpose is to provide the data basis for analyzing web page content.

It comprises the following steps:

1) connecting to the WEB server;

2) sending an HTTP request to the WEB server;

3) receiving the result returned by the WEB server;

4) analyzing the header of the HTTP response;

5) if the response indicates success, receiving the returned data content.

Data analysis takes the web page data obtained by data download as its basis and analyzes the external links contained in the source code.
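As an illustration of the data analysis step, the sketch below collects the host names of links that point away from the page's own host. It assumes that "external link" means an `<a href>` whose host differs from that of the page; the names `external_hosts` and `_HrefCollector` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def external_hosts(page_url, html):
    """Return the set of host names linked from html that differ from the page's host."""
    own_host = urlsplit(page_url).hostname
    collector = _HrefCollector()
    collector.feed(html)
    hosts = set()
    for href in collector.hrefs:
        parts = urlsplit(urljoin(page_url, href))   # resolve relative links first
        if parts.scheme in ("http", "https") and parts.hostname and parts.hostname != own_host:
            hosts.add(parts.hostname)
    return hosts
```

Non-HTTP links such as `mailto:` are skipped because they carry no website domain.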

Data deduplication is based on a HASH algorithm: a character string is computed into a number of type DWORD, and strings are distinguished by the difference in their numeric values. For the external links obtained by data analysis, the HASH algorithm generates a feature value for each domain name; combined with the deduplication record file, domain names already present in the seed library are rejected and new domain names are retained.
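A minimal sketch of hash-based deduplication follows. The source does not name the HASH algorithm, so CRC-32 from Python's `zlib` module is used here purely as a stand-in that yields a 32-bit unsigned value (a DWORD); `domain_hash` and `dedupe` are hypothetical names, and an in-memory set stands in for the deduplication record file.

```python
import zlib


def domain_hash(domain):
    """Map a domain name to a 32-bit unsigned value (CRC-32 is an assumed stand-in)."""
    return zlib.crc32(domain.strip().lower().encode("utf-8")) & 0xFFFFFFFF


def dedupe(candidates, seen_hashes):
    """Return the candidates not seen before, adding their hashes to seen_hashes."""
    fresh = []
    for domain in candidates:
        h = domain_hash(domain)
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(domain)
    return fresh
```

Comparing 32-bit values is cheap and avoids a database lookup, but two different domain names can share a hash; a production system would need to decide whether that small loss is acceptable or confirm matches against the stored strings.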

Obtaining domain names from domain name logs

The basic workflow is shown in Fig. 3:

1. original domain name log files are obtained from the mainstream domestic name servers (mainly the provincial name servers and the international gateway name servers) and aggregated;

2. the domain name log files obtained in the previous step are formatted, and the TLD corresponding to each record in the log files is found;

3. the domain names obtained in the previous step are compared with the existing domain name seed library and deduplicated to obtain new domain names;

4. the domain names obtained in the previous step are added to the domain name seed library, and the next cycle begins.

This step involves two core modules: data formatting and data deduplication.

Data formatting means formatting the records containing domain name information obtained from the domain name log files into the standard TLD form according to the international domain name standard, for example, music.baidu.com → baidu.com.
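The formatting step can be sketched as below. What the text calls the TLD form is, in common DNS terminology, the registrable domain (baidu.com for music.baidu.com). The suffix set `MULTI_LABEL_SUFFIXES` is a small, incomplete, hypothetical sample chosen for illustration; a real system would need a complete suffix list, which the source does not specify.

```python
# Assumed, deliberately incomplete sample of two-label public suffixes.
MULTI_LABEL_SUFFIXES = {"com.cn", "net.cn", "org.cn", "gov.cn", "edu.cn"}


def normalize_domain(name):
    """Reduce a host name from a log record to its registrable domain.

    Returns None when the name has no registrable part (e.g. a bare suffix).
    """
    labels = name.strip().rstrip(".").lower().split(".")
    if len(labels) < 2:
        return None
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_LABEL_SUFFIXES:
        # The suffix itself has two labels, so one more label is needed.
        if len(labels) < 3:
            return None
        return ".".join(labels[-3:])
    return last_two
```

For example, `normalize_domain("music.baidu.com")` yields `"baidu.com"`, matching the example in the text.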

Data deduplication follows the same principle and content as in the web crawler case above.

Two, the DNS resolution module performs DNS resolution on the domain names to obtain the corresponding IP addresses

The main steps are:

1. extracting the formatted website TLDs from the domain name seed library;

2. resolving the domain names obtained in the previous step through multiple name servers;

3. taking the intersection of the IP addresses returned by the different name servers to obtain the IP addresses corresponding to the domain name.

The basic workflow is shown in Fig. 4:

1) connecting to the DNS server;

2) sending a request to the DNS server;

3) receiving the result returned by the DNS server;

4) analyzing the results returned by the DNS servers and taking the intersection of the different DNS resolution results;

5) if the intersection is not empty, obtaining the IP list corresponding to the domain name.
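The intersection in steps 4) and 5) can be sketched as follows. The sketch covers only the set logic: it assumes the per-server answers have already been collected into a dictionary, because querying a specific name server requires a DNS client that the Python standard library does not provide. `intersect_answers` is a hypothetical name.

```python
def intersect_answers(answers_by_resolver):
    """Return the sorted IP addresses that every name server agreed on.

    answers_by_resolver maps a name server identifier to the iterable of
    IP address strings it returned for one domain name.
    An empty result means the servers share no answer for that domain.
    """
    if len(answers_by_resolver) < 2:
        raise ValueError("at least two name servers are required")
    sets = [set(ips) for ips in answers_by_resolver.values()]
    return sorted(set.intersection(*sets))
```

Requiring agreement between servers filters out answers that only one server gives, at the cost of discarding domains whose servers legitimately return different addresses.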

Three, the IP location module locates the IP addresses of the domain names that have completed DNS resolution, forming the library of domestic unrecorded domain names

The basic workflow is shown in Fig. 5:

1. loading the IP address record information table into memory;

2. taking an IP address, comparing it with the information in the IP address record information table to locate it, and obtaining, from the corresponding record in the table, the operator, province and direct access provider for this IP address; domain names whose IP addresses cannot be located are excluded;

3. repeating step 2 until all IP addresses have been processed, obtaining the preliminary library of domestic unrecorded domain names.

Note: the IP address record information table contains the operator, province and direct access provider information corresponding to each IP address.
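One way to hold the IP address record information table in memory is a sorted list of address ranges searched by bisection, sketched below. The table layout (CIDR block, operator, province, direct access provider), the assumption of non-overlapping IPv4 blocks, and the names `IpLocator` and `locate_domains` are all illustrative and not taken from the source.

```python
import bisect
import ipaddress


class IpLocator:
    """In-memory lookup over non-overlapping IPv4 blocks (an assumed table layout)."""

    def __init__(self, rows):
        # rows: iterable of (cidr, operator, province, access_provider)
        entries = []
        for cidr, operator, province, provider in rows:
            net = ipaddress.ip_network(cidr)
            entries.append((int(net.network_address), int(net.broadcast_address),
                            operator, province, provider))
        entries.sort()
        self._entries = entries
        self._starts = [e[0] for e in entries]

    def locate(self, ip):
        """Return (operator, province, access_provider), or None if not in the table."""
        value = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._entries[i][1]:
            return self._entries[i][2:]
        return None


def locate_domains(domain_ips, locator):
    """Keep only domains with a locatable IP, mapped to the first location found."""
    located = {}
    for domain, ips in domain_ips.items():
        for ip in ips:
            info = locator.locate(ip)
            if info is not None:
                located[domain] = info
                break
    return located
```

Domains absent from the result of `locate_domains` correspond to the domain names the text excludes because their IP addresses cannot be located.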

Four, the activity verification module performs activity detection on the domestic unrecorded domain names, forming the final unrecorded website information

The basic workflow is shown in Fig. 6:

1. generating a task file containing the domain name information;

2. fetching the web pages corresponding to the domain names using multiple threads;

3. judging the activity of each website from the fetch result: if the status code of the returned HTTP message is 200 and a normal web page can be downloaded, the website is judged to be active; otherwise it is judged to be inactive.
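The multithreaded activity check can be sketched with a thread pool, as below. The source does not define a "normal web page", so this sketch assumes a non-empty body containing an `<html` tag; the names `is_active`, `verify_activity`, `fetch` and `workers` are hypothetical, and the HTTP download is passed in as a function.

```python
from concurrent.futures import ThreadPoolExecutor


def is_active(status, body):
    """Dual check: status code 200 and a body that looks like a web page.

    The "<html" test is an assumed stand-in for "a normal web page".
    """
    return status == 200 and bool(body) and b"<html" in body.lower()


def verify_activity(domains, fetch, workers=8):
    """Check all domains concurrently; returns {domain: True/False}.

    fetch(domain) is a caller-supplied function returning (status, body_bytes);
    any exception it raises is treated as an inactive website.
    """
    def check(domain):
        try:
            status, body = fetch(domain)
        except Exception:
            return domain, False
        return domain, is_active(status, body)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check, domains))
```

Threads suit this step because the work is dominated by waiting on network responses rather than computation.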

Note: after the website activity detection, the following screening module is added in order to obtain more accurate unrecorded website information:

a. block page verification: if a domain name is redirected by the Internet service provider (ISP) to an interception page, the domain name is removed from the unrecorded website information;

b. IP intelligent correction: under certain conditions (for example, poor user network conditions, browser cache errors, excessive traffic to the website server, or browser incompatibility), the resolution of a domain name is automatically corrected to the IP of an ISP node; such a domain name is removed from the unrecorded website information;

c. comparison with the latest recorded website information: any domain name that appears in both the recorded website information and the unrecorded website information is removed from the unrecorded website information.
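The three screening rules can be expressed as successive filters, sketched below. How a block page or a corrected IP is recognized is not described in the source, so the sketch assumes both are detectable by membership in known sets; `screen` and all of its parameters are hypothetical.

```python
def screen(candidates, recorded_domains, isp_block_pages, isp_node_ips):
    """Apply the three screening rules and return the surviving domain names.

    candidates maps a domain to {"final_url": str, "ips": set of str}, where
    final_url is the URL reached after redirects (an assumed data shape).
    """
    kept = []
    for domain, info in candidates.items():
        # a. block page verification: redirected to a known ISP interception page.
        if info["final_url"] in isp_block_pages:
            continue
        # b. IP intelligent correction: resolution points at a known ISP node.
        if set(info["ips"]) & isp_node_ips:
            continue
        # c. deduplication against the latest recorded website information.
        if domain in recorded_domains:
            continue
        kept.append(domain)
    return kept
```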

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or replaced with equivalents, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. An unrecorded website search method based on a multi-channel data access method, characterized in that the method comprises the following steps:

C. locating the IP addresses to obtain the unrecorded domain name library;

D. obtaining unrecorded website information through activity verification.

2. The method of claim 1, characterized in that, in step A, the multi-channel data access method comprises a web crawler access method and a domain name log access method; acquiring domain names through the web crawler access method comprises the following steps:

A-1-2. extracting domain names from the web pages fetched by the web crawler;

3. The method of claim 1, characterized in that step B comprises the following steps:

B-1. extracting the formatted website TLDs from the domain name seed library;

4. The method of claim 1, characterized in that step C comprises the following steps:

C-1. loading the IP address record information table into memory;

5. The method of claim 1, characterized in that step D comprises the following steps:

D-1. generating a task file containing the domain name information;

D-2. fetching the web pages corresponding to the domain names using multiple threads;

6. The method of claim 1, characterized in that the method further comprises screening the unrecorded website information obtained in step D; the screening comprises:

7. An unrecorded website search system based on a multi-channel data access method, characterized in that the system comprises: a data access module, a DNS resolution module, an IP location module, an activity verification module and an unrecorded website information data generation module;

8. The system of claim 7, characterized in that the data access module comprises a web crawler access module and a domain name log access module; the web crawler access module comprises a data download module, a data analysis module and a data deduplication module; the data download module downloads data from WEB servers; the data analysis module analyzes the external links contained in the source code of the data; the data deduplication module removes, from the extracted domain names, those already present in the seed library.

9. The system of claim 7, characterized in that the domain name log access module comprises a data formatting module and a data deduplication module; the data formatting module formats the original domain name log files.

10. The system of claim 7, characterized in that the system comprises a screening module that screens the unrecorded website information; the screening comprises block page verification, IP intelligent correction and domain name deduplication.