CN101488140B - Method and apparatus for confirming website type - Google Patents

Method and apparatus for confirming website type

Publication number
CN101488140B
Authority
CN
China
Prior art keywords
network resource
website
domain name
resource identifier
website domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008101858391A
Other languages
Chinese (zh)
Other versions
CN101488140A (en
Inventor
张国强
陈晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xunlei Network Technology Co Ltd
Original Assignee
Shenzhen Xunlei Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xunlei Network Technology Co Ltd filed Critical Shenzhen Xunlei Network Technology Co Ltd
Priority to CN2008101858391A priority Critical patent/CN101488140B/en
Publication of CN101488140A publication Critical patent/CN101488140A/en
Application granted granted Critical
Publication of CN101488140B publication Critical patent/CN101488140B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to network communication technology, and in particular to a method and an apparatus for determining a website type. It addresses a problem in the prior art: determining a website type requires the content of every website to be crawled and analyzed, which demands a large amount of storage space and computation. In the method of the embodiment, the correspondence between a website domain name and network resource identifiers is determined; all network resource identifiers corresponding to the website domain name are matched against a pre-established network resource identifier set; and when the ratio of matched identifiers to all identifiers corresponding to the website domain name is greater than a first threshold, the website type of the domain name is determined to be the website type corresponding to the identifier set. The method reduces the storage space and computation required.

Description

A method and apparatus for determining a website type
Technical field
The present invention relates to network communication technology, and in particular to a method and apparatus for determining a website type.
Background
An Internet resource search system lets users quickly search for resources on the Internet. These resources include digital music, films and television programs, software, books and other content, stored in a variety of file formats.
A user enters keywords for the resource to be downloaded into the search system, obtains a download address, and then downloads the resource.
Because the search system discovers resources automatically, it inevitably finds some illegal resources (such as pirated films or pornographic material), and users will find these resources when they submit the corresponding keywords.
If the type of a website can be determined, the addresses of websites containing illegal resources can be withheld from users.
At present, one method determines the website type by analyzing website content.
Taking pornographic resources as an example, a website that hosts such resources usually contains a large number of pornography-related keywords. The website can therefore be identified as a pornographic website, and its type determined, by statistically analyzing the distribution of those keywords in its content.
Because this method must crawl and analyze the content of every website, it requires a large amount of storage space and computation.
In summary, determining a website type currently requires crawling and analyzing the content of every website, and therefore requires a large amount of storage space and computation.
Summary of the invention
Embodiments of the invention provide a method and apparatus for determining a website type, to solve the prior-art problem that determining a website type requires crawling and analyzing the content of every website and therefore requires a large amount of storage space and computation.
A method for determining a website type provided by an embodiment of the invention comprises:
determining the correspondence between a website domain name and network resource identifiers;
matching all the network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set, wherein a network resource identifier is a content signature (CID) used to identify a downloaded file, and the CID is obtained by applying a preset algorithm to the content data of the binary file;
when the ratio of matched network resource identifiers to all the network resource identifiers corresponding to the website domain name is greater than a first threshold, determining that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
An apparatus for determining a website type provided by an embodiment of the invention comprises:
a correspondence determination module, configured to determine the correspondence between a website domain name and network resource identifiers;
a matching module, configured to match all the network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set, wherein a network resource identifier is a content signature (CID) used to identify a downloaded file, and the CID is obtained by applying a preset algorithm to the content data of the binary file;
a processing module, configured to determine, when the ratio of matched network resource identifiers to all the network resource identifiers corresponding to the website domain name is greater than a first threshold, that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
The embodiment of the invention determines the correspondence between a website domain name and network resource identifiers; matches all the network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set; and, when the ratio of matched identifiers to all the identifiers corresponding to the website domain name is greater than a first threshold, determines that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set. Because it can be determined whether the website corresponding to a website domain name hosts monitored network resources, there is no need to crawl and analyze the content of every website in order to determine whether a website provides illegal resources. This reduces the storage space and computation required and improves processing speed and efficiency.
Description of drawings
Fig. 1 is a schematic structural diagram of the apparatus for determining a website type according to an embodiment of the invention;
Fig. 2 is a schematic flowchart of the method for determining a website type according to an embodiment of the invention.
Embodiment
The embodiment of the invention establishes a network resource identifier set in advance. After the correspondence between a website domain name and network resource identifiers has been determined, all the network resource identifiers corresponding to the website domain name are matched against the pre-established set. When the ratio of matched identifiers to all the identifiers corresponding to the website domain name is greater than a first threshold, the website type corresponding to the website domain name is determined to be the website type corresponding to the identifier set. Because the content of each website does not need to be crawled and analyzed, the storage space and computation required are reduced.
The network resource types include, but are not limited to, one or more of the following:
films and television programs, music, software, games, and so on.
A network resource identifier, that is, a content signature (Content Identity, CID), is used to identify a downloaded file. The CID is obtained by applying a preset algorithm to the content data of a binary file. The preset algorithm may be any algorithm that produces different results for the content data of different binary files, so that the result (the content signature) uniquely identifies the binary file; alternatively, it may be an algorithm whose results repeat at an extremely low rate that is within an acceptable range.
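The following is a minimal sketch of how such a content signature could be computed. The text does not name the preset algorithm, so SHA-1 is purely an assumption here, and the function name `compute_cid` is hypothetical.

```python
import hashlib


def compute_cid(content: bytes) -> str:
    """Return a hex content signature for a file's binary content.

    SHA-1 is an assumption made for illustration; the text only requires
    that different contents give different (or very rarely equal) results.
    """
    return hashlib.sha1(content).hexdigest()


# Identical content yields the same signature; different content does not.
cid_first = compute_cid(b"example file bytes")
cid_again = compute_cid(b"example file bytes")
cid_other = compute_cid(b"different file bytes")
print(cid_first == cid_again, cid_first == cid_other)
```

Because the signature depends only on the bytes of the file, the same file offered under different names or on different websites would map to the same identifier in this sketch.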
The correspondence between website domain names and network resource identifiers can be obtained from download information; alternatively, it can be configured in advance.
Download information comprises the address from which a network resource is downloaded (such as a URL (Uniform Resource Locator)) and the network resource identifier.
Embodiments of the invention can obtain download information through P2SP (Point To Server Point, point to server and point) download technology.
P2SP is a technology that increases download speed through multi-point transmission. Because a P2SP download must communicate with a server, a large amount of download information is stored on the server side.
It should be noted that the way embodiments obtain download information is not limited to P2SP download technology; any other way of obtaining download information is equally applicable.
The embodiments of the invention are described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the apparatus for determining a website type according to an embodiment of the invention comprises a correspondence determination module 10, a matching module 20 and a processing module 30.
The correspondence determination module 10 is configured to determine the correspondence between website domain names and network resource identifiers.
The correspondence between website domain names and network resource identifiers can be stored in a database, a file or another form. It can be stored in the apparatus of this embodiment, or in another entity from which the apparatus of this embodiment can look it up.
If a database is used, it can be implemented with relational database technology. For example, relational database software can be installed on a server and accessed through the application programming interface (API) supplied by the database vendor. Relational databases generally use Structured Query Language (SQL) as the interface for managing database content.
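As an illustration of the relational storage option, the sketch below keeps the correspondence in an in-memory SQLite database. SQLite, the table name `domain_resource` and its column names are assumptions chosen for the example; the text does not prescribe a particular database product or schema.

```python
import sqlite3

# In-memory database for illustration only; a real deployment would use
# whichever relational database the operator has installed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domain_resource ("
    " domain TEXT NOT NULL,"
    " cid TEXT NOT NULL,"
    " PRIMARY KEY (domain, cid))"
)

rows = [
    ("domain-A", "id-A"),
    ("domain-B", "id-B"),
    ("domain-A", "id-C"),
    ("domain-A", "id-A"),  # duplicate pair, ignored by the primary key
]
conn.executemany("INSERT OR IGNORE INTO domain_resource VALUES (?, ?)", rows)

# Look up every identifier recorded for one domain name.
cursor = conn.execute(
    "SELECT cid FROM domain_resource WHERE domain = ? ORDER BY cid",
    ("domain-A",),
)
ids_for_a = [row[0] for row in cursor]
print(ids_for_a)
```

The composite primary key is one simple way to ensure that each (domain name, identifier) pair is stored only once.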
The correspondence determination module 10 may further comprise an extraction module 100, a conversion module 110 and a correspondence building module 120.
The extraction module 100 is configured to extract, at a set time interval, the download address and network resource identifier from each piece of download information obtained.
Taking P2SP as an example, when users download with P2SP technology, the server side stores a large amount of download information. Because download information accumulates on the server very quickly, analyzing each new piece as it arrives would place very high demands on the apparatus. A better approach is to set a time interval, for example 24 hours, and analyze the download information added on the server once per interval. A folder can be created for this purpose and download information that has been analyzed placed in it, so that the next run can quickly identify which download information is new.
In a specific implementation, each time the extraction module 100 processes a piece of download information, it extracts from it a two-tuple of download address and network resource identifier. After all download information has been processed, a set of (download address, network resource identifier) two-tuples is obtained.
The conversion module 110 is configured to convert each download address extracted by the extraction module 100 into a website domain name.
In a specific implementation, after the conversion module 110 has converted every download address into a website domain name, a set of (website domain name, network resource identifier) two-tuples is obtained.
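The conversion from download address to website domain name can be sketched as follows, assuming the download addresses are ordinary URLs and treating the URL host name as the website domain name. The function name `to_domain` and the sample addresses are hypothetical.

```python
from urllib.parse import urlsplit


def to_domain(download_address: str) -> str:
    """Return the host name of a download URL (lower-cased, without port)."""
    host = urlsplit(download_address).hostname
    return host or ""


# (download address, network resource identifier) two-tuples
address_pairs = [
    ("http://Example.com/files/a.bin", "cid-a"),
    ("http://example.org:8080/b.bin", "cid-b"),
]

# (website domain name, network resource identifier) two-tuples
domain_pairs = [(to_domain(url), cid) for url, cid in address_pairs]
print(domain_pairs)
```

Whether subdomains should be collapsed into a single registered domain is a design choice the text leaves open; this sketch keeps the full host name.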
The correspondence building module 120 is configured to determine the correspondence between website domain names and network resource identifiers.
The correspondence building module 120 extracts, from the set of (website domain name, network resource identifier) two-tuples, all the network resource identifiers corresponding to the same website domain name, thereby determining the correspondence between website domain names and network resource identifiers.
Specifically, the correspondence building module 120 can first allocate a group to each distinct website domain name, and then place the network resource identifier of each two-tuple into the group of the website domain name that the two-tuple contains.
For example, consider the two-tuples (website domain name A, network resource identifier A), (website domain name B, network resource identifier B) and (website domain name A, network resource identifier C).
Website domain name A is allocated group 1 and website domain name B is allocated group 2; network resource identifiers A and C are then placed in group 1, and network resource identifier B is placed in group 2.
This determines the correspondence between website domain names and network resource identifiers.
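The grouping described above can be sketched with a dictionary of sets, one set per website domain name. The function name `group_by_domain` and the placeholder values are illustrative assumptions.

```python
from collections import defaultdict


def group_by_domain(pairs):
    """Collect the identifiers of each domain name into one group (a set)."""
    groups = defaultdict(set)
    for domain, cid in pairs:
        groups[domain].add(cid)
    return dict(groups)


pairs = [
    ("domain-A", "id-A"),
    ("domain-B", "id-B"),
    ("domain-A", "id-C"),
]
groups = group_by_domain(pairs)
print(groups)
```

Using a set for each group means an identifier that appears in several download records for the same domain name is counted once when ratios are computed later.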
Because each piece of download information represents one download of a resource by a user, it is also possible to count, within a single website domain name, the number of times the network resource corresponding to a given network resource identifier has been downloaded, and thus to gauge how much attention that website domain name receives.
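A minimal sketch of this counting, assuming each download record has already been reduced to a (website domain name, network resource identifier) pair; the record values are made up for illustration.

```python
from collections import Counter

# One entry per download event; repeated pairs mean repeated downloads.
records = [
    ("domain-A", "id-A"),
    ("domain-A", "id-A"),
    ("domain-A", "id-C"),
    ("domain-B", "id-B"),
]

# Downloads of each resource within each domain name.
downloads_per_resource = Counter(records)

# Total downloads per domain name, a rough measure of attention.
downloads_per_domain = Counter(domain for domain, _ in records)

print(downloads_per_resource[("domain-A", "id-A")])
print(downloads_per_domain["domain-A"])
```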
The matching module 20 is configured to match all the network resource identifiers corresponding to a website domain name, as determined by the correspondence determination module 10, against the pre-established network resource identifier set.
The processing module 30 is configured to determine that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set when the ratio of matched network resource identifiers to all the network resource identifiers corresponding to the website domain name is greater than a first threshold.
The correspondence between network resource identifier sets and website types can be stored in a database, a file or another form. It can be stored in the apparatus of this embodiment, or in another entity from which the apparatus of this embodiment can look it up.
If a database is used, it can be implemented with relational database technology. For example, relational database software can be installed on a server and accessed through the application programming interface supplied by the database vendor. Relational databases generally use Structured Query Language as the interface for managing database content.
The correspondence between a network resource identifier set and a website type is determined by the content of the network resources whose identifiers are in the set. For example, if the content is pirated films, the website type corresponding to the set is a pirated-film website; if the content is pornographic films, the website type corresponding to the set is a pornographic website.
Whichever website type needs to be identified, a corresponding network resource identifier set can be established as required.
In a specific implementation, suppose the first threshold is set to 10%, the percentage determined by the computing module 310 is 75%, and the website type corresponding to the network resource identifier set is a pirated-film website. Since 75% is greater than 10%, the website type of the website domain name is determined to be a pirated-film website.
The processing module 30 may further comprise a quantity determination module 300, a computing module 310 and a website type determination module 320.
The quantity determination module 300 is configured to determine, from the matching result of the matching module 20, how many of the network resource identifiers corresponding to the website domain name are contained in the network resource identifier set.
For example, suppose the network resource identifier set contains four identifiers, A, B, C and D, and the network resource identifiers corresponding to website domain name 1 are B, C, D, E and F. The number of identifiers of website domain name 1 contained in the set is then 3.
The computing module 310 is configured to determine, from the quantity determined by the quantity determination module 300 and the total number of network resource identifiers corresponding to the website domain name, the percentage of the website domain name's network resource identifiers that were matched.
For example, the network resource identifiers corresponding to website domain name 1 are B, C, D, E and F, and 3 of them are contained in the network resource identifier set; the matched identifiers therefore account for 3/5 = 60% of all the network resource identifiers corresponding to the website domain name.
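The worked example can be reproduced with a set intersection. This is only a sketch; the function name `matched_ratio` is hypothetical, and returning 0.0 for a domain name with no identifiers is an assumption the text does not address.

```python
def matched_ratio(site_ids, reference_set):
    """Fraction of a domain name's identifiers found in the reference set."""
    site_ids = set(site_ids)
    if not site_ids:
        return 0.0  # assumption: a domain name with no identifiers matches nothing
    matched = site_ids & set(reference_set)
    return len(matched) / len(site_ids)


reference_set = {"A", "B", "C", "D"}      # pre-established identifier set
site_ids = {"B", "C", "D", "E", "F"}      # identifiers of website domain name 1

matched_count = len(site_ids & reference_set)
ratio = matched_ratio(site_ids, reference_set)
print(matched_count, ratio)
```

Note that the denominator is the number of identifiers belonging to the website domain name, not the size of the reference set.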
The website type determination module 320 is configured to determine, when the percentage determined by the computing module 310 is greater than the first threshold, that the website type corresponding to the network resource identifier set is the type of the website corresponding to the website domain name.
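The threshold decision might be sketched as below. The function name `determine_site_type` is hypothetical, and returning `None` when the threshold is not exceeded is an assumption; the text only states what happens when the percentage is greater than the first threshold.

```python
from typing import Optional, Set


def determine_site_type(site_ids: Set[str], reference_set: Set[str],
                        site_type: str, first_threshold: float) -> Optional[str]:
    """Return site_type if the matched ratio exceeds first_threshold."""
    if not site_ids:
        return None
    ratio = len(site_ids & reference_set) / len(site_ids)
    # Strictly greater than, as in the text.
    return site_type if ratio > first_threshold else None


# 3 of 4 identifiers match (75%), and the first threshold is 10%.
result = determine_site_type(
    {"B", "C", "D", "E"}, {"A", "B", "C", "D"}, "pirated-film website", 0.10
)
print(result)
```

A ratio exactly equal to the threshold does not trigger the classification in this sketch, which follows the "greater than" wording.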
To build the network resource identifier set, several websites known to contain a large number of network resources that need to be monitored, for example three websites, can first be found; the network resource identifiers of the resources on those websites are obtained and combined into the set. Accordingly, the apparatus for determining a website type may further comprise a set building module 40.
The set building module 40 is configured to determine a plurality of sample websites of the same website type, obtain network resources from the sample websites, determine the network resource identifier of each network resource obtained, and combine the identifiers into the network resource identifier set.
Preferably, the sample websites can be checked periodically to see whether they still exist. If a sample website still exists, no action is needed; otherwise, another sample website can be found and the set updated.
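A sketch of combining identifiers from sample websites into one set, assuming the identifiers of each sample website's resources have already been computed. The function name `build_reference_set` and the sample data are hypothetical.

```python
def build_reference_set(sample_sites):
    """Union the identifiers collected from sample websites of one type."""
    reference = set()
    for cids in sample_sites.values():
        reference |= set(cids)
    return reference


# Hypothetical identifiers gathered from three sample websites of one type.
sample_sites = {
    "sample-one.example": ["A", "B"],
    "sample-two.example": ["B", "C"],
    "sample-three.example": ["D"],
}
reference_set = build_reference_set(sample_sites)
print(sorted(reference_set))
```

When a sample website disappears and is replaced, rebuilding the set from the current samples is one simple way to apply the update.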
Because the number of network resource identifiers determined for a website domain name may be very small, for example only a few, there may be no need to determine whether the website corresponding to that domain name hosts monitored network resources. The apparatus of the embodiment of the invention may therefore further comprise a trigger module 50.
The trigger module 50 is configured to trigger the matching module 20 to match all the network resource identifiers corresponding to a website domain name against the pre-established network resource identifier set when, in the correspondence determined by the correspondence determination module 10, the number of network resource identifiers corresponding to that website domain name is greater than a second threshold.
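The second-threshold check can be sketched as a filter applied before matching. The function name `domains_to_match` and the threshold value of 2 are assumptions; the text does not give a value for the second threshold.

```python
def domains_to_match(groups, second_threshold):
    """Keep only domain names with more identifiers than second_threshold."""
    return {
        domain: cids
        for domain, cids in groups.items()
        if len(cids) > second_threshold
    }


groups = {
    "domain-A": {"B", "C", "D", "E", "F"},
    "domain-B": {"B"},
}
selected = domains_to_match(groups, second_threshold=2)
print(sorted(selected))
```

Skipping small groups avoids classifying a website on the basis of one or two downloads, where a single match would produce a misleadingly high ratio.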
It should be noted that the apparatus of the embodiment of the invention can be an independent device, or can be the server that provides downloads in P2SP technology.
Because the embodiment of the invention can analyze download information continuously (extracting, at the set time interval, the download address and network resource identifier from each piece of download information obtained), websites that offer illegal resources for download can be discovered promptly. A website can be discovered as soon as any user downloads from it, so websites that none of the network resource search systems can find can also be discovered.
As shown in Fig. 2, the method of the embodiment of the invention comprises the following steps:
Step 500: determine the correspondence between website domain names and network resource identifiers.
The correspondence between website domain names and network resource identifiers can be stored in a database, a file or another form. It can be stored in the apparatus of this embodiment, or in another entity from which the apparatus of this embodiment can look it up.
If a database is used, it can be implemented with relational database technology. For example, relational database software can be installed on a server and accessed through the application programming interface supplied by the database vendor. Relational databases generally use Structured Query Language as the interface for managing database content.
Step 501: match all the network resource identifiers corresponding to the determined website domain name against the pre-established network resource identifier set.
The network resource identifier set is established by the following steps:
Step S1: determine a plurality of sample websites of the same website type;
Step S2: obtain network resources from the sample websites;
Step S3: determine the network resource identifier of each network resource obtained, and combine the identifiers into the network resource identifier set.
Preferably, the sample websites can be checked periodically to see whether they still exist. If a sample website still exists, no action is needed; otherwise, another sample website can be found and the set updated.
Step 502: when the ratio of matched network resource identifiers to all the network resource identifiers corresponding to the website domain name is greater than a first threshold, determine that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
The correspondence between network resource identifier sets and website types can be stored in a database, a file or another form. It can be stored in the apparatus of this embodiment, or in another entity from which the apparatus of this embodiment can look it up.
If a database is used, it can be implemented with relational database technology. For example, relational database software can be installed on a server and accessed through the application programming interface supplied by the database vendor. Relational databases generally use Structured Query Language as the interface for managing database content.
The correspondence between a network resource identifier set and a website type is determined by the content of the network resources whose identifiers are in the set. For example, if the content is pirated films, the website type corresponding to the set is a pirated-film website; if the content is pornographic films, the website type corresponding to the set is a pornographic website.
Whichever website type needs to be identified, a corresponding network resource identifier set can be established as required.
In a specific implementation, suppose the first threshold is set to 10%, the percentage determined is 75%, and the website type corresponding to the network resource identifier set is a pirated-film website. Since 75% is greater than 10%, the website type of the website domain name is determined to be a pirated-film website.
Step 500 may further comprise:
Step a500: at a set time interval, extract the download address and network resource identifier from each piece of download information obtained.
Taking P2SP as an example, when users download with P2SP technology, the server side stores a large amount of download information. Because download information accumulates on the server very quickly, analyzing each new piece as it arrives would place very high demands on the apparatus. A better approach is to set a time interval, for example 24 hours, and analyze the download information added on the server once per interval. A folder can be created for this purpose and download information that has been analyzed placed in it, so that the next run can quickly identify which download information is new.
In a specific implementation, each time a piece of download information is processed, a two-tuple of download address and network resource identifier is extracted from it. After all download information has been processed, a set of (download address, network resource identifier) two-tuples is obtained.
Step b500: convert each extracted download address into a website domain name.
In a specific implementation, after every download address has been converted into a website domain name, a set of (website domain name, network resource identifier) two-tuples is obtained.
Step c500: determine the correspondence between website domain names and network resource identifiers.
All the network resource identifiers corresponding to the same website domain name are extracted from the set of (website domain name, network resource identifier) two-tuples, thereby determining the correspondence between website domain names and network resource identifiers.
Specifically, a group can first be allocated to each distinct website domain name, and the network resource identifier of each two-tuple then placed into the group of the website domain name that the two-tuple contains.
For example, consider the two-tuples (website domain name A, network resource identifier A), (website domain name B, network resource identifier B) and (website domain name A, network resource identifier C).
Website domain name A is allocated group 1 and website domain name B is allocated group 2; network resource identifiers A and C are then placed in group 1, and network resource identifier B is placed in group 2.
This determines the correspondence between website domain names and network resource identifiers.
Because each piece of download information represents one download of a resource by a user, it is also possible to count, within a single website domain name, the number of times the network resource corresponding to a given network resource identifier has been downloaded, and thus to gauge how much attention that website domain name receives.
Step 502 may further comprise:
Step a502: determine how many of the network resource identifiers corresponding to the website domain name are contained in the network resource identifier set.
For example, suppose the network resource identifier set contains four identifiers, A, B, C and D, and the network resource identifiers corresponding to website domain name 1 are B, C, D, E and F. The number of identifiers of website domain name 1 contained in the set is then 3.
Step b502: from the quantity determined and the total number of network resource identifiers corresponding to the website domain name, determine the percentage of the website domain name's network resource identifiers that were matched.
For example, the network resource identifiers corresponding to website domain name 1 are B, C, D, E and F, and 3 of them are contained in the network resource identifier set; the matched identifiers therefore account for 3/5 = 60% of all the network resource identifiers corresponding to the website domain name.
Step c502: when the percentage determined is greater than the first threshold, determine that the website type corresponding to the network resource identifier set is the type of the website corresponding to the website domain name.
Because the number of network resource identifiers determined for a website domain name may be very small, for example only a few, it may be unnecessary to determine whether the website corresponding to that domain name contains monitored network resources. Accordingly, the following may further be included before step 501:
Determine that, in the correspondence between website domain names and network resource identifiers, the number of network resource identifiers corresponding to the same website domain name is greater than a second threshold.
If the number of network resource identifiers corresponding to the same website domain name is not greater than the second threshold, that website domain name need not be processed.
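A minimal sketch of this pre-filter follows, assuming the correspondence is held as a dictionary from website domain name to a set of network resource identifiers. The function name `domains_worth_checking`, the sample data, and the threshold value 2 are hypothetical.

```python
def domains_worth_checking(domain_to_ids, second_threshold):
    """Keep only domains with more identifiers than the second threshold."""
    return {
        domain: ids
        for domain, ids in domain_to_ids.items()
        if len(ids) > second_threshold
    }


# Hypothetical correspondence: one large domain, one with very few identifiers.
correspondence = {
    "big.example": {"c1", "c2", "c3", "c4"},
    "tiny.example": {"c1", "c9"},
}
kept = domains_worth_checking(correspondence, second_threshold=2)
```

Domains that fail the test are simply skipped, so the matching step only runs where the ratio is based on enough identifiers to be meaningful.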
Because the embodiment of the invention can analyze download messages continuously (extracting, at a set time, the download address and network resource identifier from each download message obtained), websites that offer downloads of illegal resources can be discovered promptly. A website can be discovered as soon as any user downloads from it, so websites that no network resource search system is able to find can also be discovered.
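The extraction and conversion mentioned in the parentheses can be sketched as follows. The message layout (a dictionary with "url" and "cid" keys) and the function name `build_domain_map` are assumptions for illustration, and taking the URL host name as the website domain name is a simplification; the patent does not specify how a download address is converted.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_domain_map(download_messages):
    """Map each website domain name to the set of identifiers downloaded from it."""
    mapping = defaultdict(set)
    for message in download_messages:
        host = urlsplit(message["url"]).hostname  # download address -> host name
        if host:  # skip addresses that carry no host part
            mapping[host].add(message["cid"])
    return dict(mapping)


# Hypothetical download messages.
messages = [
    {"url": "http://example.com/files/a.zip", "cid": "cidA"},
    {"url": "http://example.com/files/b.zip", "cid": "cidB"},
    {"url": "http://other.org/a.zip", "cid": "cidA"},
]
domain_map = build_domain_map(messages)
```

Running this over the messages collected in each time window yields the correspondence that the later matching steps consume.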
Those skilled in the art should understand that each module or step of the above embodiment of the invention can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. The present invention is therefore not limited to any specific combination of hardware and software.
As can be seen from the foregoing description, the embodiment of the invention determines the correspondence between a website domain name and network resource identifiers; matches all of the network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set; and, when the proportion of matched network resource identifiers among all network resource identifiers corresponding to the website domain name is greater than a first threshold, determines that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set. Because it can be determined whether the website corresponding to a website domain name contains monitored network resources, the content of each website does not need to be crawled and analyzed in order to determine whether the website is one that provides illegal resources. This reduces the storage space and amount of computation required for processing, and improves processing speed and efficiency.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A method for determining a website type, characterized in that the method comprises:
determining a correspondence between a website domain name and network resource identifiers;
matching all network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set, wherein a network resource identifier is a content signature (CID) used to identify a downloaded file, and the CID is obtained by applying a preset algorithm to the content data of the binary file;
when, among all network resource identifiers corresponding to the website domain name, the proportion of matched network resource identifiers is greater than a first threshold, determining that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
2. The method of claim 1, characterized in that the network resource identifier set is established by the following steps:
determining a plurality of sample websites of the same website type;
obtaining network resources from the sample websites;
determining the network resource identifier of each network resource obtained, and combining the identifiers into the network resource identifier set.
3. The method of claim 1, characterized in that, after matching all of the network resource identifiers corresponding to the website domain name against the pre-established network resource identifier set and before determining the website type corresponding to the website domain name, the method further comprises:
determining how many of the network resource identifiers corresponding to the website domain name are contained in the network resource identifier set;
determining, from the determined number and the total number of network resource identifiers corresponding to the website domain name, the proportion of all network resource identifiers corresponding to the website domain name that are matched.
4. The method of claim 1, characterized in that determining the correspondence between the website domain name and network resource identifiers comprises:
extracting, at a set time, a download address and a network resource identifier from each download message obtained;
converting each extracted download address into a website domain name;
determining the correspondence between the website domain name and the network resource identifiers.
5. The method of any one of claims 1 to 4, characterized in that, before matching all of the network resource identifiers corresponding to the website domain name against the pre-established network resource identifier set, the method further comprises:
determining that, in the correspondence between the website domain name and network resource identifiers, the number of network resource identifiers corresponding to the same website domain name is greater than a second threshold, and thereupon triggering the matching of all of the network resource identifiers corresponding to the website domain name against the pre-established network resource identifier set.
6. A device for determining a website type, characterized in that the device comprises:
a correspondence determination module, configured to determine a correspondence between a website domain name and network resource identifiers;
a matching module, configured to match all network resource identifiers corresponding to the website domain name against a pre-established network resource identifier set, wherein a network resource identifier is a content signature (CID) used to identify a downloaded file, and the CID is obtained by applying a preset algorithm to the content data of the binary file;
a processing module, configured to determine, when the proportion of matched network resource identifiers among all network resource identifiers corresponding to the website domain name is greater than a first threshold, that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
7. The device of claim 6, characterized in that the device further comprises:
a set establishing module, configured to determine a plurality of sample websites of the same website type, obtain network resources from the sample websites, determine the network resource identifier of each network resource obtained, and combine the identifiers into the network resource identifier set.
8. The device of claim 6, characterized in that the processing module comprises:
a quantity determination module, configured to determine how many of the network resource identifiers corresponding to the website domain name are contained in the network resource identifier set;
a computing module, configured to determine, from the determined number and the total number of network resource identifiers corresponding to the website domain name, the proportion of all network resource identifiers corresponding to the website domain name that are matched;
a website type determination module, configured to determine, when the determined proportion is greater than the first threshold, that the website type corresponding to the website domain name is the website type corresponding to the network resource identifier set.
9. The device of claim 6, characterized in that the correspondence determination module comprises:
an extraction module, configured to extract, at a set time, a download address and a network resource identifier from each download message obtained;
a conversion module, configured to convert each extracted download address into a website domain name;
an establishing module, configured to determine the correspondence between the website domain name and the network resource identifiers.
10. The device of any one of claims 6 to 9, characterized in that the device further comprises:
a trigger module, configured to trigger the matching module to process a website domain name when, in the correspondence between website domain names and network resource identifiers determined by the correspondence determination module, the number of network resource identifiers corresponding to that website domain name is greater than a second threshold.
CN2008101858391A 2008-12-18 2008-12-18 Method and apparatus for confirming website type Expired - Fee Related CN101488140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101858391A CN101488140B (en) 2008-12-18 2008-12-18 Method and apparatus for confirming website type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101858391A CN101488140B (en) 2008-12-18 2008-12-18 Method and apparatus for confirming website type

Publications (2)

Publication Number Publication Date
CN101488140A CN101488140A (en) 2009-07-22
CN101488140B true CN101488140B (en) 2011-01-19

Family

ID=40891034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101858391A Expired - Fee Related CN101488140B (en) 2008-12-18 2008-12-18 Method and apparatus for confirming website type

Country Status (1)

Country Link
CN (1) CN101488140B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103152371B (en) * 2011-12-07 2016-06-22 腾讯科技(深圳)有限公司 P2SP downloads monitoring and managing method and system
CN108512720B (en) * 2018-03-02 2021-01-26 杭州迪普科技股份有限公司 Website traffic statistical method and device

Also Published As

Publication number Publication date
CN101488140A (en) 2009-07-22


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110119

Termination date: 20111218