CN113783855A - Site evaluation method, site evaluation device, electronic apparatus, storage medium, and program product - Google Patents

Publication number
CN113783855A
Authority
CN
China
Prior art keywords
site
determining
geographic
station
belongs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111007121.5A
Other languages
Chinese (zh)
Other versions
CN113783855B (en)
Inventor
王鹏
刘伟
余文利
陈由之
杨国强
张博
林赛群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111007121.5A priority Critical patent/CN113783855B/en
Publication of CN113783855A publication Critical patent/CN113783855A/en
Priority to PCT/CN2022/086180 priority patent/WO2023029486A1/en
Application granted granted Critical
Publication of CN113783855B publication Critical patent/CN113783855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/52 Network services specially adapted for the location of the user terminal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a site evaluation method, a site evaluation apparatus, an electronic device, a storage medium, and a program product. It relates to the fields of network security and content recommendation and can be applied to site link crawling and site library maintenance scenarios. The method comprises: obtaining a set of internet protocol addresses associated with a site; determining a set of geographic features associated with the set of internet protocol addresses, a geographic feature in the set of geographic features indicating a geographic location at which a server associated with the site is located; and determining whether the site is a bad site based on the set of geographic features. With this method, whether a site is a bad site can be determined from the internet protocol addresses associated with the site, which reduces the cost of identifying bad sites and improves the quality and efficiency of site link crawling and site library maintenance.

Description

Site evaluation method, site evaluation device, electronic apparatus, storage medium, and program product
Technical Field
The present disclosure relates to the field of network security and the field of content recommendation, applicable to site link crawling and site library maintenance scenarios, and more particularly, to a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Tens of thousands of domain names, or more, are added to the internet every day, and sites appear and disappear frequently. For the average user, the life cycle of a well-known site, which may span years or even decades, is often hard to perceive; for the internet as a whole, however, the memory of a site's existence is short-lived. With the rapid development of techniques for creating and maintaining sites, some people create sites in batches, produce black-market and gray-market resources on many sites in parallel, and frequently switch sites to evade management and control. These sites typically contain worthless spam and objectionable content, and may therefore be referred to as bad sites. If such bad sites remain in the normal internet ecosystem and are shown to the public, they degrade users' experience of the internet and, to some extent, promote the dissemination of harmful information. Likewise, if a site library includes too many bad sites, the query experience of users is seriously affected.
However, conventional techniques for site evaluation do not solve the above problems with high quality and efficiency.
Disclosure of Invention
According to an embodiment of the present disclosure, there are provided a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
In a first aspect of the present disclosure, there is provided a site evaluation method, including: obtaining a set of internet protocol addresses associated with a site; determining a set of geographic features associated with the set of internet protocol addresses, a geographic feature in the set of geographic features indicating a geographic location at which a server associated with the site is located; and determining whether the site is a bad site based on the set of geographic features.
In a second aspect of the present disclosure, there is provided a site evaluation apparatus including: a first acquisition module configured to acquire a set of internet protocol addresses associated with a site; a first determination module configured to determine a set of geographic features associated with the set of internet protocol addresses, a geographic feature in the set of geographic features indicating a geographic location at which a server associated with the site is located; and a second determination module configured to determine whether the site is a bad site based on the set of geographic features.
In a third aspect of the present disclosure, an electronic device is provided, comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to implement a method according to the first aspect of the disclosure.
In a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to implement a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method according to the first aspect of the present disclosure.
The technology according to the present application provides a site evaluation method with which it is possible to determine whether a site is a bad site based on the internet protocol addresses associated with the site, so that the cost of identifying bad sites can be reduced and the quality and efficiency of site link crawling and site library maintenance can be improved.
It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure. It should be understood that the drawings are for a better understanding of the present solution and do not constitute a limitation of the present disclosure. Wherein:
FIG. 1 illustrates a schematic block diagram of a site evaluation environment 100 in which a site evaluation method in certain embodiments of the present disclosure may be implemented;
FIG. 2 shows a flow diagram of a site evaluation method 200 according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a site evaluation method 300 according to an embodiment of the present disclosure;
FIG. 4 shows a schematic block diagram of a site evaluation apparatus 400 according to an embodiment of the present disclosure; and
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above in the Background, conventional techniques for site evaluation cannot solve the above problems with high quality and efficiency. In particular, conventional approaches rely purely on classification models. Evaluating a site with a classification model is essentially based on the content the site includes, so tens, hundreds, or even more pieces of information may need to be extracted from each site to support the evaluation. Therefore, if site evaluation needs to be done for 30 million sites in the Internet, it may need to be judged using a classification model for 300-. As can be seen, conventional techniques for site evaluation are inefficient. Meanwhile, because the information extracted from a site comes in many forms and kinds of content that differ greatly from one another, the classification model also suffers from quality problems when evaluating sites.
In order to at least partially solve one or more of the above problems and other potential problems, and in view of certain characteristics shared by most sites operated by people who aim to produce spam and objectionable content, the present application proposes a site evaluation method. The technical solution of the method is to determine whether a site is a bad site based on the internet protocol addresses associated with the site, so that the cost of identifying bad sites can be reduced and the quality and efficiency of site link crawling and site library maintenance can be improved.
FIG. 1 illustrates a schematic block diagram of a site evaluation environment 100 in which a site evaluation method in certain embodiments of the present disclosure may be implemented. In accordance with one or more embodiments of the present disclosure, site evaluation environment 100 may be a cloud environment. As shown in FIG. 1, site evaluation environment 100 includes a computing device 110. In the site evaluation environment 100, evaluation related data 120 is provided to the computing device 110 as input. The evaluation related data 120 may include, for example, a domain name of the site or a set of internet protocol addresses associated with the site. In accordance with one or more embodiments of the present disclosure, querying the internet protocol address associated with the same site at different times may return different internet protocol addresses. Thus, the internet protocol address associated with a site may be acquired multiple times within a predetermined time period, thereby forming a set of internet protocol addresses. The set of internet protocol addresses may include one or more internet protocol addresses.
After obtaining the set of internet protocol addresses associated with the site, the computing device 110 may determine a set of geographic features associated with the set of internet protocol addresses. In accordance with one or more embodiments of the present disclosure, a geographic feature of a set of geographic features indicates a geographic location at which a server associated with a site is located.
After determining the set of geographic features associated with the site and the set of internet protocol addresses, the computing device 110 can determine whether the site is a bad site based on the set of geographic features.
It should be appreciated that the set of geographic features associated with the site and the set of internet protocol addresses can also be provided directly to the computing device 110 as part of the evaluation related data 120. In this case, the computing device 110 need not determine the set of geographic features itself, but may use the provided set directly to determine whether the site is a bad site.
According to one or more embodiments of the present disclosure, the evaluation related data 120 may further include a correspondence table for determining the set of geographic features, the correspondence table indicating a correspondence between geographic locations and internet protocol address ranges. In this case, the computing device 110 may determine the set of geographic features from the acquired set of internet protocol addresses based on the correspondence table.
According to one or more embodiments of the present disclosure, the evaluation related data 120 can also include Chinese content and/or total content associated with the site. In this case, the computing device 110 may determine whether the site is a bad site based on the set of geographic features and, further, on the Chinese content and/or the total content associated with the site.
It should be appreciated that the site evaluation environment 100 is merely exemplary and not limiting. It is scalable: more computing devices 110 can be included, and more evaluation related data 120 can be provided as input to the computing devices 110, so that more users can use more computing devices 110 and more evaluation related data 120 to evaluate more sites, simultaneously or not, and determine whether those sites are bad sites.
In the site evaluation environment 100 shown in FIG. 1, the input of evaluation-related data 120 to the computing device 110 can be made over a network.
Fig. 2 shows a flow diagram of a site evaluation method 200 according to an embodiment of the present disclosure. In particular, the site evaluation method 200 may be performed by the computing device 110 in the site evaluation environment 100 shown in FIG. 1. It should be understood that the site evaluation method 200 may also include additional operations not shown and/or may omit the operations shown, as the scope of the present disclosure is not limited in this respect.
At block 202, the computing device 110 obtains a set of internet protocol addresses associated with the site. In accordance with one or more embodiments of the present disclosure, computing device 110 may obtain an internet protocol address associated with a site for the same site multiple times within a predetermined time period, thereby forming a set of internet protocol addresses.
According to some embodiments of the present disclosure, the computing device 110 may obtain a set of internet protocol addresses associated with a site according to a domain name service provided by a Domain Name System (DNS). The domain name service may be provided by different operators, and there may be differences in authority of top-level domain name queries provided by different operators. Thus, the computing device 110 may use a domain name service with higher authority to obtain a set of internet protocol addresses associated with a site.
According to other embodiments of the present disclosure, the computing device 110 may also obtain the set of internet protocol addresses associated with a site by other means, such as a search engine or a domain name query service provided by a company associated with the search engine.
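The repeated lookup described for block 202 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the helper names `system_resolve` and `collect_ip_set`, the injectable `resolve` callable, and the `rounds` and `interval_s` values are all assumptions. The default resolver uses the standard-library call `socket.gethostbyname_ex`, which returns IPv4 addresses only.

```python
import socket
import time
from typing import Callable, Iterable, Set


def system_resolve(domain: str) -> Iterable[str]:
    """Resolve a domain to IPv4 addresses with the system resolver."""
    # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
    _, _, addresses = socket.gethostbyname_ex(domain)
    return addresses


def collect_ip_set(
    domain: str,
    resolve: Callable[[str], Iterable[str]] = system_resolve,
    rounds: int = 5,          # assumed number of lookups in the time period
    interval_s: float = 0.0,  # assumed pause between lookups, in seconds
) -> Set[str]:
    """Query the same domain several times and merge all answers."""
    seen: Set[str] = set()
    for i in range(rounds):
        try:
            seen.update(resolve(domain))
        except OSError:
            # A failed lookup contributes nothing; continue with the next one.
            pass
        if interval_s and i < rounds - 1:
            time.sleep(interval_s)
    return seen
```

Passing the resolver as a parameter keeps the sketch independent of any particular domain name service, so a more authoritative service could be substituted without changing the collection loop.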
At block 204, the computing device 110 determines a set of geographic features associated with the set of internet protocol addresses obtained at block 202. In accordance with one or more embodiments of the present disclosure, a geographic feature of a set of geographic features indicates a geographic location at which a server associated with a site is located.
According to one or more embodiments of the present disclosure, the computing device 110 may first obtain a correspondence table indicating a correspondence between geographic locations and internet protocol address ranges, and then determine the set of geographic features associated with the set of internet protocol addresses based on the correspondence table. For example, the correspondence table may indicate that the internet protocol address range 23.248.192.1-23.248.192.255 corresponds to "North American region". Thus, when the internet protocol addresses associated with a site are 23.248.192.3 and 23.248.192.27, it may be determined that both internet protocol addresses are associated with the geographic location "North American region".
At block 206, the computing device 110 determines whether the site is a bad site based on the set of geographic features determined at block 204. According to one or more embodiments of the present disclosure, the computing device 110 may make this determination according to information such as whether the geographic features in the set include foreign geographic locations and how many foreign geographic locations are included.
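The table lookup above can be illustrated with a short sketch based on Python's standard `ipaddress` module. Only the single table row is taken from the text; the names `CORRESPONDENCE_TABLE`, `locate`, and `geographic_features`, and the choice to return `None` for unmapped addresses, are assumptions made for illustration.

```python
import ipaddress
from typing import Iterable, List, Optional, Set, Tuple

# Rows are (first address, last address, geographic location).
# Only this one row appears in the text; a real table would hold many rows.
CORRESPONDENCE_TABLE: List[Tuple[str, str, str]] = [
    ("23.248.192.1", "23.248.192.255", "North American region"),
]


def locate(ip: str, table: List[Tuple[str, str, str]] = CORRESPONDENCE_TABLE) -> Optional[str]:
    """Return the location whose address range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for first, last, location in table:
        if ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last):
            return location
    return None


def geographic_features(
    ips: Iterable[str],
    table: List[Tuple[str, str, str]] = CORRESPONDENCE_TABLE,
) -> Set[str]:
    """Map a set of addresses to the set of locations they fall in."""
    located = (locate(ip, table) for ip in ips)
    return {location for location in located if location is not None}
```

A linear scan is enough for a sketch; a production table with many ranges would more plausibly be sorted and searched by bisection.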
Fig. 3 shows a flow diagram of a site evaluation method 300 according to an embodiment of the disclosure. In particular, the site evaluation method 300 may be performed by the computing device 110 in the site evaluation environment 100 shown in FIG. 1. It should be understood that the site evaluation method 300 may also include additional operations not shown and/or may omit the operations shown, as the scope of the present disclosure is not limited in this respect.
At block 302, the computing device 110 obtains a set of internet protocol addresses associated with a site. The content referred to in block 302 is the same as that referred to in block 202 and will not be described in detail here.
At block 304, the computing device 110 determines a set of geographic features associated with the set of internet protocol addresses obtained at block 302. The content referred to in block 304 is the same as that referred to in block 204 and will not be described further herein.
At block 306, the computing device 110 determines whether the geographic locations indicated by the geographic features in the set of geographic features determined at block 304 are all domestic geographic locations. If it is determined at block 306 that the indicated geographic locations are all domestic geographic locations, the method 300 proceeds to block 308; otherwise, method 300 proceeds to block 310.
In accordance with one or more embodiments of the present disclosure, a bad site generally uses foreign servers for purposes such as evading regulation. Thus, a site is typically not a bad site if the internet protocol addresses determined for that site are all associated with domestic geographic locations.
At block 308, the computing device 110 determines that the site is not a bad site. As previously described, a site is typically not a bad site when the internet protocol addresses determined for that site are all associated with domestic geographic locations. Therefore, when the determination of block 306 is yes, the computing device 110 may determine at block 308 that the site is not a bad site, at which point the flow of the method 300 ends.
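The check at block 306 amounts to a subset test, sketched below. The label set `DOMESTIC_LOCATIONS` is a placeholder assumption, since the text does not say how domestic locations are labeled in the correspondence table, and treating an empty feature set as "not all domestic" is likewise an assumption.

```python
from typing import Iterable, Set

# Placeholder label; the text does not specify how domestic locations are named.
DOMESTIC_LOCATIONS: Set[str] = {"Domestic"}


def all_domestic(features: Iterable[str], domestic: Set[str] = DOMESTIC_LOCATIONS) -> bool:
    """Return True only if every observed location is domestic."""
    observed = set(features)
    # An empty set carries no evidence, so it is not treated as all-domestic.
    return bool(observed) and observed <= domestic
```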
At block 310, the computing device 110 determines whether the site is a normal foreign language site. If it is determined at block 310 that the site is a normal foreign language site, the method 300 proceeds to block 308; otherwise, method 300 proceeds to block 312. According to one or more embodiments of the present disclosure, the server of a normal foreign language site will typically be located at a foreign geographic location, and thus the internet protocol address determined for the site will be associated with a foreign geographic location. In this case, the site should be determined not to be a bad site on the basis that it is a normal foreign language site.
As previously described, because normal foreign language sites are not bad sites, when the determination of block 310 is yes, the computing device 110 may determine at block 308 that the site is not a bad site, at which point the flow of method 300 ends.
According to one or more embodiments of the present disclosure, computing device 110 may determine whether a site is a normal foreign language site from the domain name of the site. For example, if the domain name of a site includes a suffix such as .us or .jp, this indicates that the site is a foreign language site of the United States or Japan, and the site can be considered a normal foreign language site.
Thus, according to some embodiments of the present disclosure, the computing device 110 may first obtain a domain name associated with the site, and then determine whether the site belongs to a normal foreign language site based on the domain name.
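A minimal sketch of the suffix-based check follows. The two suffixes are the ones named in the text; the mapping name `FOREIGN_SUFFIXES` and the helper `foreign_country_from_domain` are hypothetical, and a real deployment would need a fuller suffix list.

```python
from typing import Dict, Optional

# Only .us and .jp are named in the text; other suffixes would be added similarly.
FOREIGN_SUFFIXES: Dict[str, str] = {
    ".us": "United States",
    ".jp": "Japan",
}


def foreign_country_from_domain(domain: str) -> Optional[str]:
    """Return the country implied by the domain suffix, or None."""
    # Normalize case and drop a trailing root dot before matching.
    host = domain.lower().rstrip(".")
    for suffix, country in FOREIGN_SUFFIXES.items():
        if host.endswith(suffix):
            return country
    return None
```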
According to further embodiments of the present disclosure, even if a site is determined to be a foreign language site, it is necessary to consider whether the country associated with the site matches the geographic location associated with the site, and the site is determined to be a normal foreign language site only if the two match. In this case, the computing device 110 may first obtain a domain name associated with the site, then determine a country associated with the site based on the domain name, and then determine that the site is a normal foreign language site based on a match between the determined country and the foreign geographic location associated with the site. For example, if the domain name of a site includes the suffix .jp, this indicates that the site is a foreign language site of Japan. However, if the internet protocol address associated with this site is 23.248.192.3, then according to the embodiments previously described with respect to block 204, the internet protocol address is associated with the geographic location "North American region". Since the determined country "Japan" does not match the determined geographic location "North American region", the site cannot be determined to be a normal foreign language site.
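The matching step can be sketched as below. The mappings `SUFFIX_TO_COUNTRY` and `COUNTRY_TO_LOCATIONS` are illustrative assumptions, in particular the pairing of "United States" with the table label "North American region". Requiring every observed location to match the country is also an assumption; the text only says that the country and the location must match.

```python
from typing import Dict, Iterable, Optional, Set

SUFFIX_TO_COUNTRY: Dict[str, str] = {".jp": "Japan", ".us": "United States"}

# Assumed mapping from a country to the table labels accepted for it.
COUNTRY_TO_LOCATIONS: Dict[str, Set[str]] = {
    "Japan": {"Japan"},
    "United States": {"North American region"},
}


def country_of(domain: str) -> Optional[str]:
    """Return the country implied by the domain suffix, or None."""
    host = domain.lower().rstrip(".")
    for suffix, country in SUFFIX_TO_COUNTRY.items():
        if host.endswith(suffix):
            return country
    return None


def matches_declared_country(domain: str, features: Iterable[str]) -> bool:
    """True if the suffix implies a country and all observed locations fit it."""
    country = country_of(domain)
    observed = set(features)
    if country is None or not observed:
        return False
    return observed <= COUNTRY_TO_LOCATIONS[country]
```

With these assumptions, a .jp domain observed only in "North American region" is rejected, which mirrors the example in the text.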
According to still other embodiments of the present disclosure, the computing device 110 may determine whether a site is a normal foreign language site based on the proportion of Chinese content that the site includes. For example, the computing device 110 may first determine the ratio of the Chinese content included by the site to the total content included by the site, and determine that the site is a normal foreign language site when the determined ratio is less than a threshold ratio, e.g., 5%. It should be appreciated that, in determining this ratio, the computing device 110 need not obtain all of the Chinese content and all of the total content included by the site, but may estimate the ratio by sampling the site, e.g., by randomly extracting the content of a predetermined number of pages included by the site. For example, if the ratio of Chinese content to total content is 4% in 10 randomly extracted pages of a site, the ratio of Chinese content to total content for the entire site may be estimated to be 4% as well.
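The sampling estimate can be sketched as follows, under stated assumptions: content is measured in non-whitespace characters, "Chinese content" is approximated by characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and pages are supplied as plain-text strings. None of these choices come from the text; the 5% threshold and the sample of 10 pages are the examples it gives.

```python
import random
from typing import Optional, Sequence


def is_chinese_char(ch: str) -> bool:
    """Approximate test: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def chinese_ratio(pages: Sequence[str]) -> float:
    """Share of non-whitespace characters that are Chinese characters."""
    total = sum(1 for page in pages for ch in page if not ch.isspace())
    if total == 0:
        return 0.0
    chinese = sum(1 for page in pages for ch in page if is_chinese_char(ch))
    return chinese / total


def estimate_chinese_ratio(
    pages: Sequence[str],
    sample_size: int = 10,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate the site-wide ratio from a random sample of pages."""
    rng = rng or random.Random()
    sample = rng.sample(list(pages), min(sample_size, len(pages)))
    return chinese_ratio(sample)


def looks_like_foreign_language_site(
    pages: Sequence[str],
    threshold: float = 0.05,
    sample_size: int = 10,
    rng: Optional[random.Random] = None,
) -> bool:
    """True when the estimated Chinese ratio is below the threshold."""
    return estimate_chinese_ratio(pages, sample_size, rng) < threshold
```

Note that a site with no pages yields a ratio of 0.0 in this sketch, so callers would want to handle the empty case separately.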
At block 312, the computing device 110 determines the number of foreign geographic locations included in the geographic locations indicated by the geographic features in the set of geographic features determined at block 304. In accordance with one or more embodiments of the present disclosure, computing device 110 may obtain the internet protocol address associated with the same site multiple times within a predetermined time period, thereby forming a set of internet protocol addresses. Thus, the geographic locations indicated by the geographic features in the set may include multiple foreign geographic locations. In this case, whether the site is a bad site may be determined according to the number of foreign geographic locations included in the geographic locations indicated by the geographic features in the set.
At block 314, the computing device 110 determines whether the number of foreign geographic locations determined at block 312 is greater than a threshold number. If it is determined at block 314 that the number of foreign geographic locations is greater than the threshold number, the method 300 proceeds to block 316; otherwise, the method 300 proceeds to block 318.
According to one or more embodiments of the present disclosure, when the number of foreign geographic locations included in the geographic locations indicated by the geographic features in the set is greater than a threshold number, for example, 3, the site may be considered to be using an internet protocol pool. Use of an internet protocol pool refers to a case where the internet protocol addresses acquired for the same site multiple times within a predetermined time period include a large number of addresses, for example, 10, and these addresses are associated with a plurality of different foreign geographic locations, for example, 3. In this case, the owner of the site is likely using an internet protocol pool to evade regulation, and thus the site can be considered a bad site.
For example, for site wap4.173kxs.com, the following internet protocol addresses may be acquired within a predetermined time period:
23.248.192.27
23.235.160.147
103.37.3.50
156.234.80.107
23.248.192.3
23.226.55.155
23.248.192.35
156.234.80.234
103.75.47.123
103.37.0.170
23.248.199.139
43.241.46.91
23.248.192.11
43.241.46.99
23.248.196.234.
Then, for these internet protocol addresses, based on a correspondence table indicating a correspondence between geographic locations and internet protocol address ranges, the correspondence between each internet protocol address and a geographic location may be determined as:
23.248.192.27 [North American region]
23.235.160.147 [Japan]
103.37.3.50 [Europe]
156.234.80.107 [Japan]
23.248.192.3 [North American region]
23.226.55.155 [Europe]
23.248.192.35 [North American region]
156.234.80.234 [Japan]
103.75.47.123 [Korea]
103.37.0.170 [Europe]
23.248.199.139 [North American region]
43.241.46.91 [Europe]
23.248.192.11 [North American region]
43.241.46.99 [Europe]
23.248.196.234 [North American region].
At this time, since as many as 15 internet protocol addresses are associated with the site wap4.173kxs.com, and these addresses cover 4 different foreign geographic locations, namely "North American region", "Japan", "Europe", and "Korea", the site wap4.173kxs.com should be determined to be a bad site.
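The worked example for wap4.173kxs.com can be reproduced with a short sketch. The address-to-location pairs are copied from the listing above, with the location labels normalized to one spelling; the names `OBSERVED`, `count_foreign_locations`, and `uses_ip_pool` are hypothetical, and the threshold of 3 is the example value given for block 314.

```python
from typing import Dict, FrozenSet

FOREIGN_LOCATION_THRESHOLD = 3  # example threshold from the text

# Address-to-location pairs from the example for wap4.173kxs.com.
OBSERVED: Dict[str, str] = {
    "23.248.192.27": "North American region",
    "23.235.160.147": "Japan",
    "103.37.3.50": "Europe",
    "156.234.80.107": "Japan",
    "23.248.192.3": "North American region",
    "23.226.55.155": "Europe",
    "23.248.192.35": "North American region",
    "156.234.80.234": "Japan",
    "103.75.47.123": "Korea",
    "103.37.0.170": "Europe",
    "23.248.199.139": "North American region",
    "43.241.46.91": "Europe",
    "23.248.192.11": "North American region",
    "43.241.46.99": "Europe",
    "23.248.196.234": "North American region",
}


def count_foreign_locations(
    ip_to_location: Dict[str, str],
    domestic: FrozenSet[str] = frozenset(),
) -> int:
    """Count distinct locations that are not in the domestic label set."""
    return len({loc for loc in ip_to_location.values() if loc not in domestic})


def uses_ip_pool(
    ip_to_location: Dict[str, str],
    threshold: int = FOREIGN_LOCATION_THRESHOLD,
    domestic: FrozenSet[str] = frozenset(),
) -> bool:
    """True when the number of distinct foreign locations exceeds the threshold."""
    return count_foreign_locations(ip_to_location, domestic) > threshold
```

For this data the sketch counts 15 addresses and 4 distinct foreign locations, which exceeds the threshold of 3 and agrees with the conclusion stated in the text.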
At block 316, the computing device 110 determines that the site is a bad site. As previously described, when the number of foreign geographic locations included in the geographic locations indicated by the geographic features determined for a site is greater than the threshold number, the site is considered to be using an internet protocol pool and is therefore determined to be a bad site. Accordingly, when the determination of block 314 is yes, the computing device 110 may determine at block 316 that the site is a bad site, at which point the flow of method 300 ends.
At block 318, the computing device 110 determines whether the site is a bad site based on the content included in the site. According to one or more embodiments of the present disclosure, when blocks 302 to 316 still cannot determine whether a site is a bad site, the determination may be made based on the content included in the site.
For example, a site may be determined to be a bad site when the content included with the site indicates at least one of:
the content included in the site includes illegal information, for example, information relating to pornography, lottery, cults, and the like;
the content included in the site relates to improper content scraping, for example, the content included in the site is formed by crawling other sites and piecing the results together;
the content included in the site relates to piracy, for example, the content included in the site includes pirated novels, comics, and the like;
the content included in the site relates to traffic diversion, for example, the content included in the site includes behavior that diverts traffic to other sites;
the content included in the site relates to traffic hijacking, for example, the content included in the site includes traffic redirection implemented by means of a Trojan horse or the like;
the content included in the site relates to content flooding, for example, the content included in the site consists of low-value filler posts.
It should be appreciated that whether the content included in the site indicates at least one of the foregoing items may be determined by performing semantic recognition, image recognition, or the like on that content.
It should be understood that method 300 includes more steps than method 200 and may be considered an extension of method 200.
The foregoing describes, with reference to fig. 1-3, relevant content of a site evaluation environment 100 in which a site evaluation method in certain embodiments of the present disclosure may be implemented, a site evaluation method 200 according to an embodiment of the present disclosure, and a site evaluation method 300 according to an embodiment of the present disclosure. It should be understood that the above description is intended to better illustrate what is recited in the present disclosure, and is not intended to be limiting in any way.
It should be understood that the number of various elements and the size of physical quantities employed in the various drawings of the present disclosure are by way of example only and are not limiting upon the scope of the present disclosure. The above numbers and sizes may be arbitrarily set as needed without affecting the normal implementation of the embodiments of the present disclosure.
Details of the site evaluation method 200 and the site evaluation method 300 according to embodiments of the present disclosure have been described above with reference to fig. 1 to 3. Hereinafter, each module in the site evaluation apparatus will be described with reference to fig. 4.
Fig. 4 is a schematic block diagram of a site evaluation apparatus 400 according to an embodiment of the present disclosure. As shown in fig. 4, the site evaluation apparatus 400 may include: a first obtaining module 410 configured to obtain a set of internet protocol addresses associated with a site; a first determination module 420 configured to determine a geographic feature set associated with the set of internet protocol addresses, the geographic features in the geographic feature set indicating geographic locations at which servers associated with the site are located; and a second determination module 430 configured to determine whether the site belongs to a bad site based on the geographic feature set.
In one or more embodiments, the first determining module 420 includes: a second obtaining module (not shown) configured to obtain a correspondence table indicating a correspondence between geographic locations and internet protocol address ranges; and a third determination module (not shown) configured to determine the geographic feature set based on the correspondence table.
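A lookup against such a correspondence table might look like the sketch below. The table contents are purely hypothetical: the address ranges are reserved documentation ranges and the labels "Region A" to "Region C" are placeholders, since the disclosure does not give the actual table. The function names are likewise illustrative. Only Python's standard `ipaddress` module is used.

```python
import ipaddress

# Hypothetical correspondence table: (address range, geographic location).
# The ranges are reserved documentation networks, not real geolocation data.
CORRESPONDENCE_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Region A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Region B"),
    (ipaddress.ip_network("203.0.113.0/24"), "Region C"),
]


def lookup_location(ip, table=CORRESPONDENCE_TABLE):
    """Return the location whose range contains ip, or None if no range matches."""
    address = ipaddress.ip_address(ip)
    for network, location in table:
        if address in network:
            return location
    return None


def geographic_feature_set(ips, table=CORRESPONDENCE_TABLE):
    """Map a set of addresses to the set of distinct locations found in the table."""
    locations = (lookup_location(ip, table) for ip in ips)
    return {loc for loc in locations if loc is not None}


print(geographic_feature_set(["192.0.2.1", "203.0.113.9", "192.0.2.200"]))
```

A linear scan is enough for illustration; a production table with many ranges would more likely use a sorted structure or a prefix tree.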
In one or more embodiments, the second determining module 430 includes: a fourth determination module (not shown) configured to determine that the site does not belong to the bad site if the geographic locations indicated by the geographic features in the geographic feature set are all domestic geographic locations.
In one or more embodiments, the second determining module 430 includes: a fifth determination module (not shown) configured to determine that the geographic locations indicated by the geographic features in the geographic feature set comprise a foreign geographic location; a sixth determination module (not shown) configured to determine whether the site belongs to a normal foreign language site; and a seventh determining module (not shown) configured to determine that the site does not belong to a bad site if it is determined that the site belongs to a normal foreign language site.
In one or more embodiments, the sixth determining module comprises: a third obtaining module (not shown) configured to obtain a domain name associated with the site; and an eighth determination module (not shown) configured to determine whether the site belongs to a normal foreign language site based on the domain name.
In one or more embodiments, the eighth determining module comprises: a ninth determining module (not shown) configured to determine a country associated with the site based on the domain name; and a tenth determination module (not shown) configured to determine that the site belongs to a normal foreign language site if the country matches a foreign geographic location.
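One way to read the domain-name check is sketched below, under the assumption that the country is derived from a country-code top-level domain. The disclosure does not state how the country is determined, so this mapping approach, the three-entry table, and the function names are all illustrative.

```python
# Small illustrative subset of country-code top-level domains.
CCTLD_TO_COUNTRY = {
    "jp": "Japan",
    "kr": "South Korea",
    "de": "Germany",
}


def country_from_domain(domain):
    """Return the country for the domain's ccTLD, or None if it is not in the table."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


def is_normal_foreign_site_by_domain(domain, foreign_locations):
    """True when the domain's country matches one of the observed foreign locations."""
    country = country_from_domain(domain)
    return country is not None and country in foreign_locations


print(is_normal_foreign_site_by_domain("www.example.jp", {"Japan"}))
```

A generic domain such as `example.com` yields no country here, so the check simply does not apply and the remaining criteria decide.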
In one or more embodiments, the sixth determining module comprises: an eleventh determining module (not shown) configured to determine a ratio of the Chinese content included in the site to the total content included in the site; and a twelfth determination module (not shown) configured to determine that the site belongs to a normal foreign language site if the ratio is less than a threshold ratio.
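The ratio test can be illustrated with a character-level sketch. Measuring content by non-whitespace characters, treating the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese content", and the 0.1 default threshold are all assumptions made for this example; the disclosure specifies neither the unit of measurement nor the threshold value.

```python
def chinese_content_ratio(text):
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    chinese = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return chinese / len(chars)


def is_normal_foreign_site_by_content(text, threshold_ratio=0.1):
    """Hypothetical check: little Chinese content suggests a normal foreign language site."""
    return chinese_content_ratio(text) < threshold_ratio


print(chinese_content_ratio("你好ab"))
print(is_normal_foreign_site_by_content("This page is in English."))
```

Note that this simple range also matches Japanese kanji, so a real system would likely need proper language identification rather than a code-point test.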
In one or more embodiments, the second determining module 430 includes: a thirteenth determining module (not shown) configured to determine the number of foreign geographic locations included in the geographic locations indicated by the geographic features in the geographic feature set; and a fourteenth determining module (not shown) configured to determine whether the site belongs to a bad site based on the number.
In one or more embodiments, the fourteenth determining module comprises: a fifteenth determining module (not shown) configured to determine that the site belongs to a bad site if the number is greater than the threshold number.
In one or more embodiments, the fourteenth determining module comprises: a sixteenth determining module (not shown) configured to determine that the number is less than or equal to the threshold number; and a seventeenth determining module (not shown) configured to determine whether the site belongs to a bad site based on content included in the site.
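Taken together, the determination modules suggest a decision cascade along the following lines. This is a sketch under stated assumptions: the ordering of the checks, the `"China"` domestic label, the default threshold of 3, and the two callables standing in for the foreign-language and content checks are all illustrative, not the claimed implementation.

```python
def evaluate_site(locations, is_normal_foreign, content_is_bad,
                  threshold_number=3, domestic="China"):
    """Return True if the site is judged bad, False otherwise.

    locations: geographic locations resolved from the site's IP addresses.
    is_normal_foreign, content_is_bad: zero-argument callables (hypothetical
    stand-ins for the foreign-language check and the content-based check).
    """
    foreign = {loc for loc in locations if loc != domestic}
    if not foreign:
        return False                      # all locations domestic: not bad
    if is_normal_foreign():
        return False                      # normal foreign language site: not bad
    if len(foreign) > threshold_number:
        return True                       # IP pool behavior: bad
    return content_is_bad()               # otherwise fall back to content


print(evaluate_site(["Japan", "Europe", "Korea", "North America region"],
                    lambda: False, lambda: False))
```

Passing the later checks as callables keeps the more expensive content analysis from running unless the cheaper address-based checks are inconclusive.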
Through the above description with reference to fig. 1 to 4, the technical solution according to the embodiments of the present disclosure has many advantages over the conventional solution. For example, with the technical solution of the method, whether the site belongs to a bad site can be determined based on the internet protocol address associated with the site, so that the cost for determining the bad site can be reduced, and therefore the quality and efficiency of site link capture and site library maintenance can be improved.
Statistically, the total number of sites on the internet can exceed 3 billion. Geographic features for about 20% of these sites indicate that the site is associated with a foreign geographic location; of this 20%, about 80% use an internet protocol pool and about 10% are normal foreign language sites. Therefore, with the technical solution according to the embodiments of the present disclosure, it is possible to efficiently determine, through the internet protocol addresses of the sites, whether about 600 million sites belong to bad sites, which cannot be easily achieved with the conventional solution.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 shown in fig. 1 and the site evaluation apparatus 400 shown in fig. 4 may be implemented by the electronic device 500. The electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the site evaluation methods 200 and 300. For example, in some embodiments, the site evaluation methods 200 and 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more of the steps of the site evaluation methods 200 and 300 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the site evaluation methods 200 and 300 in any other suitable manner (e.g., by way of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A site evaluation method, comprising:
obtaining a set of internet protocol addresses associated with a site;
determining a set of geographic features associated with the set of internet protocol addresses, a geographic feature in the set of geographic features indicating a geographic location at which a server associated with the site is located; and
determining whether the site belongs to a bad site based on the set of geographic features.
2. The method of claim 1, wherein determining the set of geographic features comprises:
obtaining a correspondence table indicating a correspondence between geographic locations and internet protocol address ranges; and
determining the set of geographic features based on the correspondence table.
3. The method of claim 1, wherein determining whether the site belongs to the bad site comprises:
determining that the site does not belong to the bad site if the geographic locations indicated by the geographic features in the set of geographic features are all domestic geographic locations.
4. The method of claim 1, wherein determining whether the site belongs to the bad site comprises:
determining that the geographic locations indicated by the geographic features in the set of geographic features comprise a foreign geographic location;
determining whether the site belongs to a normal foreign language site; and
determining that the site does not belong to the bad site if it is determined that the site belongs to the normal foreign language site.
5. The method of claim 4, wherein determining whether the site belongs to the normal foreign language site comprises:
acquiring a domain name associated with the site; and
determining whether the site belongs to the normal foreign language site based on the domain name.
6. The method of claim 5, wherein determining whether the site belongs to the normal foreign language site comprises:
determining a country associated with the site based on the domain name; and
determining that the site belongs to the normal foreign language site if the country matches the foreign geographic location.
7. The method of claim 4, wherein determining whether the site belongs to the normal foreign language site comprises:
determining a ratio of the Chinese content included in the site to the total content included in the site; and
determining that the site belongs to the normal foreign language site if the ratio is less than a threshold ratio.
8. The method of claim 1, wherein determining whether the site belongs to the bad site comprises:
determining the number of foreign geographic locations included in the geographic locations indicated by the geographic features in the set of geographic features; and
determining whether the site belongs to the bad site based on the number.
9. The method of claim 8, wherein determining whether the site belongs to the bad site comprises:
determining that the site belongs to the bad site if the number is greater than a threshold number.
10. The method of claim 8, wherein determining whether the site belongs to the bad site comprises:
determining that the number is less than or equal to a threshold number; and
determining whether the site belongs to the bad site based on content included in the site.
11. A site evaluation apparatus comprising:
a first acquisition module configured to acquire a set of internet protocol addresses associated with a site;
a first determination module configured to determine a set of geographic features associated with the set of internet protocol addresses, a geographic feature of the set of geographic features indicating a geographic location at which a server associated with the site is located; and
a second determination module configured to determine whether the site belongs to a bad site based on the set of geographic features.
12. The apparatus of claim 11, wherein the first determining module comprises:
a second obtaining module configured to obtain a correspondence table indicating a correspondence between geographic locations and internet protocol address ranges; and
a third determination module configured to determine the set of geographic features based on the correspondence table.
13. The apparatus of claim 11, wherein the second determining module comprises:
a fourth determination module configured to determine that the site does not belong to the bad site if the geographic locations indicated by the geographic features in the set of geographic features are all domestic geographic locations.
14. The apparatus of claim 11, wherein the second determining module comprises:
a fifth determination module configured to determine that the geographic locations indicated by the geographic features in the set of geographic features comprise a foreign geographic location;
a sixth determination module configured to determine whether the site belongs to a normal foreign language site; and
a seventh determination module configured to determine that the site does not belong to the bad site if it is determined that the site belongs to the normal foreign language site.
15. The apparatus of claim 14, wherein the sixth determining module comprises:
a third obtaining module configured to obtain a domain name associated with the site; and
an eighth determination module configured to determine whether the site belongs to the normal foreign language site based on the domain name.
16. The apparatus of claim 15, wherein the eighth determining module comprises:
a ninth determining module configured to determine a country associated with the site based on the domain name; and
a tenth determination module configured to determine that the site belongs to the normal foreign language site if the country matches the foreign geographic location.
17. The apparatus of claim 14, wherein the sixth determining module comprises:
an eleventh determining module configured to determine a ratio of the Chinese content included in the site to the total content included in the site; and
a twelfth determination module configured to determine that the site belongs to the normal foreign language site if the ratio is less than a threshold ratio.
18. The apparatus of claim 11, wherein the second determining module comprises:
a thirteenth determination module configured to determine a number of foreign geographic locations included in the geographic locations indicated by the geographic features in the set of geographic features; and
a fourteenth determination module configured to determine whether the site belongs to the bad site based on the number.
19. The apparatus of claim 18, wherein the fourteenth determining module comprises:
a fifteenth determination module configured to determine that the site belongs to the bad site if the number is greater than a threshold number.
20. The apparatus of claim 18, wherein the fourteenth determining module comprises:
a sixteenth determining module configured to determine that the number is less than or equal to a threshold number; and
a seventeenth determination module configured to determine whether the site belongs to the bad site based on content included in the site.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, performs the method of any one of claims 1-10.
CN202111007121.5A 2021-08-30 2021-08-30 Site evaluation method, apparatus, electronic device, storage medium, and program product Active CN113783855B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111007121.5A CN113783855B (en) 2021-08-30 2021-08-30 Site evaluation method, apparatus, electronic device, storage medium, and program product
PCT/CN2022/086180 WO2023029486A1 (en) 2021-08-30 2022-04-11 Site evaluation method and apparatus, and electronic device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111007121.5A CN113783855B (en) 2021-08-30 2021-08-30 Site evaluation method, apparatus, electronic device, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN113783855A true CN113783855A (en) 2021-12-10
CN113783855B CN113783855B (en) 2023-07-21

Family

ID=78840024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007121.5A Active CN113783855B (en) 2021-08-30 2021-08-30 Site evaluation method, apparatus, electronic device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN113783855B (en)
WO (1) WO2023029486A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023029486A1 (en) * 2021-08-30 2023-03-09 北京百度网讯科技有限公司 Site evaluation method and apparatus, and electronic device, storage medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354800A (en) * 2016-08-26 2017-01-25 中国互联网络信息中心 Undesirable website detection method based on multi-dimensional feature
CN106503244A (en) * 2016-11-08 2017-03-15 天津海量信息技术股份有限公司 A kind of processing method of URL similarity
CN109309668A (en) * 2018-08-30 2019-02-05 浙江贰贰网络有限公司 Website verification method, device, system, computer equipment and storage medium
CN109450853A (en) * 2018-10-11 2019-03-08 深圳市腾讯计算机系统有限公司 Malicious websites determination method, device, terminal and server
CN109522504A (en) * 2018-10-18 2019-03-26 杭州安恒信息技术股份有限公司 A method of counterfeit website is differentiated based on threat information
CN109891853A (en) * 2016-11-03 2019-06-14 微软技术许可有限责任公司 Impossible stroke is detected in being locally located
CN113269394A (en) * 2021-04-16 2021-08-17 合肥联宝信息技术有限公司 Data processing method, device and equipment and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130297338A1 (en) * 2012-05-07 2013-11-07 Ingroove, Inc. Method for Evaluating the Health of a Website
CN107231447A (en) * 2016-03-23 2017-10-03 北大方正集团有限公司 A kind of website spatial identification method and system
US10796316B2 (en) * 2017-10-12 2020-10-06 Oath Inc. Method and system for identifying fraudulent publisher networks
CN109543118B (en) * 2018-11-12 2020-06-12 中国人民解放军战略支援部队信息工程大学 Web landmark reliability assessment method and device based on multi-layer decision
CN109787961B (en) * 2018-12-24 2021-08-10 上海晶赞融宣科技有限公司 False flow identification method and device, storage medium and server
CN109819066A (en) * 2019-01-31 2019-05-28 平安科技(深圳)有限公司 A kind of intelligently parsing field name method and device based on geographical location
CN113783855B (en) * 2021-08-30 2023-07-21 北京百度网讯科技有限公司 Site evaluation method, apparatus, electronic device, storage medium, and program product

Also Published As

Publication number Publication date
CN113783855B (en) 2023-07-21
WO2023029486A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
CN102724187B (en) A kind of safety detection method for network address and device
CN109495467B (en) Method and device for updating interception rule and computer readable storage medium
CN107305611B (en) Method and device for establishing model corresponding to malicious account and method and device for identifying malicious account
US10380117B2 (en) Event occurrence place estimation method, computer-readable recording medium storing event occurrence place estimation program, and event occurrence place estimation apparatus
CN111163072B (en) Method and device for determining characteristic value in machine learning model and electronic equipment
CN104980446A (en) Detection method and system for malicious behavior
CN105516390B (en) Domain name management method and device
CN106599688A (en) Application category-based Android malicious software detection method
CN104143008A (en) Method and device for detecting phishing webpage based on picture matching
CN104751053A (en) Static behavior analysis method of mobile smart terminal software
CN112532624B (en) Black chain detection method and device, electronic equipment and readable storage medium
CN114095567A (en) Data access request processing method and device, computer equipment and medium
CN103150510A (en) Method and device for processing malicious behaviors of software
CN109144831B (en) Method and device for acquiring APP identification rule
CN104573033A (en) Dynamic URL filtering method and device
CN113783855A (en) Site evaluation method, site evaluation device, electronic apparatus, storage medium, and program product
CN110569509A (en) risk group identification method and device
CN103902906A (en) Mobile terminal malicious code detecting method and system based on application icon
CN108804501B (en) Method and device for detecting effective information
CN107451247B (en) User identification method and device
CN117040799A (en) Page interception rule generation and page access control method and device and electronic equipment
CN116738369A (en) Traffic data classification method, device, equipment and storage medium
CN104933061B (en) character string detection method and device and electronic equipment
CN113536086B (en) Model training method, account scoring method, device, equipment, medium and product
CN114531287B (en) Method, device, equipment and medium for detecting virtual resource acquisition behavior

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant