WO2023029486A1 - Site evaluation method, apparatus, electronic device, storage medium and program product - Google Patents

Site evaluation method, apparatus, electronic device, storage medium and program product

Info

Publication number
WO2023029486A1
WO2023029486A1 PCT/CN2022/086180 CN2022086180W WO2023029486A1 WO 2023029486 A1 WO2023029486 A1 WO 2023029486A1 CN 2022086180 W CN2022086180 W CN 2022086180W WO 2023029486 A1 WO2023029486 A1 WO 2023029486A1
Authority
WO
WIPO (PCT)
Prior art keywords
site
belongs
bad
geographic
determining
Prior art date
Application number
PCT/CN2022/086180
Other languages
English (en)
French (fr)
Inventor
王鹏
刘伟
余文利
陈由之
杨国强
张博
林赛群
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023029486A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/52 Network services specially adapted for the location of the user terminal

Definitions

  • the present disclosure relates to the fields of network security and content recommendation, can be applied to site link crawling and site library maintenance scenarios, and more specifically relates to a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
  • a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product are provided.
  • a site evaluation method is provided, including: acquiring a set of Internet Protocol addresses associated with a site; determining a set of geographic features associated with the set of Internet Protocol addresses, the geographic features in the set indicating the geographic location of a server associated with the site; and determining whether the site is a bad site based on the set of geographic features.
  • a site evaluation apparatus is provided, including: a first obtaining module configured to obtain a set of Internet Protocol addresses associated with a site; a first determining module configured to determine a set of geographic features associated with the set of Internet Protocol addresses, the geographic features in the set indicating the geographic location of a server associated with the site; and a second determining module configured to determine whether the site is a bad site based on the set of geographic features.
  • an electronic device is provided, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the method according to the first aspect of the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions for causing a computer to implement the method according to the first aspect of the present disclosure.
  • a computer program product comprising a computer program which, when executed by a processor, performs the method according to the first aspect of the present disclosure.
  • a method for assessing a site is provided.
  • FIG. 1 shows a schematic block diagram of a site assessment environment 100 in which a site assessment method in some embodiments of the present disclosure may be implemented;
  • FIG. 2 shows a flowchart of a site evaluation method 200 according to an embodiment of the present disclosure;
  • FIG. 3 shows a flowchart of a site evaluation method 300 according to an embodiment of the present disclosure;
  • FIG. 4 shows a schematic block diagram of a site evaluation device 400 according to an embodiment of the present disclosure.
  • FIG. 5 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure.
  • the term “comprise” and its variants denote open-ended inclusion, i.e., “including but not limited to”.
  • the term “or” means “and/or” unless otherwise stated.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one further embodiment”.
  • the terms “first”, “second”, etc. may refer to different or the same object. Other definitions, both express and implied, may also be included below.
  • this application proposes a site evaluation method. With the technical solution of this method, whether a site is a bad site can be judged based on the Internet Protocol addresses associated with the site, which reduces the cost of identifying bad sites and thereby improves the quality and efficiency of site link crawling and site library maintenance.
  • FIG. 1 shows a schematic block diagram of a site assessment environment 100 in which the site assessment method in some embodiments of the present disclosure may be implemented.
  • site assessment environment 100 may be a cloud environment.
  • site assessment environment 100 includes computing device 110 .
  • assessment related data 120 is provided to computing device 110 as input to computing device 110 .
  • Assessment-related data 120 may include, for example, a domain name of a site or a set of Internet Protocol addresses associated with a site.
  • querying the Internet Protocol address of the same site at different times may return different Internet Protocol addresses. Therefore, the Internet Protocol address associated with the site can be queried multiple times within a predetermined period of time, and the results together form an Internet Protocol address set.
  • the set of Internet Protocol addresses may include one or more Internet Protocol addresses.
  • computing device 110 may determine the set of geographic features associated with the set of Internet Protocol addresses.
  • the geographical feature in the geographical feature set indicates the geographic location of the server associated with the site.
  • computing device 110 may determine whether the site belongs to a bad site based on the set of geographic features.
  • the set of geographic features associated with the site's set of Internet Protocol addresses may also be provided directly to computing device 110 as part of assessment-related data 120.
  • in that case, the computing device 110 need not determine the geographic feature set itself, but may directly use the provided geographic feature set to determine whether the site is a bad site.
  • the assessment-related data 120 may further include a correspondence table for determining the geographic feature set, the correspondence table indicating a correspondence between geographic locations and Internet Protocol address ranges.
  • the computing device 110 may determine the geographic feature set from the acquired Internet Protocol address set based on the correspondence table.
  • the assessment-related data 120 may also include the Chinese content and/or all content associated with the site.
  • the computing device 110 may determine whether the site is a bad site based on the geographic feature set and further based on the Chinese content and/or all content associated with the site.
  • it should be understood that the site assessment environment 100 is exemplary only and not limiting, and that it is scalable: it may include more computing devices 110, and more assessment-related data 120 may be provided as input to the computing devices 110, so that more users can use more computing devices 110 at the same time, and can use more assessment-related data 120 to evaluate more sites, simultaneously or not, to determine whether those sites are bad sites.
  • the input of assessment-related data 120 to the computing device 110 may occur via a network.
  • FIG. 2 shows a flowchart of a site evaluation method 200 according to an embodiment of the present disclosure.
  • the site assessment method 200 may be performed by the computing device 110 in the site assessment environment 100 shown in FIG. 1 . It should be understood that the site evaluation method 200 may also include additional operations not shown and/or operations shown may be omitted, and the scope of the present disclosure is not limited in this respect.
  • computing device 110 obtains a set of Internet Protocol addresses associated with a site.
  • the computing device 110 may obtain the Internet Protocol address associated with the site for the same site multiple times within a predetermined period of time, thereby forming an Internet Protocol address set.
  • computing device 110 may obtain the set of Internet Protocol addresses associated with a site through a domain name service based on the Domain Name System (DNS).
  • domain name services can be provided by different operators, and the authority of the top-level domain query results they provide also varies. Accordingly, computing device 110 may use a more authoritative domain name service to obtain the set of Internet Protocol addresses associated with a site.
  • the computing device 110 may also acquire the Internet Protocol address set associated with the site through, for example, a search engine and a domain name query service provided by a company associated with the search engine.
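  • the repeated-query idea described for block 202 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names `resolve_with_system_dns` and `collect_ip_set`, the `rounds` and `interval_seconds` parameters, and the choice of the operating system's resolver are all assumptions made for the example.

```python
import socket
import time
from typing import Callable, Iterable, Set


def resolve_with_system_dns(domain: str) -> Set[str]:
    """Resolve a domain to IPv4 addresses with the system resolver (assumed choice)."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return {info[4][0] for info in infos}


def collect_ip_set(
    domain: str,
    resolve: Callable[[str], Iterable[str]] = resolve_with_system_dns,
    rounds: int = 5,
    interval_seconds: float = 0.0,
) -> Set[str]:
    """Query the same domain several times and merge all answers into one set."""
    addresses: Set[str] = set()
    for i in range(rounds):
        try:
            addresses.update(resolve(domain))
        except OSError:
            # A failed lookup contributes nothing; later rounds may still succeed.
            pass
        if interval_seconds > 0 and i < rounds - 1:
            time.sleep(interval_seconds)
    return addresses
```

  • passing the resolver as a parameter keeps the sketch independent of any particular domain name service, so a more authoritative service could be substituted without changing the collection loop.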
  • computing device 110 determines a set of geographic features associated with the set of Internet Protocol addresses obtained at block 202 .
  • the geographical feature in the geographical feature set indicates the geographic location of the server associated with the site.
  • the computing device 110 may first obtain a correspondence table indicating the correspondence between geographic locations and Internet Protocol address ranges, and then determine the geographic feature set based on the correspondence table.
  • the correspondence table may indicate that there is a correspondence between the IP address range 23.248.192.1-23.248.192.255 and "North America". Therefore, when the IP address associated with the site is 23.248.192.3 or 23.248.192.27, it can be determined that these two IP addresses are associated with the geographic location of "North America".
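  • the range lookup in this example can be sketched with Python's standard `ipaddress` module. The table layout (first address, last address, location) and the names `CORRESPONDENCE_TABLE`, `locate`, and `geographic_feature_set` are assumptions for illustration; only the single "North America" row comes from the text above.

```python
import ipaddress
from typing import Iterable, List, Optional, Set, Tuple

# (first address, last address, geographic location).
# Only this row is taken from the description; a real table would hold many ranges.
CORRESPONDENCE_TABLE: List[Tuple[str, str, str]] = [
    ("23.248.192.1", "23.248.192.255", "North America"),
]


def locate(ip: str, table: List[Tuple[str, str, str]] = CORRESPONDENCE_TABLE) -> Optional[str]:
    """Return the location whose address range contains ip, or None if no range matches."""
    addr = ipaddress.ip_address(ip)
    for first, last, location in table:
        if ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last):
            return location
    return None


def geographic_feature_set(
    ips: Iterable[str], table: List[Tuple[str, str, str]] = CORRESPONDENCE_TABLE
) -> Set[str]:
    """Map a set of IP addresses to the set of locations found in the table."""
    locations = (locate(ip, table) for ip in ips)
    return {loc for loc in locations if loc is not None}
```

  • the sketch assumes an IPv4-only table; comparing an IPv4 address against an IPv6 range would raise a `TypeError`, so a mixed table would need a version check first.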
  • the computing device 110 determines whether the site is a bad site based on the set of geographic features determined at block 204 . According to one or more embodiments of the present disclosure, the computing device 110 may determine whether a site belongs to a bad site according to information such as whether the geographical features in the geographical feature set include foreign geographic locations and the number of foreign geographic locations included.
  • FIG. 3 shows a flowchart of a site evaluation method 300 according to an embodiment of the present disclosure.
  • the site assessment method 300 may be performed by the computing device 110 in the site assessment environment 100 shown in FIG. 1 . It should be understood that the site evaluation method 300 may also include additional operations not shown and/or operations shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • computing device 110 obtains a set of Internet Protocol addresses associated with a site.
  • the content involved in block 302 is the same as that involved in block 202, and will not be repeated here.
  • computing device 110 determines a set of geographic features associated with the set of Internet Protocol addresses obtained at block 302.
  • the content involved in block 304 is the same as that involved in block 204, and will not be repeated here.
  • the computing device 110 determines whether the geographic locations indicated by the geographic features in the set of geographic features determined at block 304 are all domestic geographic locations. If it is determined at block 306 that the geographic locations indicated by the geographic features in the geographic feature set are all domestic geographic locations, the method 300 proceeds to block 308 ; otherwise, the method 300 proceeds to block 310 .
  • generally bad sites use foreign servers for the purpose of evading management and control. Therefore, if the Internet protocol addresses determined for a site are all associated with domestic geographic locations, the site is generally not a bad site.
  • computing device 110 determines that the site is not a bad site. As previously mentioned, when the Internet Protocol addresses determined for a site are all associated with domestic geographic locations, the site is usually not a bad site. Therefore, when the determination result of block 306 is "yes", the computing device 110 may determine in block 308 that the site is not a bad site, and the flow of the method 300 ends at this point.
  • computing device 110 determines whether the site belongs to a normal foreign language site. If it is determined at block 310 that the site belongs to a normal foreign language site, method 300 proceeds to block 308 ; otherwise, method 300 proceeds to block 312 .
  • the server of a normal foreign language site is usually located in a foreign geographic location, so the Internet Protocol addresses determined for such a site will be associated with a foreign geographic location. In this case, the site should be determined not to be a bad site on the basis that it is a normal foreign language site.
  • accordingly, when the determination result of block 310 is "yes", computing device 110 may determine in block 308 that the site is not a bad site, and the flow of method 300 ends at this point.
  • the computing device 110 may determine whether the site is a normal foreign language site from the domain name of the site. For example, if the domain name of the site includes a suffix such as .us or .jp, this indicates that the site is a foreign language site in the United States or Japan, and the site can then be considered a normal foreign language site.
  • the computing device 110 may first obtain the domain name associated with the site, and then determine whether the site is a normal foreign language site based on the domain name.
  • the computing device 110 may first obtain the domain name associated with the site, then determine the country associated with the site based on the domain name, and then determine that the site is a normal foreign language site if the determined country matches the foreign geographic location associated with the site. For example, if a site's domain name includes the suffix .jp, this indicates that the site is a foreign language site in Japan.
  • if the IP address associated with this site is 23.248.192.3, then according to the embodiment described for block 204, this IP address is associated with the geographic location "North America". Since the determined country "Japan" does not match the determined geographic location "North America", it cannot be determined that this site is a normal foreign language site.
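  • the domain-suffix check and the country/location match can be sketched as below. Both lookup tables are hypothetical placeholders, including the assignment of Japan to "Asia", and the function names are invented for the example; the description only states that the country derived from the suffix must match the foreign geographic location.

```python
from typing import Optional, Set

# Hypothetical lookup tables for illustration; real tables would be far larger.
CCTLD_TO_COUNTRY = {"us": "United States", "jp": "Japan"}
COUNTRY_TO_REGION = {"United States": "North America", "Japan": "Asia"}


def country_from_domain(domain: str) -> Optional[str]:
    """Return the country implied by the domain's country-code suffix, if known."""
    suffix = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(suffix)


def is_normal_foreign_site_by_domain(domain: str, foreign_locations: Set[str]) -> bool:
    """True only if the suffix country matches a foreign location seen for the site."""
    country = country_from_domain(domain)
    if country is None:
        return False
    # Accept a match at country level or at the (assumed) region level.
    region = COUNTRY_TO_REGION.get(country)
    return country in foreign_locations or region in foreign_locations
```

  • with these placeholder tables, a .jp domain whose addresses map only to "North America" is not confirmed as a normal foreign language site, which mirrors the mismatch described above.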
  • the computing device 110 may determine whether the site is a normal foreign language site based on the proportion of Chinese content included in the site. For example, computing device 110 may first determine the ratio of the Chinese content included in the site to all content included in the site, and when the determined ratio is less than a threshold ratio, e.g., 5%, determine that the site is a normal foreign language site.
  • the computing device 110 need not obtain all of the Chinese content and all of the content included in the site. Instead, it may sample the site, for example by randomly extracting the content of a predetermined number of pages of the site, and estimate by extrapolation the ratio of Chinese content to all content for the whole site. For example, if the ratio of Chinese content to all content in 10 randomly extracted pages of the site is 4%, it can be inferred that the ratio for the entire site is also 4%.
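  • the sampled ratio test can be sketched as follows. How "Chinese content" is measured is not specified in the text; counting characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) among non-whitespace characters is an assumption, as are the function names and the `seed` parameter.

```python
import random
from typing import Optional, Sequence


def chinese_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in U+4E00..U+9FFF (assumed measure)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    chinese = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return chinese / len(chars)


def sampled_chinese_ratio(
    pages: Sequence[str], sample_size: int = 10, seed: Optional[int] = None
) -> float:
    """Estimate the site-wide ratio from a random sample of pages."""
    rng = random.Random(seed)
    sample = rng.sample(list(pages), min(sample_size, len(pages)))
    return chinese_ratio("".join(sample))


def is_normal_foreign_site_by_content(
    pages: Sequence[str],
    threshold: float = 0.05,
    sample_size: int = 10,
    seed: Optional[int] = None,
) -> bool:
    """True if the sampled Chinese-content ratio is below the threshold (5% in the text)."""
    return sampled_chinese_ratio(pages, sample_size, seed) < threshold
```

  • one caveat of this character-range measure is that Japanese kanji share the same Unicode block, so a production system would likely need a proper language detector rather than a range check.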
  • computing device 110 determines the number of foreign geographic locations among the geographic locations indicated by the geographic features in the geographic feature set determined at block 304.
  • the computing device 110 may obtain the Internet Protocol address associated with the same site multiple times within a predetermined period of time, thereby forming an Internet Protocol address set. Therefore, the geographic locations indicated by the geographic features in the geographic feature set may include multiple foreign geographic locations. In this case, whether the site is a bad site may be determined according to the number of foreign geographic locations among the geographic locations indicated by the geographic features in the geographic feature set.
  • computing device 110 determines whether the number of foreign geographic locations determined at block 312 is greater than a threshold number. If it is determined at block 314 that the number of foreign geographic locations is greater than the threshold number, the method 300 proceeds to block 316; otherwise, the method 300 proceeds to block 318.
  • when the number of foreign geographic locations among the geographic locations indicated by the geographic features in the geographic feature set is greater than a threshold number, for example, 3, the site can be considered to exhibit Internet Protocol pool behavior.
  • Internet Protocol pool behavior refers to the situation in which the Internet Protocol addresses obtained for the same site over multiple queries within a predetermined period of time include a relatively large number of addresses, for example, 10, and these addresses are associated with multiple, for example, 3, different foreign geographic locations. In this case, the owner of the site is likely using an Internet Protocol pool to evade control, so the site can be considered a bad site.
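  • the counting step of blocks 312 and 314 can be sketched as below. The set of domestic locations is passed in as a parameter because the text does not enumerate it, the function names are invented for the example, and the strict "greater than" comparison follows the description of block 314.

```python
from typing import Iterable, Set


def count_foreign_locations(features: Iterable[str], domestic: Set[str]) -> int:
    """Count distinct locations in the feature set that are not domestic."""
    return len({f for f in features if f not in domestic})


def shows_ip_pool_behavior(
    features: Iterable[str], domestic: Set[str], threshold: int = 3
) -> bool:
    """True if the number of distinct foreign locations exceeds the threshold."""
    return count_foreign_locations(features, domestic) > threshold
```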
  • the following multiple IP addresses may be obtained within a predetermined period of time:
  • the corresponding relationship between the above-mentioned IP address and the geographic location can be determined as:
  • the site wap4.173kxs.com should be a bad site
  • computing device 110 determines that the site is a bad site. As mentioned above, when the number of foreign geographic locations among the geographic locations indicated by the geographic features in the geographic feature set determined for a site is greater than the threshold number, the site is determined to be a bad site because it exhibits Internet Protocol pool behavior. Therefore, when the determination result in block 314 is "yes", the computing device 110 may determine in block 316 that the site is a bad site, and the flow of the method 300 ends at this point.
  • the computing device 110 determines whether the site is a bad site based on the content that the site includes. According to one or more embodiments of the present disclosure, when it is still not possible to determine whether the site is a bad site through blocks 302 to 316 , it may be determined based on the content included in the site whether the site is a bad site.
  • a site may be determined to be bad when it includes content that indicates at least one of the following:
  • the content included in the site includes illegal information, for example, the content included in the site includes information such as pornography, gambling, and cults;
  • the content included in the site involves malicious scraping, for example, the content included in the site was scraped from other sites by crawlers and then pieced together;
  • the content included in the site involves piracy, for example, the content included in the site includes pirated novels, comics, etc.;
  • the content included in the site involves traffic diversion, for example, the content included in the site steers traffic to other sites;
  • the content included in the site involves traffic hijacking, for example, the content included in the site diverts traffic by means of Trojan horses and other forms;
  • the content included in the site involves resource flooding, for example, the content included in the site consists of low-value filler posts.
  • the content included in the site may be determined to indicate at least one of the above items by performing semantic recognition or image recognition on the content included in the site.
  • method 300 includes more steps than method 200 and may be considered an extension of method 200 .
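  • the decision flow of method 300 (blocks 306 to 318) can be summarized in a short sketch. The `Verdict` enum, the `evaluate_site` signature, and the use of callables for the foreign-language and content checks are assumptions chosen to keep the example self-contained; the block numbers in the comments refer to the flowchart described above.

```python
from enum import Enum
from typing import Callable, Set


class Verdict(Enum):
    NOT_BAD = "not bad"
    BAD = "bad"


def evaluate_site(
    features: Set[str],
    domestic: Set[str],
    is_normal_foreign_site: Callable[[], bool],
    content_is_bad: Callable[[], bool],
    threshold: int = 3,
) -> Verdict:
    """Walk the decision blocks of method 300 for one site."""
    foreign = features - domestic
    # Block 306 -> 308: all locations are domestic.
    if not foreign:
        return Verdict.NOT_BAD
    # Block 310 -> 308: the site is a normal foreign language site.
    if is_normal_foreign_site():
        return Verdict.NOT_BAD
    # Blocks 312 and 314 -> 316: too many distinct foreign locations.
    if len(foreign) > threshold:
        return Verdict.BAD
    # Block 318: fall back to inspecting the site's content.
    return Verdict.BAD if content_is_bad() else Verdict.NOT_BAD
```

  • the checks are ordered from cheapest to most expensive, so the content inspection of block 318 only runs for sites that the address-based steps cannot settle. Note that this sketch treats an empty feature set as all-domestic, which is a simplification.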
  • the above describes, with reference to FIGS. 1 to 3, the site assessment environment 100 in which the site assessment method in some embodiments of the present disclosure can be implemented, the site assessment method 200 according to an embodiment of the present disclosure, and the site assessment method 300 according to an embodiment of the present disclosure. It should be understood that the above description is intended to better present the content of the present disclosure and is not limiting in any way.
  • Fig. 4 is a schematic block diagram of a site evaluation device 400 according to an embodiment of the present disclosure.
  • the site evaluation apparatus 400 may include: a first obtaining module 410 configured to obtain a set of Internet Protocol addresses associated with a site; a first determining module 420 configured to determine a geographic feature set associated with the set of Internet Protocol addresses, the geographic features in the geographic feature set indicating the geographic location of a server associated with the site; and a second determining module 430 configured to determine whether the site is a bad site based on the geographic feature set.
  • the first determining module 420 includes: a second obtaining module (not shown), configured to obtain a correspondence table indicating the correspondence between geographic locations and Internet Protocol address ranges; and a third determining module (not shown), configured to determine the geographic feature set based on the correspondence table.
  • the second determining module 430 includes: a fourth determining module (not shown), configured to determine that the site is not a bad site if the geographic locations indicated by the geographic features in the geographic feature set are all domestic geographic locations.
  • the second determining module 430 includes: a fifth determining module (not shown), configured to determine that the geographic locations indicated by the geographic features in the geographic feature set include a foreign geographic location; a sixth determining module (not shown), configured to determine whether the site is a normal foreign language site; and a seventh determining module (not shown), configured to determine that the site is not a bad site if the site is determined to be a normal foreign language site.
  • the sixth determining module includes: a third obtaining module (not shown), configured to obtain a domain name associated with the site; and an eighth determining module (not shown), configured to determine whether the site is a normal foreign language site based on the domain name.
  • the eighth determining module includes: a ninth determining module (not shown), configured to determine the country associated with the site based on the domain name; and a tenth determining module (not shown), configured to determine that the site is a normal foreign language site if the country matches the foreign geographic location.
  • the sixth determining module includes: an eleventh determining module (not shown), configured to determine the ratio of the Chinese content included in the site to all content included in the site; and a twelfth determining module (not shown), configured to determine that the site is a normal foreign language site if the ratio is less than the threshold ratio.
  • the second determining module 430 includes: a thirteenth determining module (not shown), configured to determine the number of foreign geographic locations among the geographic locations indicated by the geographic features in the geographic feature set; and a fourteenth determining module (not shown), configured to determine whether the site is a bad site based on the number.
  • the fourteenth determining module includes: a fifteenth determining module (not shown), configured to determine that the site is a bad site if the number is greater than a threshold number.
  • the fourteenth determining module includes: a sixteenth determining module (not shown), configured to determine that the number is less than or equal to the threshold number; and a seventeenth determining module (not shown), configured to determine whether the site is a bad site based on the content included in the site.
  • the technical solution according to the embodiments of the present disclosure has many advantages over traditional solutions. For example, with the technical solution of the method, whether a site is a bad site can be judged based on the Internet Protocol addresses associated with the site, which reduces the cost of identifying bad sites and thereby improves the quality and efficiency of site link crawling and site library maintenance.
  • the total number of sites on the Internet may exceed 3 billion. The geographic features of about 20% of these sites indicate that the sites are associated with foreign geographic locations, and among this 20%, about 80% exhibit Internet Protocol pool behavior and about 10% are normal foreign language sites. Therefore, with the technical solutions according to the embodiments of the present disclosure, whether about 600 million sites are bad sites can be determined efficiently from the sites' Internet Protocol addresses, which cannot easily be achieved with traditional solutions.
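  • the figures quoted above can be checked with a few lines of integer arithmetic. The variable names are illustrative, and reading the remaining share as the sites left for the content-based check is an inference from the flow of method 300 rather than a statement in the text.

```python
# Approximate figures quoted in the description.
TOTAL_SITES = 3_000_000_000
FOREIGN_PERCENT = 20        # share of all sites with foreign geographic features
POOL_PERCENT = 80           # share of those sites showing IP pool behavior
NORMAL_FOREIGN_PERCENT = 10 # share of those sites that are normal foreign language sites

foreign_sites = TOTAL_SITES * FOREIGN_PERCENT // 100
pool_sites = foreign_sites * POOL_PERCENT // 100
normal_foreign_sites = foreign_sites * NORMAL_FOREIGN_PERCENT // 100
# Inferred: sites that neither rule settles and that would need a content check.
remaining_sites = foreign_sites - pool_sites - normal_foreign_sites

print(foreign_sites, pool_sites, normal_foreign_sites, remaining_sites)
```

  • under these figures, about 600 million sites have foreign geographic features, of which roughly 480 million show Internet Protocol pool behavior and roughly 60 million are normal foreign language sites.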
  • FIG. 5 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure.
  • the computing device 110 shown in FIG. 1 and the site evaluation apparatus 400 shown in FIG. 4 may be implemented by the electronic device 500 .
  • Electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 500 includes a computing unit 501 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random-access memory (RAM) 503. The RAM 503 can also store various programs and data necessary for the operation of the device 500.
  • the computing unit 501, ROM 502, and RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to the bus 504 .
  • multiple components of the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disk; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, etc.
  • the computing unit 501 executes various methods and processes described above, such as the site evaluation methods 200 and 300 .
  • site assessment methods 200 and 300 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508 .
  • part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509.
  • the computer program When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of site assessment methods 200 and 300 described above may be performed.
  • the computing unit 501 may be configured to execute the site assessment methods 200 and 300 in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips Implemented in a system of systems (SOC), load programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
  • A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
  • A computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other.
  • It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above.
  • For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this respect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

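The present disclosure provides a site evaluation method (200), an apparatus, an electronic device, a storage medium, and a program product. It relates to the fields of network security and content recommendation and can be applied to site link crawling and site library maintenance scenarios. The method includes: obtaining a set of Internet Protocol (IP) addresses associated with a site (202); determining a set of geographic features associated with the set of IP addresses (204), where a geographic feature in the set indicates the geographic location of a server associated with the site; and determining, based on the set of geographic features, whether the site is a bad site (206). With this method, whether a site is a bad site can be judged from the IP addresses associated with the site, which lowers the cost of identifying bad sites and can therefore improve the quality and efficiency of site link crawling and site library maintenance.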

Description

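Site evaluation method, apparatus, electronic device, storage medium, and program product
Technical Field
The present disclosure relates to the fields of network security and content recommendation, can be applied to site link crawling and site library maintenance scenarios, and more specifically relates to a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Tens of thousands of new domain names, or more, are registered on the Internet every day, and sites appear and disappear even more frequently. Ordinary users are usually unaware of the life cycle of well-known sites that have existed for years or even decades, but for the Internet as a whole the memory of a site's existence is short-lived. With the rapid development of technologies for creating and maintaining sites, some people create sites in batches and run many of them in parallel to produce resources for the black and gray market, and they often switch sites frequently to evade regulation while continuing to provide services. Such sites usually contain worthless junk content and objectionable content, and are therefore also called bad sites. If these bad sites are allowed to enter the normal Internet ecosystem and appear before the public, they not only degrade users' experience of the Internet but also, to some extent, encourage the spread of harmful information. In addition, if a site library contains too many bad sites, users' query experience is seriously affected.
However, conventional site evaluation techniques cannot solve the above problems with high quality and high efficiency.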
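Summary
According to embodiments of the present disclosure, a site evaluation method, a site evaluation apparatus, an electronic device, a computer-readable storage medium, and a computer program product are provided.
In a first aspect of the present disclosure, a site evaluation method is provided, including: obtaining a set of IP addresses associated with a site; determining a set of geographic features associated with the set of IP addresses, where a geographic feature in the set indicates the geographic location of a server associated with the site; and determining, based on the set of geographic features, whether the site is a bad site.
In a second aspect of the present disclosure, a site evaluation apparatus is provided, including: a first obtaining module configured to obtain a set of IP addresses associated with a site; a first determining module configured to determine a set of geographic features associated with the set of IP addresses, where a geographic feature in the set indicates the geographic location of a server associated with the site; and a second determining module configured to determine, based on the set of geographic features, whether the site is a bad site.
In a third aspect of the present disclosure, an electronic device is provided, including at least one processor and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to implement the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions cause a computer to implement the method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, performs the method according to the first aspect of the present disclosure.
The technology of the present application provides a site evaluation method. With its technical solution, whether a site is a bad site can be judged from the IP addresses associated with the site, which lowers the cost of identifying bad sites and can therefore improve the quality and efficiency of site link crawling and site library maintenance.
It should be understood that the content described in this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.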
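Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of exemplary embodiments given below with reference to the accompanying drawings, in which the same reference numerals generally denote the same components. It should be understood that the drawings are provided for a better understanding of the solution and do not limit the present disclosure. In the drawings:
FIG. 1 shows a schematic block diagram of a site evaluation environment 100 in which the site evaluation method of some embodiments of the present disclosure can be implemented;
FIG. 2 shows a flowchart of a site evaluation method 200 according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a site evaluation method 300 according to an embodiment of the present disclosure;
FIG. 4 shows a schematic block diagram of a site evaluation apparatus 400 according to an embodiment of the present disclosure; and
FIG. 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.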
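Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure is thorough and complete and fully conveys its scope to those skilled in the art.
As used herein, the term "including" and its variants denote open-ended inclusion, that is, "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
As described in the Background above, conventional site evaluation techniques cannot solve the above problems with high quality and high efficiency. Specifically, conventional solutions rely purely on classification models. Site evaluation with a classification model is essentially based on the content of the site, so for each site it may be necessary to extract ten, a hundred, or even more pieces of information to support the evaluation. Consequently, evaluating 3 billion sites on the Internet may require running the classification model on 30 billion to 300 billion pieces of information. Conventional site evaluation techniques are therefore inefficient. Moreover, because the information extracted from sites takes many forms, falls into many content categories, and differs widely from site to site, classification models also suffer from quality problems when evaluating sites.
To at least partially solve one or more of the above problems and other potential problems, and considering that most sites controlled by people who aim to produce junk content and objectionable content share certain characteristics, the present application proposes a site evaluation method. With its technical solution, whether a site is a bad site can be judged from the IP addresses associated with the site, which lowers the cost of identifying bad sites and can therefore improve the quality and efficiency of site link crawling and site library maintenance.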
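FIG. 1 shows a schematic block diagram of a site evaluation environment 100 in which the site evaluation method of some embodiments of the present disclosure can be implemented. According to one or more embodiments of the present disclosure, the site evaluation environment 100 may be a cloud environment. As shown in FIG. 1, the site evaluation environment 100 includes a computing device 110. In the site evaluation environment 100, evaluation-related data 120 is provided to the computing device 110 as input. The evaluation-related data 120 may include, for example, the domain name of a site or a set of IP addresses associated with the site. According to one or more embodiments of the present disclosure, querying the IP address associated with the same site at different times may return different IP addresses. Therefore, the IP address associated with a site may be obtained multiple times within a predetermined time period to form the set of IP addresses. The set may include one or more IP addresses.
After obtaining the set of IP addresses associated with the site, the computing device 110 may determine a set of geographic features associated with the set of IP addresses. According to one or more embodiments of the present disclosure, a geographic feature in the set indicates the geographic location of a server associated with the site.
After determining the set of geographic features associated with the site and the set of IP addresses, the computing device 110 may determine, based on the set of geographic features, whether the site is a bad site.
It should be understood that the set of geographic features associated with the site and the set of IP addresses may also be provided directly to the computing device 110 as part of the evaluation-related data 120. In that case, the computing device 110 does not need to determine the set of geographic features and can use it directly to determine whether the site is a bad site.
According to one or more embodiments of the present disclosure, the evaluation-related data 120 may further include a correspondence table used to determine the set of geographic features, where the table indicates the correspondence between geographic locations and IP address ranges. In that case, the computing device 110 may use the obtained set of IP addresses, together with the correspondence table, to determine the set of geographic features.
According to one or more embodiments of the present disclosure, the evaluation-related data 120 may further include the Chinese content and/or the entire content associated with the site. In that case, the computing device 110 may determine whether the site is a bad site based not only on the set of geographic features but also on the Chinese content and/or the entire content associated with the site.
It should be understood that the site evaluation environment 100 is merely illustrative rather than limiting, and that it is scalable: it may include more computing devices 110, and more evaluation-related data 120 may be provided to the computing devices 110 as input, so that more users can use more computing devices 110, and more evaluation-related data 120, to evaluate more sites simultaneously or non-simultaneously and determine whether those sites are bad sites.
In the site evaluation environment 100 shown in FIG. 1, the evaluation-related data 120 may be input to the computing device 110 over a network.
FIG. 2 shows a flowchart of a site evaluation method 200 according to an embodiment of the present disclosure. Specifically, the site evaluation method 200 may be performed by the computing device 110 in the site evaluation environment 100 shown in FIG. 1. It should be understood that the site evaluation method 200 may also include additional operations not shown and/or may omit operations that are shown; the scope of the present disclosure is not limited in this respect.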
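At block 202, the computing device 110 obtains a set of IP addresses associated with a site. According to one or more embodiments of the present disclosure, the computing device 110 may obtain the IP address associated with the same site multiple times within a predetermined time period to form the set of IP addresses.
According to some embodiments of the present disclosure, the computing device 110 may obtain the set of IP addresses associated with the site through the domain name service provided by the Domain Name System (DNS). Domain name services may be provided by different operators, and the authority of the top-level domain queries provided by different operators may differ. The computing device 110 may therefore use a more authoritative domain name service to obtain the set of IP addresses associated with the site.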
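Block 202 can be sketched as repeated DNS lookups whose answers are merged into one set. The following is a minimal sketch rather than the disclosed implementation: the function names, the number of rounds, and the interval are assumptions made for illustration, and the resolver is injectable so that a more authoritative DNS service (or a stub) can be substituted for the system resolver.

```python
import socket
import time
from typing import Callable, Iterable, Set


def resolve_once(domain: str) -> Set[str]:
    """Ask the system resolver for the IPv4 addresses of `domain` right now."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def collect_ip_set(
    domain: str,
    rounds: int = 5,            # hypothetical: number of lookups in the time period
    interval_s: float = 60.0,   # hypothetical: pause between lookups
    resolver: Callable[[str], Iterable[str]] = resolve_once,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Query the same domain several times and merge every address seen."""
    ips: Set[str] = set()
    for i in range(rounds):
        try:
            ips |= set(resolver(domain))
        except OSError:
            # A failed lookup (socket.gaierror is an OSError) adds nothing.
            pass
        if i < rounds - 1:
            sleep(interval_s)
    return ips
```

Passing a custom `resolver` keeps the collection logic independent of how the lookup is actually performed, which also makes it easy to exercise without network access.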
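According to other embodiments of the present disclosure, the computing device 110 may also obtain the set of IP addresses associated with the site by other means, for example through a search engine or through a domain name query service provided by a company associated with a search engine.
At block 204, the computing device 110 determines a set of geographic features associated with the set of IP addresses obtained at block 202. According to one or more embodiments of the present disclosure, a geographic feature in the set indicates the geographic location of a server associated with the site.
According to one or more embodiments of the present disclosure, the computing device 110 may first obtain a correspondence table that indicates the correspondence between geographic locations and IP address ranges, and then determine the set of geographic features associated with the set of IP addresses based on that table. For example, the correspondence table may indicate that the IP address range 23.248.192.1-23.248.192.255 corresponds to "North America". Thus, when the IP address associated with a site is 23.248.192.3 or 23.248.192.27, both addresses can be determined to be associated with the geographic location "North America".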
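The table lookup in block 204 can be illustrated as follows. This is a sketch under stated assumptions: the row layout `(first address, last address, label)` and the names `EXAMPLE_TABLE`, `locate`, and `geo_features` are hypothetical; only the single range shown comes from the text. A linear scan is used for clarity; a real table would more likely be sorted and searched by bisection.

```python
import ipaddress
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical row layout: (first address, last address, location label).
Range = Tuple[str, str, str]

EXAMPLE_TABLE: List[Range] = [
    ("23.248.192.1", "23.248.192.255", "North America"),  # range given in the text
]


def locate(ip: str, table: Iterable[Range]) -> Optional[str]:
    """Return the label of the first range containing `ip`, or None if unmapped."""
    addr = ipaddress.ip_address(ip)
    for first, last, label in table:
        if ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last):
            return label
    return None


def geo_features(ips: Iterable[str], table: List[Range]) -> Set[str]:
    """Map every address to its location and keep the distinct labels."""
    labels = (locate(ip, table) for ip in ips)
    return {label for label in labels if label is not None}
```

This sketch assumes the table and the queried addresses use the same IP version; comparing an IPv4 address with an IPv6 address raises `TypeError` in the `ipaddress` module.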
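At block 206, the computing device 110 determines, based on the set of geographic features determined at block 204, whether the site is a bad site. According to one or more embodiments of the present disclosure, the computing device 110 may make this determination from information such as whether the geographic features in the set include foreign geographic locations and how many foreign geographic locations they include.
FIG. 3 shows a flowchart of a site evaluation method 300 according to an embodiment of the present disclosure. Specifically, the site evaluation method 300 may be performed by the computing device 110 in the site evaluation environment 100 shown in FIG. 1. It should be understood that the site evaluation method 300 may also include additional operations not shown and/or may omit operations that are shown; the scope of the present disclosure is not limited in this respect.
At block 302, the computing device 110 obtains a set of IP addresses associated with a site. Block 302 is the same as block 202 and is not described again here.
At block 304, the computing device 110 determines a set of geographic features associated with the set of IP addresses obtained at block 302. Block 304 is the same as block 204 and is not described again here.
At block 306, the computing device 110 determines whether the geographic locations indicated by the geographic features in the set determined at block 304 are all domestic geographic locations. If so, the method 300 proceeds to block 308; otherwise, the method 300 proceeds to block 310.
According to one or more embodiments of the present disclosure, bad sites generally use foreign servers for purposes such as evading regulation. Therefore, if all IP addresses determined for a site are associated with domestic geographic locations, the site is usually not a bad site.
At block 308, the computing device 110 determines that the site is not a bad site. As noted above, a site whose IP addresses are all associated with domestic geographic locations is usually not a bad site, so when the result at block 306 is "yes", the computing device 110 may determine at block 308 that the site is not a bad site, and the flow of the method 300 ends.
At block 310, the computing device 110 determines whether the site is a normal foreign-language site. If so, the method 300 proceeds to block 308; otherwise, the method 300 proceeds to block 312. According to one or more embodiments of the present disclosure, the servers of a normal foreign-language site are usually located abroad, so the IP addresses determined for such a site are associated with foreign geographic locations. In this case, the site should be determined not to be a bad site on the basis that it is a normal foreign-language site.
As noted above, a normal foreign-language site is not a bad site, so when the result at block 310 is "yes", the computing device 110 may determine at block 308 that the site is not a bad site, and the flow of the method 300 ends.
According to one or more embodiments of the present disclosure, the computing device 110 may determine whether a site is a normal foreign-language site from the site's domain name. For example, if the domain name has a suffix such as .us or .jp, the site is a foreign-language site of the United States or Japan and can be considered a normal foreign-language site.
Thus, according to some embodiments of the present disclosure, the computing device 110 may first obtain the domain name associated with the site and then determine, based on the domain name, whether the site is a normal foreign-language site.
According to other embodiments of the present disclosure, even if a site is determined to be a foreign-language site, it is still necessary to consider whether the country associated with the foreign-language site matches the geographic location associated with the site, and the site is determined to be a normal foreign-language site only when they match. In this case, the computing device 110 may first obtain the domain name associated with the site, then determine the country associated with the site based on the domain name, and then determine that the site is a normal foreign-language site if that country matches the foreign geographic location associated with the site. For example, if a site's domain name has the suffix .jp, the site is a Japanese foreign-language site. However, if the IP address associated with the site is 23.248.192.3, then, following the embodiment described for block 204, that IP address is associated with the geographic location "North America". Because the determined country "Japan" does not match the determined geographic location "North America", the site cannot be determined to be a normal foreign-language site.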
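The domain-suffix check with country matching can be sketched as below. Both lookup tables are hypothetical stand-ins, as is the rule that every observed foreign location must be compatible with the domain's country; the text only states that the country and the foreign geographic location must match.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from country-code suffix to country.
CCTLD_TO_COUNTRY: Dict[str, str] = {"us": "United States", "jp": "Japan"}

# Hypothetical: location labels considered compatible with each country.
COUNTRY_TO_LOCATIONS: Dict[str, Set[str]] = {
    "United States": {"United States", "North America"},
    "Japan": {"Japan"},
}


def country_from_domain(domain: str) -> Optional[str]:
    """Derive a country from the last label of the domain, if it is a known suffix."""
    suffix = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(suffix)


def is_normal_foreign_site(domain: str, foreign_locations: Set[str]) -> bool:
    """True only if the domain's country matches every observed foreign location."""
    country = country_from_domain(domain)
    if country is None or not foreign_locations:
        return False
    accepted = COUNTRY_TO_LOCATIONS.get(country, {country})
    return foreign_locations <= accepted


# The example from the text: a .jp domain served from "North America" does not match.
print(is_normal_foreign_site("example.jp", {"North America"}))
```

With the tables above, a `.jp` domain whose addresses all map to "Japan" would pass, while the mismatch described in the text would not.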
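According to still other embodiments of the present disclosure, the computing device 110 may determine whether a site is a normal foreign-language site based on the proportion of Chinese content on the site. For example, the computing device 110 may first determine the ratio of the site's Chinese content to its entire content, and determine that the site is a normal foreign-language site when that ratio is below a threshold ratio, for example 5%. It should be understood that, when determining this ratio, the computing device 110 does not need to obtain all of the Chinese content and all of the content on the site; instead, it can sample the site, for example by randomly extracting the corresponding content from a predetermined number of the site's pages, and extrapolate the ratio for the whole site from the sample. For example, if the ratio of Chinese content to entire content in 10 randomly extracted pages of the site is 4%, the ratio for the whole site can be extrapolated to be 4% as well.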
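The sampling-based ratio estimate can be sketched as follows. The text does not say how "Chinese content" is measured, so this sketch assumes a character count over the CJK Unified Ideographs block (U+4E00 to U+9FFF) against all non-whitespace characters; that range also covers Japanese kanji, so it is only a rough proxy. The function names and defaults are illustrative.

```python
import random
from typing import Optional, Sequence


def is_chinese_char(ch: str) -> bool:
    """Assumption: a character in U+4E00..U+9FFF counts as Chinese content."""
    return "\u4e00" <= ch <= "\u9fff"


def chinese_ratio(pages: Sequence[str]) -> float:
    """Share of non-whitespace characters in `pages` that are Chinese."""
    total = sum(1 for page in pages for ch in page if not ch.isspace())
    chinese = sum(1 for page in pages for ch in page if is_chinese_char(ch))
    return chinese / total if total else 0.0


def estimate_ratio(all_pages: Sequence[str], sample_size: int = 10,
                   seed: Optional[int] = None) -> float:
    """Estimate the site-wide ratio from a random sample of pages."""
    rng = random.Random(seed)
    k = min(sample_size, len(all_pages))
    return chinese_ratio(rng.sample(list(all_pages), k))


def looks_like_foreign_language_site(all_pages: Sequence[str],
                                     threshold: float = 0.05,
                                     sample_size: int = 10,
                                     seed: Optional[int] = None) -> bool:
    """Apply the threshold from the text (5% by default) to the sampled estimate."""
    return estimate_ratio(all_pages, sample_size, seed) < threshold
```

Sampling a fixed number of pages keeps the cost per site roughly constant, which is the point of extrapolating rather than reading the whole site.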
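At block 312, the computing device 110 determines the number of foreign geographic locations among the geographic locations indicated by the geographic features in the set determined at block 304. According to one or more embodiments of the present disclosure, the computing device 110 may obtain the IP address associated with the same site multiple times within a predetermined time period to form the set of IP addresses. The geographic locations indicated by the geographic features in the set may therefore include multiple foreign geographic locations. In this case, whether the site is a bad site can be determined from the number of foreign geographic locations included.
At block 314, the computing device 110 determines whether the number of foreign geographic locations determined at block 312 is greater than a threshold number. If so, the method 300 proceeds to block 316; otherwise, the method 300 proceeds to block 318.
According to one or more embodiments of the present disclosure, when the number of foreign geographic locations among the geographic locations indicated by the geographic features in the set is greater than a threshold number, for example 3, the site can be considered to be using an IP pool. An IP pool refers to the situation in which the IP addresses obtained for the same site over multiple queries within a predetermined time period include many addresses, for example 10, and those addresses are associated with multiple, for example 3, different foreign geographic locations. In this case, the owner of the site is very likely using an IP pool to evade regulation, so the site can be considered a bad site.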
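For example, for the site wap4.173kxs.com, the following IP addresses may be obtained within a predetermined time period:
23.248.192.27
23.235.160.147
103.37.3.50
156.234.80.107
23.248.192.3
23.226.55.155
23.248.192.35
156.234.80.234
103.75.47.123
103.37.0.170
23.248.199.139
43.241.46.91
23.248.192.11
43.241.46.99
23.248.196.234
Then, based on the correspondence table that indicates the correspondence between geographic locations and IP address ranges, the above IP addresses can be mapped to geographic locations as follows:
23.248.192.27 [North America]
23.235.160.147 [Japan]
103.37.3.50 [Europe]
156.234.80.107 [Japan]
23.248.192.3 [North America]
23.226.55.155 [Europe]
23.248.192.35 [North America]
156.234.80.234 [Japan]
103.75.47.123 [South Korea]
103.37.0.170 [Europe]
23.248.199.139 [North America]
43.241.46.91 [Europe]
23.248.192.11 [North America]
43.241.46.99 [Europe]
23.248.196.234 [North America]
Because as many as 15 IP addresses are associated with the site wap4.173kxs.com, and they cover 4 different foreign geographic locations (North America, Japan, Europe, and South Korea), the site wap4.173kxs.com should be considered a bad site.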
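The worked example above can be reproduced with a few lines of code. The address-to-location pairs are copied from the example; the `DOMESTIC` label set, the function names, and the default threshold of 3 (the example value mentioned for block 314) are assumptions for illustration.

```python
from typing import Dict, Set

# Address-to-location pairs from the wap4.173kxs.com example.
OBSERVED: Dict[str, str] = {
    "23.248.192.27": "North America",
    "23.235.160.147": "Japan",
    "103.37.3.50": "Europe",
    "156.234.80.107": "Japan",
    "23.248.192.3": "North America",
    "23.226.55.155": "Europe",
    "23.248.192.35": "North America",
    "156.234.80.234": "Japan",
    "103.75.47.123": "South Korea",
    "103.37.0.170": "Europe",
    "23.248.199.139": "North America",
    "43.241.46.91": "Europe",
    "23.248.192.11": "North America",
    "43.241.46.99": "Europe",
    "23.248.196.234": "North America",
}

# Hypothetical label for domestic locations; the text only says "domestic".
DOMESTIC: Set[str] = {"China"}


def foreign_locations(ip_to_location: Dict[str, str], domestic: Set[str]) -> Set[str]:
    """Distinct location labels that are not domestic."""
    return {loc for loc in ip_to_location.values() if loc not in domestic}


def uses_ip_pool(ip_to_location: Dict[str, str], domestic: Set[str],
                 threshold: int = 3) -> bool:
    """Blocks 312/314: more distinct foreign locations than the threshold."""
    return len(foreign_locations(ip_to_location, domestic)) > threshold


print(len(OBSERVED), sorted(foreign_locations(OBSERVED, DOMESTIC)))
print(uses_ip_pool(OBSERVED, DOMESTIC))
```

Counting distinct labels rather than addresses matters here: fifteen addresses collapse to four locations, and it is the four that is compared with the threshold.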
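At block 316, the computing device 110 determines that the site is a bad site. As noted above, when the number of foreign geographic locations among the geographic locations indicated by the geographic features determined for a site is greater than the threshold number, the site is determined to be a bad site because it uses an IP pool. Therefore, when the result at block 314 is "yes", the computing device 110 may determine at block 316 that the site is a bad site, and the flow of the method 300 ends.
At block 318, the computing device 110 determines whether the site is a bad site based on the content of the site. According to one or more embodiments of the present disclosure, when blocks 302 to 316 still cannot determine whether the site is a bad site, the determination can be made based on the content of the site.
For example, the site can be determined to be a bad site when its content indicates at least one of the following:
the content includes illegal information, for example pornography, gambling, or cult-related information;
the content involves abusive scraping, for example content pieced together from material crawled from other sites;
the content involves piracy, for example pirated novels or comics;
the content involves traffic diversion, for example directing traffic to other sites;
the content involves traffic hijacking, for example traffic redirection carried out by means such as Trojans;
the content involves resource flooding, for example spam posts.
It should be understood that semantic recognition, image recognition, or similar techniques can be applied to the content of the site to determine whether it indicates at least one of the above.
It should be understood that the method 300 includes more steps than the method 200 and can be regarded as an extension of the method 200.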
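The decision flow of the method 300 (blocks 306 to 318) can be summarized in a single function. This is a sketch, not the disclosed implementation: the function name, the callback parameters, and the treatment of an empty location set as "all domestic" are assumptions. The normal-foreign-site check and the content check are passed in as callables so that any of the embodiments described above can be plugged in.

```python
from typing import Callable, Set


def evaluate_site(
    locations: Set[str],                            # output of block 304
    domestic: Set[str],                             # labels treated as domestic
    is_normal_foreign: Callable[[Set[str]], bool],  # block 310 check
    content_is_bad: Callable[[], bool],             # block 318 check
    threshold: int = 3,                             # example threshold from the text
) -> bool:
    """Return True if the site is judged to be a bad site."""
    foreign = locations - domestic
    # Block 306 -> 308: every location is domestic, so the site is not bad.
    if not foreign:
        return False
    # Block 310 -> 308: a normal foreign-language site is not bad.
    if is_normal_foreign(foreign):
        return False
    # Blocks 312/314 -> 316: too many distinct foreign locations suggests an IP pool.
    if len(foreign) > threshold:
        return True
    # Block 318: fall back to the (more expensive) content-based check.
    return content_is_bad()
```

Ordering the checks this way means the content-based check, the costly step that conventional solutions apply to every site, is only invoked for sites that the cheaper IP-based checks cannot settle.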
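The site evaluation environment 100 in which the site evaluation method of some embodiments of the present disclosure can be implemented, the site evaluation method 200 according to an embodiment of the present disclosure, and the site evaluation method 300 according to an embodiment of the present disclosure have been described above with reference to FIGS. 1 to 3. It should be understood that the above description is intended to better present the content of the present disclosure and is not limiting in any way.
It should be understood that the numbers of elements and the magnitudes of physical quantities used in the drawings of the present disclosure are merely examples and do not limit the scope of protection of the present disclosure. These numbers and magnitudes can be set as needed without affecting the normal implementation of the embodiments of the present disclosure.
The details of the site evaluation method 200 and the site evaluation method 300 according to embodiments of the present disclosure have been described above with reference to FIGS. 1 to 3. The modules of the site evaluation apparatus are described below with reference to FIG. 4.
FIG. 4 is a schematic block diagram of a site evaluation apparatus 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the site evaluation apparatus 400 may include: a first obtaining module 410 configured to obtain a set of IP addresses associated with a site; a first determining module 420 configured to determine a set of geographic features associated with the set of IP addresses, where a geographic feature in the set indicates the geographic location of a server associated with the site; and a second determining module 430 configured to determine, based on the set of geographic features, whether the site is a bad site.
In one or more embodiments, the first determining module 420 includes: a second obtaining module (not shown) configured to obtain a correspondence table that indicates the correspondence between geographic locations and IP address ranges; and a third determining module (not shown) configured to determine the set of geographic features based on the correspondence table.
In one or more embodiments, the second determining module 430 includes: a fourth determining module (not shown) configured to determine that the site is not a bad site if the geographic locations indicated by the geographic features in the set are all domestic geographic locations.
In one or more embodiments, the second determining module 430 includes: a fifth determining module (not shown) configured to determine that the geographic locations indicated by the geographic features in the set include a foreign geographic location; a sixth determining module (not shown) configured to determine whether the site is a normal foreign-language site; and a seventh determining module (not shown) configured to determine that the site is not a bad site if the site is determined to be a normal foreign-language site.
In one or more embodiments, the sixth determining module includes: a third obtaining module (not shown) configured to obtain the domain name associated with the site; and an eighth determining module (not shown) configured to determine, based on the domain name, whether the site is a normal foreign-language site.
In one or more embodiments, the eighth determining module includes: a ninth determining module (not shown) configured to determine the country associated with the site based on the domain name; and a tenth determining module (not shown) configured to determine that the site is a normal foreign-language site if the country matches the foreign geographic location.
In one or more embodiments, the sixth determining module includes: an eleventh determining module (not shown) configured to determine the ratio of the site's Chinese content to its entire content; and a twelfth determining module (not shown) configured to determine that the site is a normal foreign-language site if the ratio is below a threshold ratio.
In one or more embodiments, the second determining module 430 includes: a thirteenth determining module (not shown) configured to determine the number of foreign geographic locations among the geographic locations indicated by the geographic features in the set; and a fourteenth determining module (not shown) configured to determine, based on that number, whether the site is a bad site.
In one or more embodiments, the fourteenth determining module includes: a fifteenth determining module (not shown) configured to determine that the site is a bad site if the number is greater than a threshold number.
In one or more embodiments, the fourteenth determining module includes: a sixteenth determining module (not shown) configured to determine that the number is less than or equal to the threshold number; and a seventeenth determining module (not shown) configured to determine whether the site is a bad site based on the content of the site.
As the above description with reference to FIGS. 1 to 4 shows, the technical solutions according to embodiments of the present disclosure have many advantages over conventional solutions. For example, with the technical solution of this method, whether a site is a bad site can be judged from the IP addresses associated with the site, which lowers the cost of identifying bad sites and can therefore improve the quality and efficiency of site link crawling and site library maintenance.
According to statistics, the total number of sites on the Internet may exceed 3 billion. For about 20% of them, the geographic features indicate that the site is associated with a foreign geographic location; of that 20%, about 80% use an IP pool and about 10% are normal foreign-language sites. With the technical solutions according to embodiments of the present disclosure, whether about 600 million sites are bad sites can therefore be determined efficiently from their IP addresses, which cannot easily be achieved with conventional solutions.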
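The figures in the paragraph above can be checked with a little arithmetic. The following only restates the stated percentages; the symbols $N$, $N_{\text{foreign}}$, $N_{\text{pool}}$, and $N_{\text{normal}}$ are introduced here for illustration and do not appear in the original.

```latex
\begin{align*}
N &\approx 3 \times 10^{9} \\
N_{\text{foreign}} &\approx 0.20 \, N = 6 \times 10^{8} \\
N_{\text{pool}} &\approx 0.80 \, N_{\text{foreign}} = 4.8 \times 10^{8} \\
N_{\text{normal}} &\approx 0.10 \, N_{\text{foreign}} = 6 \times 10^{7}
\end{align*}
```

Under these figures, roughly $5.4 \times 10^{8}$ of the $6 \times 10^{8}$ foreign-hosted sites would be settled by the IP pool check or the normal foreign-language check; the remaining 10% or so would presumably fall through to the content-based check of block 318.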
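FIG. 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 shown in FIG. 1 and the site evaluation apparatus 400 shown in FIG. 4 may be implemented by the electronic device 500. The electronic device 500 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 5, the device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Multiple components of the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 executes the methods and processes described above, such as the site evaluation methods 200 and 300. For example, in some embodiments, the site evaluation methods 200 and 300 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the site evaluation methods 200 and 300 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to execute the site evaluation methods 200 and 300 in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this respect.
The specific embodiments above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.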

Claims (23)

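  1. A site evaluation method, comprising:
    obtaining a set of Internet Protocol addresses associated with a site;
    determining a set of geographic features associated with the set of Internet Protocol addresses, wherein a geographic feature in the set of geographic features indicates the geographic location of a server associated with the site; and
    determining, based on the set of geographic features, whether the site is a bad site.
  2. The method according to claim 1, wherein determining the set of geographic features comprises:
    obtaining a correspondence table, the correspondence table indicating the correspondence between geographic locations and Internet Protocol address ranges; and
    determining the set of geographic features based on the correspondence table.
  3. The method according to claim 1, wherein determining whether the site is a bad site comprises:
    if the geographic locations indicated by the geographic features in the set of geographic features are all domestic geographic locations, determining that the site is not a bad site.
  4. The method according to claim 1, wherein determining whether the site is a bad site comprises:
    determining that the geographic locations indicated by the geographic features in the set of geographic features include a foreign geographic location;
    determining whether the site is a normal foreign-language site; and
    if the site is determined to be a normal foreign-language site, determining that the site is not a bad site.
  5. The method according to claim 4, wherein determining whether the site is a normal foreign-language site comprises:
    obtaining a domain name associated with the site; and
    determining, based on the domain name, whether the site is a normal foreign-language site.
  6. The method according to claim 5, wherein determining whether the site is a normal foreign-language site comprises:
    determining, based on the domain name, a country associated with the site; and
    if the country matches the foreign geographic location, determining that the site is a normal foreign-language site.
  7. The method according to claim 4, wherein determining whether the site is a normal foreign-language site comprises:
    determining the ratio of the Chinese content included in the site to the entire content included in the site; and
    if the ratio is less than a threshold ratio, determining that the site is a normal foreign-language site.
  8. The method according to claim 1, wherein determining whether the site is a bad site comprises:
    determining the number of foreign geographic locations among the geographic locations indicated by the geographic features in the set of geographic features; and
    determining, based on the number, whether the site is a bad site.
  9. The method according to claim 8, wherein determining whether the site is a bad site comprises:
    if the number is greater than a threshold number, determining that the site is a bad site.
  10. The method according to claim 8, wherein determining whether the site is a bad site comprises:
    determining that the number is less than or equal to a threshold number; and
    determining, based on the content included in the site, whether the site is a bad site.
  11. A site evaluation apparatus, comprising:
    a first obtaining module configured to obtain a set of Internet Protocol addresses associated with a site;
    a first determining module configured to determine a set of geographic features associated with the set of Internet Protocol addresses, wherein a geographic feature in the set of geographic features indicates the geographic location of a server associated with the site; and
    a second determining module configured to determine, based on the set of geographic features, whether the site is a bad site.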
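  12. The apparatus according to claim 11, wherein the first determining module comprises:
    a second obtaining module configured to obtain a correspondence table, the correspondence table indicating the correspondence between geographic locations and Internet Protocol address ranges; and
    a third determining module configured to determine the set of geographic features based on the correspondence table.
  13. The apparatus according to claim 11, wherein the second determining module comprises:
    a fourth determining module configured to determine that the site is not a bad site if the geographic locations indicated by the geographic features in the set of geographic features are all domestic geographic locations.
  14. The apparatus according to claim 11, wherein the second determining module comprises:
    a fifth determining module configured to determine that the geographic locations indicated by the geographic features in the set of geographic features include a foreign geographic location;
    a sixth determining module configured to determine whether the site is a normal foreign-language site; and
    a seventh determining module configured to determine that the site is not a bad site if the site is determined to be a normal foreign-language site.
  15. The apparatus according to claim 14, wherein the sixth determining module comprises:
    a third obtaining module configured to obtain a domain name associated with the site; and
    an eighth determining module configured to determine, based on the domain name, whether the site is a normal foreign-language site.
  16. The apparatus according to claim 15, wherein the eighth determining module comprises:
    a ninth determining module configured to determine, based on the domain name, a country associated with the site; and
    a tenth determining module configured to determine that the site is a normal foreign-language site if the country matches the foreign geographic location.
  17. The apparatus according to claim 14, wherein the sixth determining module comprises:
    an eleventh determining module configured to determine the ratio of the Chinese content included in the site to the entire content included in the site; and
    a twelfth determining module configured to determine that the site is a normal foreign-language site if the ratio is less than a threshold ratio.
  18. The apparatus according to claim 11, wherein the second determining module comprises:
    a thirteenth determining module configured to determine the number of foreign geographic locations among the geographic locations indicated by the geographic features in the set of geographic features; and
    a fourteenth determining module configured to determine, based on the number, whether the site is a bad site.
  19. The apparatus according to claim 18, wherein the fourteenth determining module comprises:
    a fifteenth determining module configured to determine that the site is a bad site if the number is greater than a threshold number.
  20. The apparatus according to claim 18, wherein the fourteenth determining module comprises:
    a sixteenth determining module configured to determine that the number is less than or equal to a threshold number; and
    a seventeenth determining module configured to determine, based on the content included in the site, whether the site is a bad site.
  21. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
  22. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
  23. A computer program product, comprising a computer program that, when executed by a processor, performs the method according to any one of claims 1-10.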
PCT/CN2022/086180 2021-08-30 2022-04-11 Site evaluation method, apparatus, electronic device, storage medium, and program product WO2023029486A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111007121.5 2021-08-30
CN202111007121.5A CN113783855B (zh) 2021-08-30 2021-08-30 Site evaluation method, apparatus, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2023029486A1 (zh) 2023-03-09

Family

ID=78840024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086180 WO2023029486A1 (zh) 2021-08-30 2022-04-11 Site evaluation method, apparatus, electronic device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN113783855B (zh)
WO (1) WO2023029486A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113783855B (zh) * 2021-08-30 2023-07-21 北京百度网讯科技有限公司 Site evaluation method, apparatus, electronic device, storage medium, and program product

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130297338A1 (en) * 2012-05-07 2013-11-07 Ingroove, Inc. Method for Evaluating the Health of a Website
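CN107231447A (zh) * 2016-03-23 2017-10-03 北大方正集团有限公司 Site region identification method and system
CN109522504A (zh) * 2018-10-18 2019-03-26 杭州安恒信息技术股份有限公司 Method for identifying counterfeit websites based on threat intelligence
CN109543118A (zh) * 2018-11-12 2019-03-29 中国人民解放军战略支援部队信息工程大学 Web landmark reliability evaluation method and apparatus based on multi-layer decision
EP3471045A1 (en) * 2017-10-12 2019-04-17 Oath Inc. Method and system for identifying fraudulent publisher networks
CN109787961A (zh) * 2018-12-24 2019-05-21 上海晶赞融宣科技有限公司 Method and apparatus for identifying fake traffic, storage medium, and server
WO2020155491A1 (zh) * 2019-01-31 2020-08-06 平安科技(深圳)有限公司 Method and apparatus for intelligent domain name resolution based on geographic location
CN113783855A (zh) * 2021-08-30 2021-12-10 北京百度网讯科技有限公司 Site evaluation method, apparatus, electronic device, storage medium, and program product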

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354800A (zh) * 2016-08-26 2017-01-25 中国互联网络信息中心 Bad website detection method based on multi-dimensional features
US10333944B2 (en) * 2016-11-03 2019-06-25 Microsoft Technology Licensing, Llc Detecting impossible travel in the on-premise settings
CN106503244A (zh) * 2016-11-08 2017-03-15 天津海量信息技术股份有限公司 Method for processing uniform resource locator similarity
CN109309668A (zh) * 2018-08-30 2019-02-05 浙江贰贰网络有限公司 Website verification method, apparatus, system, computer device, and storage medium
CN109450853B (zh) * 2018-10-11 2022-02-18 深圳市腾讯计算机系统有限公司 Malicious website determination method, apparatus, terminal, and server
CN113269394A (zh) * 2021-04-16 2021-08-17 合肥联宝信息技术有限公司 Data processing method, apparatus, device, and readable storage medium

Also Published As

Publication number Publication date
CN113783855B (zh) 2023-07-21
CN113783855A (zh) 2021-12-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862632

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE