CN107786604B - Method and device for determining content server - Google Patents


Publication number
CN107786604B
CN107786604B CN201610767748.3A CN201610767748A CN107786604B CN 107786604 B CN107786604 B CN 107786604B CN 201610767748 A CN201610767748 A CN 201610767748A CN 107786604 B CN107786604 B CN 107786604B
Authority
CN
China
Prior art keywords
urls
url
target
condition
sequencing result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610767748.3A
Other languages
Chinese (zh)
Other versions
CN107786604A (en
Inventor
槐昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
XFusion Digital Technologies Co Ltd
Original Assignee
Huawei Digital Technologies Suzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Digital Technologies Suzhou Co Ltd
Priority to CN201610767748.3A
Publication of CN107786604A
Application granted
Publication of CN107786604B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/101Server selection for load balancing based on network conditions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1008Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention discloses a method and a device for determining a content server, relates to the technical field of website detection, and aims to improve the efficiency of determining the content server. The method comprises the following steps: the gateway equipment acquires a website access record in a preset time period, wherein the website access record comprises N accessed URLs in the preset time period and access times corresponding to the N accessed URLs; the gateway equipment determines M target URLs in the N URLs according to the website access records, wherein the M target URLs are M URLs with the highest probability of being a content server in the N URLs; the gateway equipment accesses a Host corresponding to each URL in M target URLs, and receives M parameters returned by a plurality of target servers running the M hosts of the M target URLs, wherein one parameter comprises an HTTP return value and the number of bytes of return data; the gateway device determines the content servers in the M target URLs according to the M parameters.

Description

Method and device for determining content server
Technical Field
The invention relates to the technical field of website detection, in particular to a method and a device for determining a content server.
Background
A content server is a type of website that is used to provide services for other websites, such as storing pictures for other websites, analyzing viewer information, performing traffic rating and content filtering, etc., which are not typically displayed directly to a user.
After a user enters a web address in a browser of a terminal device and searches, the browser visits the website corresponding to that address and, along with it, a number of other websites (many of which are content-server-type websites). These accompanying websites provide advertisement content, access statistics, or pictures for the website that the user visits directly, and the user is not aware of them.
Because the content server is not a malicious website generally, when the website security detection is performed, if the content server can be filtered out, the efficiency of the website security detection can be improved.
At present, a method for determining whether a website is a content server specifically includes: when a user accesses a certain website through a browser of a terminal device, the browser initiates a request for accessing the website to a gateway device, the gateway device obtains the request, determines a Uniform Resource Locator (URL) of the website according to the request, records the URL, forwards the request to a target server (a server running the website), the target server responds to the request after receiving the request and returns a response message to the gateway device, and the gateway device judges whether the URL is a content server according to data contained in the response message. For example, when the data included in the response message is a session, blank content, or a 1 × 1 picture, the URL is determined to be the content server.
The above method of determining a content server is inefficient.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining a content server, which are used for improving the efficiency of determining the content server.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in a first aspect, a method for determining a content server is provided, including: the gateway equipment acquires a website access record in a preset time period, wherein the website access record comprises N Uniform Resource Locators (URLs) accessed in the preset time period and access times corresponding to the N URLs, and N is an integer greater than 0; the gateway equipment determines M target URLs in the N URLs according to the website access records, wherein the M target URLs are M URLs with the highest probability of being a content server in the N URLs, and M is an integer which is greater than 0 and less than or equal to N; the gateway equipment accesses a Host corresponding to each URL in M target URLs, and receives M parameters returned by a plurality of target servers running the M hosts of the M target URLs, wherein one parameter comprises a hypertext transfer protocol (HTTP) return value and the number of bytes of return data; the gateway device determines the content servers in the M target URLs according to the M parameters.
In the method provided by the first aspect, the URL with a low probability of being the content server in the URLs of the plurality of websites is excluded by adopting the access records of the websites, so that the number of URLs of which the gateway equipment needs to determine whether to be the content server is greatly reduced, and the efficiency of determining the content server by the gateway equipment is improved. When the website security detection is performed, the probability that the excluded URL is the content server is low, and even if the content server is included, the number of the excluded URLs is small, so that the efficiency of the website security detection cannot be greatly influenced.
With reference to the first aspect, in a first possible implementation manner, the determining, by the gateway device, M target URLs in N URLs according to the website access record includes: the gateway device determines a URL meeting a condition 1 and/or a condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
With reference to the first aspect, in a second possible implementation manner, the website access record further includes an identifier of a terminal device accessing N URLs within a preset time period, and the determining, by the gateway device, M target URLs in the N URLs according to the website access record includes: the gateway device determines, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
In the two possible implementations, a URL that satisfies the preset condition (the preset condition is one or more of condition 1, condition 2, or condition 3) is more likely to be a content server than a URL that does not satisfy the preset condition.
With reference to the first aspect, the first possible implementation manner, or the second possible implementation manner of the first aspect, in a third possible implementation manner, the determining, by the gateway device, content servers in M target URLs according to M parameters includes: when the HTTP return value in the parameter corresponding to one target URL is not 200, or the HTTP return value in the parameter corresponding to one target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, the gateway device determines that the target URL is the content server.
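The decision rule in this third implementation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described by the patent: the names `ProbeResult` and `is_content_server` are hypothetical, and the default of 64 bytes merely stands in for the "second preset threshold", whose value the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One parameter returned for a target URL (illustrative names)."""
    http_status: int  # HTTP return value
    body_bytes: int   # number of bytes of returned data


def is_content_server(result: ProbeResult, max_bytes: int = 64) -> bool:
    """Return True when the rule classifies the target URL as a content server.

    max_bytes stands in for the "second preset threshold"; 64 is an
    arbitrary example value, not a value taken from the source.
    """
    if result.http_status != 200:
        # Any return value other than 200 counts as a content server.
        return True
    # Return value 200: decide by the size of the returned data.
    return result.body_bytes <= max_bytes
```

For instance, a probe that returns 200 with a 43-byte body would be classified as a content server under this example threshold, while a 200 response carrying a 20,000-byte page would not.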
In a second aspect, a gateway device is provided, including: an acquiring unit, configured to acquire a website access record in a preset time period, where the website access record includes N Uniform Resource Locators (URLs) accessed in the preset time period and the number of accesses corresponding to the N URLs, and N is an integer greater than 0; a first determining unit, configured to determine M target URLs from the N URLs according to the website access record, where the M target URLs are the M URLs with the highest probability of being a content server among the N URLs, and M is an integer greater than 0 and less than or equal to N; a transceiver unit, configured to access the Host corresponding to each of the M target URLs and receive M parameters returned by a plurality of target servers running the M Hosts of the M target URLs, where one parameter includes a Hypertext Transfer Protocol (HTTP) return value and the number of bytes of returned data; and a second determining unit, configured to determine the content servers among the M target URLs according to the M parameters.
Each unit in the gateway device provided in the second aspect is configured to execute the method provided in the first aspect, and therefore beneficial effects of the gateway device may refer to beneficial effects of the method provided in the first aspect, which are not described herein again.
With reference to the second aspect, in a first possible implementation manner, the first determining unit is specifically configured to: determining a URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
With reference to the second aspect, in a second possible implementation manner, the website access record further includes identifiers of terminal devices accessing N URLs within a preset time period, and the first determining unit is specifically configured to: determining, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
In the two possible implementations, a URL that satisfies the preset condition (the preset condition is one or more of condition 1, condition 2, or condition 3) is more likely to be a content server than a URL that does not satisfy the preset condition.
With reference to the second aspect, the first possible implementation manner or the second possible implementation manner of the second aspect, in a third possible implementation manner, the second determining unit is specifically configured to: and when the HTTP return value in the parameter corresponding to the target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, determining that the target URL is the content server.
In a third aspect, a gateway device is provided, including: a memory, a processor, and a transceiver, where the memory is configured to store code, and the processor is configured to perform the following actions in accordance with the code: acquiring a website access record in a preset time period, where the website access record includes N Uniform Resource Locators (URLs) accessed in the preset time period and the number of accesses corresponding to the N URLs, and N is an integer greater than 0; and determining M target URLs from the N URLs according to the website access record, where the M target URLs are the M URLs with the highest probability of being a content server among the N URLs, and M is an integer greater than 0 and less than or equal to N. The transceiver is configured to access the Host corresponding to each of the M target URLs and receive M parameters returned by a plurality of target servers running the M Hosts of the M target URLs, where one parameter includes a Hypertext Transfer Protocol (HTTP) return value and the number of bytes of returned data. The processor is further configured to determine the content servers among the M target URLs according to the M parameters.
Each device in the gateway device provided in the third aspect is configured to execute the method provided in the first aspect, and therefore, beneficial effects of the gateway device may refer to beneficial effects of the method provided in the first aspect, which are not described herein again.
With reference to the third aspect, in a first possible implementation manner, the processor is specifically configured to: determining a URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
With reference to the third aspect, in a second possible implementation manner, the website access record further includes identifiers of terminal devices accessing N URLs within a preset time period, and the processor is specifically configured to: determining, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
In the two possible implementations, a URL that satisfies the preset condition (the preset condition is one or more of condition 1, condition 2, or condition 3) is more likely to be a content server than a URL that does not satisfy the preset condition.
With reference to the third aspect, the first possible implementation manner, or the second possible implementation manner of the third aspect, in a third possible implementation manner, the processor is specifically configured to: and when the HTTP return value in the parameter corresponding to the target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, determining that the target URL is the content server.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating a network system according to an embodiment of the present invention;
Fig. 2 is a schematic composition diagram of a gateway device according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for determining a content server according to an embodiment of the present invention;
Fig. 4 is a flowchart of a method for determining a content server according to another embodiment of the present invention;
Fig. 5 is a schematic composition diagram of a gateway device according to an embodiment of the present invention;
Fig. 6 is a schematic composition diagram of another gateway device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. "Plurality" herein means two or more.
An embodiment of the present invention provides a network system for implementing the method provided in the embodiment of the present invention, as shown in fig. 1, including: the system comprises one or more terminal devices, a gateway device connected with the one or more terminal devices, and one or more target servers connected with the gateway device. The user may access the website through a terminal device, the terminal device may be a computer, a mobile phone, or a tablet computer, the gateway device is disposed at an exit of a terminal device network, and is configured to process (for example, message filtering or message detecting) and/or forward a message communicated between the terminal device and a target server, the gateway device may specifically be a router or a firewall, and one or more websites may be run on one target server. It should be noted that, in order to make the description clearer, in the description of the embodiment of the present invention, the content server refers to a type of website, which is used to provide services for other websites, and the target server refers to a hardware carrier running the website.
The device for executing the method provided by the embodiment of the present invention may be a gateway device, and the hardware architecture composition of the gateway device may refer to fig. 2, including: a network interface, a memory connected to the network interface, and a Central Processing Unit (CPU) connected to the memory.
The network interface can be divided into an input interface and an output interface, where the input interface is used to input network data to the gateway device and the output interface is used to output network data from the gateway device.
The CPU consists of an arithmetic unit and a controller. The arithmetic unit mainly processes network data; the controller parses instructions and, as each instruction requires, sends control signals to all parts of the system in an orderly manner so that the whole system works in coordination. The memory can store network data and read the stored network data according to instructions. The CPU may specifically be an ARM (Advanced RISC Machines) processor, a MIPS (Microprocessor without Interlocked Pipelined Stages) processor, an x86 processor, or the like.
An embodiment of the present invention provides a method for determining a content server, as shown in fig. 3, including:
301. The gateway device acquires a website access record in a preset time period, where the website access record includes the N URLs accessed in the preset time period and the number of accesses corresponding to each of the N URLs, and N is an integer greater than 0.
The length of the preset time period may be set according to an actual application scenario or a requirement, for example, the preset time period may be 5 minutes or 10 minutes, and the length of the preset time period is not specifically limited in the embodiment of the present invention.
Specifically, in the website access record, one URL represents one website, and one URL corresponds to the number of times of access of the URL. The gateway device may record all URLs visited within a preset time period, and then count the number of visits of each URL to obtain a website visit record. For example, the website access record may be as shown in table 1, wherein the number of accesses of URL1 is 3, the number of accesses of URL2 is 9, and the number of accesses of URL3 is 7.
TABLE 1
URL  | Number of accesses
URL1 | 3
URL2 | 9
URL3 | 7
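As a minimal sketch of how such a record could be built, the Python snippet below counts the URLs seen in one time window. It assumes the gateway device has already collected the visited URLs as a list; the function name `build_access_record` and the sample log are hypothetical and only reproduce the counts of Table 1.

```python
from collections import Counter


def build_access_record(visited_urls):
    """Return a mapping {url: number of accesses} for one time window."""
    return Counter(visited_urls)


# Hypothetical log for one preset time period, matching Table 1.
log = ["URL2"] * 9 + ["URL1"] * 3 + ["URL3"] * 7
record = build_access_record(log)
```

A `Counter` is used here simply because it is the standard-library way to tally occurrences; any key-to-count mapping would serve the same purpose.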
In order to make the way of acquiring the URL by the gateway device clearer, a brief description is first made of a process of accessing the website by the user. When a user accesses a website through terminal equipment, the terminal equipment sends a request for accessing the website to gateway equipment, the gateway equipment processes the request according to the function provided by the gateway equipment and then sends the request to a target server running the website, the target server returns a response message to the gateway equipment after responding to the request, and the gateway equipment detects data contained in the response message according to the provided function and then returns the data to the terminal equipment.
The gateway device may obtain the URL of a website from the request, received from the terminal device, to access that website. Specifically, the request may be an HTTP message whose header includes a Host field and a Path field, and the URL of the website is obtained by concatenating the contents of the Host field and the Path field in that order. For example, if the content of the Host field is s3.tbcdn.com and the content of the Path field is get/img/3.js, the URL of the website is s3.tbcdn.com/get/img/3.js. Because the method provided by the embodiment of the invention determines content servers from actual requests to access websites, it can adapt to content servers that change continuously or are newly added.
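The concatenation described above can be sketched as follows. The function name `url_from_request` is hypothetical. Note one assumption: in an actual HTTP/1.1 request the path normally appears in the request line and begins with a slash, so the sketch accepts the path with or without a leading slash and produces the scheme-less form used in the example.

```python
def url_from_request(host: str, path: str) -> str:
    """Join the Host value and the path into a scheme-less URL string."""
    host = host.strip().rstrip("/")
    path = path.strip().lstrip("/")
    # A request for the site root yields just the Host.
    return f"{host}/{path}" if path else host


example = url_from_request("s3.tbcdn.com", "get/img/3.js")
```

With the values from the text, `example` is `s3.tbcdn.com/get/img/3.js`; a request whose path is empty or `/` yields just the Host.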
302. The gateway device determines M target URLs in the N URLs according to the website access records, wherein the M target URLs are M URLs with the highest probability of being a content server in the N URLs, and M is an integer which is larger than 0 and smaller than or equal to N.
Optionally, the step 302 may include, in a specific implementation: and determining the URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL.
Optionally, the website access record further includes identifiers of the terminal devices that accessed the N URLs within the preset time period. In this case, step 302 may include, in a specific implementation: determining, as a target URL, a URL that satisfies one or more of condition 1, condition 2, and condition 3 among the N URLs. The identifier of a terminal device uniquely identifies that terminal device and may specifically be its Internet Protocol (IP) address or Media Access Control (MAC) address. The gateway device may obtain the identifier of a terminal device accessing a website while establishing the connection with that terminal device. Specifically, for each URL, the website access record contains one identifier for each distinct terminal device that accessed the URL within the preset time period.
In the above two alternative methods, condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained after sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained after sequencing the Host corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
The value of X may be determined according to the actual application scenario, for example, when the value of N is large, the value of X may be set to be large, and when the value of N is small, the value of X may be set to be small, for example, X may be 80 or 50. The value of Y is determined in the same way. The first preset threshold may be determined according to an actual application scenario, which is not specifically limited in the embodiment of the present invention, for example, when determining a content server in a website accessed by a terminal device in an enterprise, the first preset threshold may be set to 10% or 20% of the total number of the terminal devices in the enterprise.
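Condition 3 with a threshold expressed as a share of the enterprise's terminal devices might be checked as in the sketch below. The function name, the `(url, device_id)` input format, and the default of 10 percent are assumptions made for illustration; the source leaves the first preset threshold to the application scenario.

```python
from collections import defaultdict


def urls_meeting_condition_3(accesses, total_devices, min_percent=10):
    """Return the URLs accessed by enough distinct terminal devices.

    accesses: iterable of (url, device_id) pairs seen in the time window,
              where device_id is e.g. an IP or MAC address.
    total_devices: total number of terminal devices in the enterprise.
    min_percent: stand-in for the first preset threshold, as a percentage.
    """
    devices_per_url = defaultdict(set)
    for url, device_id in accesses:
        devices_per_url[url].add(device_id)  # sets ignore repeat visits
    return {
        url
        for url, ids in devices_per_url.items()
        # Integer comparison avoids floating-point rounding of the threshold.
        if len(ids) * 100 >= min_percent * total_devices
    }
```

Counting distinct identifiers rather than raw accesses matters here: a single device that reloads a page many times should not push a URL over the threshold.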
A URL includes a Host and a Path. For example, when the URL is s3.tbcdn.com/get/img/3.js, the Host of the URL is s3.tbcdn.com, the Path is get/img/3.js, and the first-level domain name corresponding to the URL is tbcdn.com. When a URL includes only a Host, the Host corresponding to the URL is the URL itself.
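The decomposition of a URL into Host, Path, and first-level domain name might be sketched as follows. The sketch assumes scheme-less URLs as written in the example, and it assumes that the first-level domain name is the last two labels of the Host; that naive rule does not handle multi-label suffixes such as .com.cn, and the function names are hypothetical.

```python
def split_url(url):
    """Split a scheme-less URL into (host, path).

    The path is the empty string when the URL consists of a Host only.
    """
    host, _, path = url.partition("/")
    return host, path

def first_level_domain(host):
    """Naive rule (an assumption): keep the last two dot-separated labels."""
    labels = host.split(".")
    return ".".join(labels[-2:])

host, path = split_url("s3.tbcdn.com/get/img/3.js")
domain = first_level_domain(host)
```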
Specifically, a content server is accessed more often than an ordinary website (that is, a non-content server), so the more often the first-level domain name or Host of a URL is accessed, the more likely the URL is a content server. Users usually access a commonly used website (for example, Baidu or Taobao) directly through its Host, and such commonly used websites are not content servers, so if the Host of a URL is never accessed alone, the URL is likely to be a content server. A content server is not a website that the user accesses directly but one that is accessed incidentally when the user accesses another website, and different users accessing different websites may incidentally access the same content server, so the more terminal device identifiers are recorded as accessing a URL, the more likely the URL is a content server.
Step 302 is illustrated below with a specific example in which condition 1 is: the first-level domain name corresponding to the URL is in the top X% of the first sorting result. If N is 10 and the 10 URLs are: s3.tbcdn.com/get/img/3.js, da.so.com/q/136614, s3.tbcdn.com, china.baidu.com/query 64, wenwen.sogou.com/query, mingyi.sogou.com/mingyiquery, pic.tbcdn.com/p ═ 06050, china.baidu.com, s3.tbcdn.com/query 64, and wenwen.sogou.com/?query, then the first-level domain names of the 10 URLs are respectively: tbcdn.com, so.com, tbcdn.com, baidu.com, sogou.com, sogou.com, tbcdn.com, baidu.com, tbcdn.com, and sogou.com. The number of times each first-level domain name corresponding to the 10 URLs was accessed is shown in Table 2.
Table 2
First-level domain name | Number of accesses
tbcdn.com | 4
so.com | 1
baidu.com | 2
sogou.com | 3
The first-level domain names corresponding to the N URLs are then sorted in descending order of access count, which gives: tbcdn.com, sogou.com, baidu.com, so.com. When X is 50, the first-level domain names in the top 50% of the sorting result are tbcdn.com and sogou.com.
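The sorting and top-X% selection can be reproduced with a short Python sketch using the counts from Table 2. Two details are assumptions that the text does not specify: the number of entries in the top X% is rounded up, and ties are broken alphabetically. The function name is hypothetical.

```python
import math

def top_percent(access_counts, x):
    """Return the names ranked in the top x% by access count, largest first."""
    ranked = sorted(access_counts, key=lambda name: (-access_counts[name], name))
    keep = math.ceil(len(ranked) * x / 100)  # rounding up is an assumption
    return ranked[:keep]

# Access counts of the first-level domain names, taken from Table 2.
table_2 = {"tbcdn.com": 4, "so.com": 1, "baidu.com": 2, "sogou.com": 3}

full_ranking = top_percent(table_2, 100)
top_half = top_percent(table_2, 50)
```

The same function applies to the second sorting result if it is given per-Host access counts and Y instead of X.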
In the example based on Table 2, the Hosts s3.tbcdn.com and china.baidu.com are each accessed alone, so the URLs whose Host is not accessed alone are: da.so.com/q/136614, wenwen.sogou.com/ques, mingyi.sogou.com/mingyiquery, pic.tbcdn.com/p ═ 06050, and wenwen.sogou.com/?query.
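Condition 2 can be sketched as a two-pass check: first collect the Hosts that appear in the record as a bare URL, then keep the URLs whose Host is not in that set. The sketch below uses only a subset of the example URLs, and the function names are hypothetical.

```python
def hosts_accessed_alone(urls):
    """Hosts that appear in the record as a bare URL with no Path."""
    return {u.rstrip("/") for u in urls if "/" not in u.rstrip("/")}

def meets_condition_2(urls):
    """Keep the URLs whose Host never appears alone in the record."""
    alone = hosts_accessed_alone(urls)
    return [u for u in urls if u.partition("/")[0] not in alone]

# A subset of the example URLs.
urls = [
    "s3.tbcdn.com/get/img/3.js",
    "da.so.com/q/136614",
    "s3.tbcdn.com",
    "mingyi.sogou.com/mingyiquery",
    "china.baidu.com",
]
alone = hosts_accessed_alone(urls)
not_alone = meets_condition_2(urls)
```

Note that s3.tbcdn.com/get/img/3.js is excluded even though it has a Path, because its Host s3.tbcdn.com is also accessed alone.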
In this example, if the URLs among the 10 URLs that satisfy both condition 1 and condition 2 are determined as the target URLs, the target URLs are:
wenwen.sogou.com/?query, mingyi.sogou.com/mingyiquery, wenwen.sogou.com/ques, and pic.tbcdn.com/p ═ 06050.
303. The gateway device accesses the Host corresponding to each of the M target URLs, and receives M parameters returned by the target servers running the Hosts of the M target URLs, where each parameter includes a Hypertext Transfer Protocol (HTTP) return value and the number of bytes of returned data.
Specifically, the gateway device may obtain the parameter corresponding to a target URL from the response message returned by the target server running the Host of that target URL. The response message received by the gateway device may be an HTTP message that includes an HTTP return value and the returned data. The response message may also include a field indicating the number of bytes of the returned data, from which the gateway device may obtain the byte count.
Specifically, when the HTTP return value is 200, it indicates that the target server has successfully returned the data requested by the gateway device, and the data included in the response message is the data requested by the gateway device. When the HTTP return value is not 200, it indicates that the target server has not successfully returned the data requested by the gateway device.
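A minimal sketch of how step 303 might collect one parameter with Python's standard `http.client` module is given below. Several points are assumptions rather than part of the described method: the gateway issues a plain-HTTP `GET /` request, the byte-count field is taken to be the `Content-Length` header, and the body length is used as a fallback when that header is absent. The function names are hypothetical.

```python
import http.client

def parameter_from_response(status, content_length_header, body):
    """Build the (HTTP return value, byte count) parameter from a response.

    Prefers the Content-Length header (assumed to be the byte-count field)
    and falls back to the length of the body actually read.
    """
    if content_length_header is not None and content_length_header.isdigit():
        return status, int(content_length_header)
    return status, len(body)

def probe(host, timeout=5.0):
    """Send GET / to the Host and return its parameter (real network I/O)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        body = resp.read()
        return parameter_from_response(
            resp.status, resp.getheader("Content-Length"), body
        )
    finally:
        conn.close()

# The parsing step can be exercised without touching the network.
with_header = parameter_from_response(200, "42", b"")
without_header = parameter_from_response(404, None, b"not found")
```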
304. The gateway device determines the content servers in the M target URLs according to the M parameters.
Step 304 may be specifically implemented as follows: when the HTTP return value in the parameter corresponding to a target URL is not 200, or the HTTP return value is 200 and the number of bytes of returned data is less than or equal to a second preset threshold, the gateway device determines that the target URL is a content server.
The second preset threshold may be an empirical value. It is generally not greater than 100 and may be set to a few bytes or a few tens of bytes based on practical experience.
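The decision rule of step 304 reduces to a small predicate. In the sketch below, the function name and the default threshold of 100 bytes are illustrative assumptions; the text only states that the threshold is generally not greater than 100.

```python
def is_content_server(http_status, returned_bytes, second_threshold=100):
    """Apply the step 304 rule to one (HTTP return value, byte count) parameter."""
    if http_status != 200:
        # The Host did not successfully return data when accessed directly.
        return True
    # The Host answered, but with at most a negligible amount of data.
    return returned_bytes <= second_threshold

verdicts = [
    is_content_server(404, 0),      # non-200 return value
    is_content_server(200, 50),     # 200 with a tiny response
    is_content_server(200, 5000),   # 200 with a full page
]
```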
An embodiment of the present invention further provides a method for determining a content server, which illustrates the method shown in fig. 3 by way of example. In this example, a target URL must satisfy condition 1, condition 2, and condition 3, where condition 1 is: the first-level domain name corresponding to the URL is in the top X% of the first sorting result. For explanations of content related to the foregoing embodiment, refer to the description above. As shown in fig. 4, the method includes:
401. The gateway device acquires the website access record in a preset time period.
The website access record includes the N URLs accessed within the preset time period, the access counts corresponding to the N URLs, and the identifiers of the terminal devices that accessed the N URLs within the preset time period.
402. The gateway device sorts the first-level domain names corresponding to the N URLs in descending order of access count to obtain a first sorting result.
403. The gateway device determines whether the first-level domain name corresponding to each of the N URLs is in the top X% of the first sorting result.
If yes, go to step 404, otherwise go to step 409.
404. The gateway device determines whether the Host corresponding to the URL is not accessed alone.
If yes, go to step 405, otherwise go to step 409.
405. The gateway device determines whether the number of distinct identifiers of the terminal devices accessing the URL is greater than or equal to a first preset threshold.
If yes, go to step 406, otherwise go to step 409.
406. The gateway device accesses the Host corresponding to the URL and receives the parameter returned by the target server running the Host.
Wherein the parameters comprise an HTTP return value and the number of bytes of return data.
407. The gateway device determines whether the HTTP return value is not 200 or whether the HTTP return value is 200 and the number of bytes of returned data is less than or equal to a second preset threshold.
If yes, go to step 408, otherwise go to step 409.
408. The URL is determined to be a content server.
409. It is determined that the URL is not a content server.
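Steps 401 to 409 can be chained as in the following Python sketch. It is an illustration under several assumptions: the access record is a list of (device identifier, URL) pairs, the first-level domain name is the last two labels of the Host, the top-X% cut is rounded up, and the network access of step 406 is injected as a `probe` callable so that the flow can be tried without a network. All names and the sample record are hypothetical.

```python
import math
from collections import Counter, defaultdict

def host_of(url):
    return url.partition("/")[0]

def domain_of(url):
    # Naive two-label rule; an assumption, not a definition from the text.
    return ".".join(host_of(url).split(".")[-2:])

def classify(record, x, first_threshold, second_threshold, probe):
    """Return {url: True if judged a content server, else False}."""
    urls = sorted({url for _, url in record})
    # Step 402: rank first-level domain names by access count, largest first.
    counts = Counter(domain_of(url) for _, url in record)
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    top = set(ranked[:math.ceil(len(ranked) * x / 100)])
    alone = {u for u in urls if "/" not in u}
    devices = defaultdict(set)
    for device_id, url in record:
        devices[url].add(device_id)

    result = {}
    for url in urls:
        is_candidate = (
            domain_of(url) in top                        # step 403
            and host_of(url) not in alone                # step 404
            and len(devices[url]) >= first_threshold     # step 405
        )
        if is_candidate:
            status, nbytes = probe(host_of(url))         # step 406
            is_candidate = status != 200 or nbytes <= second_threshold  # step 407
        result[url] = is_candidate                       # step 408 or 409
    return result

probed = []

def fake_probe(host):
    """Stand-in for the real HTTP access; records which Hosts were probed."""
    probed.append(host)
    return 200, 43

record = [
    ("d1", "pic.tbcdn.com/img/1.png"),
    ("d2", "pic.tbcdn.com/img/1.png"),
    ("d1", "s3.tbcdn.com"),
    ("d1", "da.so.com/q/136614"),
]
verdicts = classify(record, x=50, first_threshold=2,
                    second_threshold=100, probe=fake_probe)
```

Only URLs that pass all three record-based checks are probed, which mirrors the ordering of steps 403 to 406: the comparatively expensive network access is reserved for the remaining candidates.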
As can be seen from the embodiment described with reference to fig. 4, deploying a program that implements the method shown in fig. 4 in the gateway device allows the gateway device to detect online whether a URL is a content server.
According to the method provided by the embodiment of the present invention, the website access record is used to exclude, from the URLs of a plurality of websites, the URLs that have a low probability of being a content server. This greatly reduces the number of URLs for which the gateway device must determine whether they are content servers, and thereby improves the efficiency with which the gateway device determines content servers. During website security detection, the excluded URLs have a low probability of being content servers; even if they include some content servers, the number is small, so the efficiency of website security detection is not significantly affected.
An embodiment of the present invention further provides a gateway device 50, as shown in fig. 5, including:
an obtaining unit 501, configured to obtain a website access record in a preset time period, where the website access record includes N URLs accessed in the preset time period and access counts corresponding to the N URLs, and N is an integer greater than 0;
a first determining unit 502, configured to determine M target URLs from the N URLs according to the website access record, where the M target URLs are M URLs from the N URLs that have a highest probability of being a content server, and M is an integer greater than 0 and less than or equal to N;
a transceiving unit 503, configured to access the Host corresponding to each of the M target URLs, and receive M parameters returned by the target servers running the Hosts of the M target URLs, where each parameter includes an HTTP return value and the number of bytes of returned data;
a second determining unit 504, configured to determine, according to the M parameters, content servers in the M target URLs.
Optionally, the first determining unit 502 is specifically configured to: determining a URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
Optionally, the website access record further includes identifiers of terminal devices accessing the N URLs within the preset time period, and the first determining unit 502 is specifically configured to: determining, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
Optionally, the second determining unit 504 is specifically configured to: when the HTTP return value in the parameter corresponding to a target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold, determine that the target URL is a content server.
Each unit in the gateway device 50 provided in the embodiment of the present invention is configured to execute the method, and therefore, beneficial effects of the gateway device 50 may refer to beneficial effects of the method, which are not described herein again.
An embodiment of the present invention further provides a gateway device 60, as shown in fig. 6, including: a memory 601, a processor 602, a transceiver 603, and a bus system 604. The memory 601 is used for storing code. According to the code, the processor 602 executes steps 301, 302, and 304 in the method shown in fig. 3, and the transceiver 603 executes step 303 in the method shown in fig. 3. The processor 602 further executes steps 401 to 405 and steps 407 to 409 in the method shown in fig. 4, and the transceiver 603 further executes step 406 in the method shown in fig. 4.
The memory 601, the processor 602, and the transceiver 603 are coupled via a bus system 604, wherein the memory 601 may comprise a random access memory, and may further comprise a non-volatile memory, such as at least one disk memory. The bus system 604 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus system 604 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The transceiving unit 503 in fig. 5 may be the transceiver 603, and the remaining units may be the processor 602. The remaining units may be embedded in, or independent of, the processor of the gateway device in hardware form, or may be stored in the memory of the gateway device in software form so that the processor can invoke them and perform the operations corresponding to the units. The processor may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
Each component in the gateway device 60 provided in the embodiment of the present invention is configured to execute the method, and therefore, for the beneficial effects of the gateway device 60, refer to the beneficial effects of the method, which are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a division by logical function, and other divisions are possible in actual implementation. For instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or two or more modules may be integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a form of hardware and a software functional module.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Claims (12)

1. A method for determining a content server, comprising:
the method comprises the steps that a gateway device obtains a website access record in a preset time period, wherein the website access record comprises N Uniform Resource Locators (URLs) accessed in the preset time period and access times corresponding to the N URLs, and N is an integer larger than 0;
the gateway equipment determines M target URLs in the N URLs according to the website access records, wherein the M target URLs are M URLs with the highest probability of being content servers in the N URLs, and M is an integer which is greater than 0 and less than or equal to N;
the gateway equipment accesses a Host corresponding to each URL in the M target URLs, and receives M parameters returned by a plurality of target servers running the M hosts of the M target URLs, wherein one parameter comprises a hypertext transfer protocol (HTTP) return value and the number of bytes of return data;
and the gateway equipment determines the content servers in the M target URLs according to the M parameters.
2. The method of claim 1, wherein the gateway device determines M target URLs among the N URLs according to the website visitation record, including:
the gateway device determines a URL satisfying condition 1 and/or condition 2 among the N URLs as a target URL, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
3. The method according to claim 1, wherein the website access record further includes an identifier of a terminal device accessing the N URLs within the preset time period, and the gateway device determines M target URLs among the N URLs according to the website access record, including:
the gateway device determines, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
4. The method according to any one of claims 1-3, wherein the gateway device determines the content servers in the M target URLs according to the M parameters, comprising:
and when the HTTP return value in the parameter corresponding to the target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, the gateway equipment determines that the target URL is the content server.
5. A gateway device, comprising:
an obtaining unit, configured to obtain a website access record in a preset time period, wherein the website access record comprises N Uniform Resource Locators (URLs) accessed in the preset time period and the number of access times corresponding to the N URLs, wherein N is an integer greater than 0;
a first determining unit, configured to determine M target URLs from the N URLs according to the website access record, where the M target URLs are M URLs from the N URLs that have a highest probability of being a content server, and M is an integer greater than 0 and less than or equal to N;
a transceiving unit, configured to access the Host corresponding to each URL in the M target URLs and receive M parameters returned by a plurality of target servers operating the M hosts of the M target URLs, wherein one parameter comprises a hypertext transfer protocol (HTTP) return value and the number of bytes of return data;
a second determining unit, configured to determine, according to the M parameters, content servers in the M target URLs.
6. The gateway device according to claim 5, wherein the first determining unit is specifically configured to:
determining a URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
7. The gateway device according to claim 5, wherein the website access record further includes an identifier of a terminal device accessing the N URLs within the preset time period, and the first determining unit is specifically configured to:
determining, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
8. The gateway device according to any one of claims 5 to 7, wherein the second determining unit is specifically configured to:
and when the HTTP return value in the parameter corresponding to the target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, determining that the target URL is the content server.
9. A gateway device, comprising: a memory, a processor, and a transceiver, the memory to store code, the processor to perform the following actions in accordance with the code:
acquiring a website access record in a preset time period, wherein the website access record comprises N Uniform Resource Locators (URLs) accessed in the preset time period and access times corresponding to the N URLs, and N is an integer greater than 0;
determining M target URLs in the N URLs according to the website access record, wherein the M target URLs are M URLs with the highest probability of being a content server in the N URLs, and M is an integer which is greater than 0 and less than or equal to N;
the transceiver is used for accessing a Host corresponding to each URL in the M target URLs and receiving M parameters returned by a plurality of target servers operating the M hosts of the M target URLs, wherein one parameter comprises a hypertext transfer protocol (HTTP) return value and a number of bytes of return data;
the processor is further configured to determine content servers in the M target URLs according to the M parameters.
10. The gateway device of claim 9, wherein the processor is specifically configured to:
determining a URL meeting the condition 1 and/or the condition 2 in the N URLs as a target URL, wherein the condition 1 is as follows: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently, and X, Y are all integers greater than 0 and less than 100.
11. The gateway device according to claim 9, wherein the website access record further includes an identifier of a terminal device accessing the N URLs within the preset time period, and the processor is specifically configured to:
determining, as a target URL, a URL that satisfies one or more of conditions 1, 2, and 3 among the N URLs, where condition 1 is: the first-level domain name corresponding to the URL is located at the top X% in the first sequencing result, the first sequencing result is obtained by sequencing the first-level domain names corresponding to the N URLs according to the sequence of the access times from large to small, or the Host corresponding to the URL is located at the top Y% in the second sequencing result, and the second sequencing result is obtained by sequencing the hosts corresponding to the N URLs according to the sequence of the access times from large to small; the condition 2 is: the Host corresponding to the URL is not accessed independently; the condition 3 is: the number of the identifications of the terminal devices accessing the URL is greater than or equal to a first preset threshold, and X, Y are integers which are greater than 0 and smaller than 100.
12. The gateway device according to any of claims 9-11, wherein the processor is specifically configured to:
and when the HTTP return value in the parameter corresponding to the target URL is not 200, or the HTTP return value in the parameter corresponding to the target URL is 200 and the number of bytes of returned data is less than or equal to a second preset threshold value, determining that the target URL is the content server.
CN201610767748.3A 2016-08-30 2016-08-30 Method and device for determining content server Active CN107786604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610767748.3A CN107786604B (en) 2016-08-30 2016-08-30 Method and device for determining content server


Publications (2)

Publication Number Publication Date
CN107786604A CN107786604A (en) 2018-03-09
CN107786604B true CN107786604B (en) 2020-04-28

Family

ID=61440789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610767748.3A Active CN107786604B (en) 2016-08-30 2016-08-30 Method and device for determining content server

Country Status (1)

Country Link
CN (1) CN107786604B (en)


Also Published As

Publication number Publication date
CN107786604A (en) 2018-03-09


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20211222

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee after: HUAWEI TECHNOLOGIES Co.,Ltd.

Address before: 215123 Building A3, Creative Industry Park, 328 Xinghu Street, Suzhou Industrial Park, Jiangsu Province

Patentee before: Huawei digital technology (Suzhou) Co.,Ltd.

Effective date of registration: 20211222

Address after: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Patentee after: Super fusion Digital Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.