WO2006027590A1 - System, method and apparatus for use in monitoring or controlling internet access - Google Patents

System, method and apparatus for use in monitoring or controlling internet access

Info

Publication number
WO2006027590A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
url
cache
specified url
host
Prior art date
Application number
PCT/GB2005/003465
Other languages
English (en)
Inventor
John Sinclair
Ian James Pettener
Alistair Nash
Original Assignee
Surfcontrol Plc
Priority date
Filing date
Publication date
Priority claimed from GB0420024A (GB2418999A)
Application filed by Surfcontrol Plc
Priority to CA002577259A (CA2577259A1)
Publication of WO2006027590A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/954: Navigation, e.g. using categorised browsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574: Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/2866: Architectures; Arrangements
    • H04L67/289: Intermediate processing functionally located close to the data consumer application, e.g. in same machine, in same home or in same sub-network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/50: Network services
    • H04L67/56: Provisioning of proxy services
    • H04L67/561: Adding application-functional data or data for application control, e.g. adding metadata

Definitions

  • the present invention relates in general to a system, method and apparatus for use in monitoring or controlling Internet access.
  • the present invention relates to a system, method and apparatus for categorising Uniform Resource Locators (URLs) during Internet access.
  • the Internet is a global interconnection of computers and computer networks.
  • One of the great benefits of the Internet is that many millions of users have access to shared information of the World Wide Web, whereby pages of text and graphic information in HTML or other formats are transmitted by a Hyper Text Transfer Protocol (HTTP).
  • HTTP Hyper Text Transfer Protocol
  • Each web page has a unique address, known as a Uniform Resource Locator (URL).
  • URL Uniform Resource Locator
  • RFCs Requests for Comments
  • RFC760 Internet Protocol
  • RFC1738 Uniform Resource Locators
  • a Local Area Network LAN
  • Proxy Server which receives and services URL requests from within the LAN by communicating with the Internet.
  • Some of the client computers in this LAN environment may have relatively limited resources, such as a dumb terminal or diskless workstation.
  • Another example is a Personal Digital Assistant or other handheld computing device.
  • it is desired to provide an apparatus, method and system for monitoring or controlling internet access which is ideally simple, fast and reliable, in this LAN environment.
  • ISP Internet Service Provider
  • the connection is established through dedicated hardware of an Internet gateway appliance such as a modem or a router.
  • processor requirements, memory requirements, and storage requirements are directly contrary to known approaches for monitoring or controlling Internet access.
  • it is desired to provide an apparatus, method and system for monitoring or controlling internet access which is ideally simple, fast and reliable, when using an Internet gateway appliance.
  • Another emerging need relates to Internet appliances which are created to perform a specific dedicated function whilst also being connected to the Internet.
  • One example is a web TV for displaying audiovisual signals.
  • Such Internet appliances are generally intended for use by consumers who have little or no technical knowledge, by providing a simple and easy to use set of controls as opposed to the fully controllable interface of a regular computer. Again, most Internet appliances are designed to minimise processor, memory and storage requirements.
  • it is desired to provide an apparatus, method and system for monitoring or controlling internet access which is simple, fast and reliable, when using an Internet appliance.
  • An aim of the present invention is to address the disadvantages and problems of the prior art, as discussed above or elsewhere.
  • a method of categorising Uniform Resource Locators (URLs) during Internet access comprising the steps of: receiving a URL request denoting a specified URL; generating a request message to request categorisation of the specified URL; receiving a reply message denoting a category for the specified URL amongst a predetermined set of categories; adding the specified URL and the category to a category cache; and in a second or subsequent instance of a URL request with respect to the specified URL, determining the category of the specified URL from the category cache.
  • URLs Uniform Resource Locators
  • a method for use in controlling or monitoring Internet access at a client device by categorising Uniform Resource Locators comprising the steps of: receiving a specified URL; searching a category cache held at the client device using the specified URL as a search key, and returning a category code associated with the specified URL when a match is found for the specified URL; and generating a request message to request a category code for the specified URL, when a match is not found for the specified URL.
  • a system for use in controlling or monitoring of Internet access by categorising Uniform Resource Locators comprising: a client device arranged to monitor or control Internet access according to a category code of a specified URL, and including a categorisation module to provide the category code for the specified URL from a category cache stored at the client device or else generate a request message to request categorisation of the specified URL; and a categorisation server coupled to communicate with the client device and arranged to receive the request message and to send a reply message identifying a category code for the specified URL.
  • a client device comprising: an interface module arranged to present a URL categorisation function, wherein the interface module is arranged to receive a specified URL from a client software and to return a category code; a category cache holding a plurality of stored URLs and associated category codes, such that matching the specified URL against one of the stored URLs provides the category code; and a communication module arranged to send an outgoing request message to a categorisation server when there is no match in the category cache and to receive and buffer incoming data including a corresponding reply message, wherein the request message comprises the specified URL and the reply message comprises the category code.
  • a cache structure comprising: a hash array comprising one or more index elements, each index element comprising a host tree pointer and a hash key derived from a stored URL; and one or more host trees depending from the index elements of the hash array, each host tree comprising one or more tree nodes each holding URL data representing stored URLs and associated category codes; and an age list to list each of the tree nodes by age, wherein the age list comprises, within each tree node, a next pointer (827) and a previous pointer (828) which refer to a next older tree node and a previous newer tree node, respectively.
  • the terms 'module' or 'unit' used herein may include, but are not limited to, a hardware device, such as a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks.
  • FPGA Field Programmable Gate Array
  • ASIC Application Specific Integrated Circuit
  • elements of the invention may advantageously be configured to reside on an addressable storage medium and be configured to execute on one or more processors.
  • functional elements of the invention may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • the functional elements such as the components, modules and units discussed herein may be combined into fewer elements or further separated into additional elements.
  • Figure 1 is a schematic overview of a system and apparatus as employed in first preferred embodiments of the present invention;
  • Figure 2 is a schematic overview of a system and apparatus as employed in second preferred embodiments of the present invention;
  • Figure 3 shows an example of a uniform resource locator (URL);
  • Figure 4 shows part of a protocol stack appropriate for communication relating to the Internet;
  • Figure 5 is a schematic view of a preferred method for categorisation of URL requests;
  • Figure 6 shows a preferred format of a request message packet;
  • Figure 7 shows a preferred format of a reply message packet;
  • Figure 8 is a schematic overview of an example client gateway apparatus;
  • Figure 9 is a logical representation of a preferred structure of a category cache;
  • Figure 10 shows example data held within the category cache of Figure 9;
  • Figure 11 is a schematic overview of a preferred categorisation server apparatus;
  • Figure 12 is a schematic overview of a preferred licensing cache structure;
  • Figure 13 is a schematic overview of preferred licensing systems.
  • a user machine 10 is connected to the Internet 20 through an Internet gateway appliance or client gateway 12.
  • the preferred embodiments of the present invention are primarily applicable to the World Wide Web, whereby a web page 32 is provided in response to a URL request sent under HTTP.
  • the user machine 10 provides a web browser application which initiates a URL request 11 in order to obtain content, i.e. a web page 32, from a content server or host 30.
  • the web page 32 may take any suitable form, most commonly being text and graphics in HTML format. It will be appreciated however that the present invention is applicable to other forms of content provided over the Internet using URLs, such as file transfers under FTP or connection to a TELNET server.
  • the preferred embodiments of the present invention place each requested URL into one of a predetermined set of categories.
  • Specific downstream actions for controlling or monitoring Internet access such as filtering or logging functions, are not particularly relevant to the present invention and may take any suitable form.
  • the preferred embodiment provides eight core categories such as "adult/sexual explicit", "criminal skills", "drugs, alcohol, tobacco", "violence" or "weapons", as well as thirty-two productivity-related categories such as "advertisements", "games", "hobbies and recreation" or "kids sites".
  • Providing this predetermined set of categories allows a more sophisticated rules-based filtering or logging function. For example, a rule is used to alert an administrator when a request is made for any of the core categories, or to block selected productivity categories at particular times, allowing access only at, say, lunchtimes or outside work hours.
  • the preferred categories may also include "don't know" or "not found" options.
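  • As an illustration only, the rule described above can be sketched as follows. The source does not define a rule format; the function name `decide`, the working-hours schedule and the choice of which productivity categories to restrict are all assumptions made for this sketch.

```python
from datetime import time

# Category names quoted in the text; treating them as plain strings is an assumption.
CORE_CATEGORIES = {
    "adult/sexual explicit", "criminal skills",
    "drugs, alcohol, tobacco", "violence", "weapons",
}
# Hypothetical selection of productivity categories to restrict.
RESTRICTED_PRODUCTIVITY = {"games", "hobbies and recreation"}

# Hypothetical schedule: work 09:00-17:00 with lunch 12:00-13:00.
WORK_START, WORK_END = time(9, 0), time(17, 0)
LUNCH_START, LUNCH_END = time(12, 0), time(13, 0)


def decide(category: str, now: time) -> str:
    """Return 'alert', 'block' or 'allow' for a categorised request."""
    if category in CORE_CATEGORIES:
        return "alert"  # notify an administrator for any core category
    if category in RESTRICTED_PRODUCTIVITY:
        in_work_hours = WORK_START <= now < WORK_END
        at_lunch = LUNCH_START <= now < LUNCH_END
        if in_work_hours and not at_lunch:
            return "block"  # only allowed at lunchtime or outside work hours
    return "allow"
```

  • The categorisation step itself is independent of such rules: the rule only consumes the category, so the filtering or logging policy can change without touching how categories are obtained.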
  • the user machine 10 provides input and output interface functions appropriate for a human user, suitably including a display screen, speakers, and control keys or
  • the user machine 10 is a computing platform such as a desktop computer, a laptop computer, or a personal digital assistant (PDA).
  • PDA personal digital assistant
  • the user machine 10 is a function-specific Internet appliance, such as a web-TV.
  • the user machine 10 is a public Internet kiosk, in this case also shown as including a voice telephone.
  • the user machine 10 and the client gateway 12 are formed as physically separate devices and communicate by any appropriate wired or wireless link. In other embodiments the client gateway 12 is integrated within the user machine 10.
  • the client gateway 12 suitably includes a modem, such as an analogue, ISDN or ADSL modem, which connects to an Internet Service Provider (ISP) 21 over the plain old telephone system (POTS) or other wired or optical network to provide a network layer connection to the Internet 20.
  • ISP Internet Service Provider
  • POTS plain old telephone system
  • the client gateway 12 connects to the Internet 20 through a wireless network or cellular mobile network such as GSM or GPRS.
  • the client gateway 12 connects to the Internet 20 through an intermediary such as a LAN or WAN, optionally over a virtual private network (VPN).
  • VPN virtual private network
  • the client gateway 12 acts as a router and forwards data packets between computers or computer networks.
  • the client gateway 12 directs packets between the user machine 10 and the ISP 21. Routers typically use packet headers and forwarding tables to determine the best path for forwarding each data packet.
  • the client gateway 12 typically has relatively limited computing resources.
  • the client gateway is a router having an Intel IXP422 processor, 64MB RAM and 16MB of Flash memory. There is no hard disk or other large-capacity storage device within the client gateway.
  • the client gateway may also perform other functions, typically acting as a combined modem, router, firewall, local network switch or VPN client, or any combination thereof. Hence, there is strong competition for resources in order to accommodate some or all of these functions within a single low-cost device.
  • the monitoring or controlling function relies, as an initial step, on placing requested URLs into categories.
  • the client gateway 12 typically has only limited available processor, memory and storage resources. Hence, there is a strong need to minimise resources used within the client gateway 12 when providing an Internet access controlling or monitoring function.
  • Figure 2 shows a second example system and apparatus as employed in an alternative embodiment of the present invention.
  • a client computer 12 is part of a Local Area Network (LAN) which also includes a proxy server 14 coupled to the Internet 20.
  • the client computer 12 makes URL requests in order to receive web pages from a content server 30 available over the Internet 20.
  • the URL requests are processed through the proxy server 14. It is desired to monitor or control Internet access at the client computer 12.
  • the present invention is particularly applicable where the client computer 12 has relatively limited processor, memory or storage resources, such as a terminal or a diskless workstation.
  • the client 12 (i.e. the client gateway 12 of Figure 1 or the client computer 12 of Figure 2) sends a request message 500 to a server computer 40 hosting a categorisation service 400.
  • the request message 500 identifies a specified URL, such as extracted from an HTTP URL request.
  • This categorisation server 40 identifies one of the predetermined set of categories appropriate to the specified URL, and sends a reply message 600 to the client 12.
  • the reply message 600 identifies the appropriate category, which the client 12 then employs to perform the desired monitoring or controlling function.
  • This arrangement reduces resource requirements at the client 12, and allows the categorisation server 40 to run on a large and powerful computing system with plenty of processing power, memory and storage space.
  • This categorisation service 400 may take any suitable form. For example, upon receiving the URL categorisation request 500, the categorisation service 400 looks up an appropriate category for the specified URL using a category database. Additionally or alternatively, the categorisation service employs a linguistic or other analysis of the specified URLs to determine an appropriate category, with or without human intervention and review.
  • a first aspect of the present invention concerns an improved protocol for communication between first and second computing platforms, in this example between the client 12 and the categorisation server 40, when making requests to place URLs into categories.
  • FIG. 3 shows the standard format of a uniform resource locator (URL), as described in detail in RFC1738.
  • the URL 200 includes a host portion 202 and a page portion 204.
  • the host portion 202 identifies a particular host
  • a root page (i.e. "www.host.com/") at the host is conveniently shown by giving the host portion 202 as "www.host.com" and the page portion 204 as "/".
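  • A minimal sketch of this split, using Python's standard `urllib.parse`. The helper name `split_url` is hypothetical, and ignoring any query string or fragment is an assumption, since the text only discusses host and page portions.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_url(url: str) -> Tuple[str, str]:
    """Split a URL into (host portion, page portion).

    A URL with no path is treated as the root page "/", matching the
    convention described for "www.host.com/".
    """
    # urlsplit only recognises the host when a scheme is present,
    # so a bare "www.host.com/..." is given a default scheme first.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    page = parts.path or "/"
    return host, page
```

  • For example, `split_url("www.host.com")` gives the host portion "www.host.com" and the page portion "/".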
  • FIG. 4 shows part of a standard protocol stack appropriate for communication relating to the Internet, as described in more detail in RFC760 and elsewhere.
  • the Internet Protocol IP
  • IP Internet Protocol
  • the basic function of the Internet Protocol is to move datagrams from a source address to a destination address.
  • HTTP hypertext transfer protocol
  • FIG. 4 shows a Transmission Control Protocol (TCP) as defined for example in RFC761 and a User Datagram Protocol (UDP) as defined for example in RFC768.
  • TCP Transmission Control Protocol
  • UDP User Datagram Protocol
  • TCP is ideal for applications which require reliable delivery of data in a specified order.
  • TCP sets up a connection between hosts, which is maintained open for the duration of a session. Whilst reliable, TCP has a relatively large overhead.
  • UDP is a fast and lightweight protocol, but is relatively unreliable. In particular, delivery and duplication protection are not guaranteed.
  • UDP is connectionless, with no handshaking or acknowledgements between hosts. Hence, neither of these messaging protocols is suited to carrying requests and replies concerning URL categorisation.
  • FIG. 5 is a schematic view of a preferred method for categorisation of URL requests, according to an embodiment of the present invention.
  • a URL request is received at step 401, and a request message 500 is sent at step 402.
  • a reply message 600 is received at step 403, and a URL category is determined at step 404.
  • the request message 500 and the reply message 600 are each sent as the payload of a UDP packet. Surprisingly, it has been found that the unreliable and limited messaging capability of UDP can be employed to advantage in the context of categorisation of URLs. However, in order to use UDP, additional steps are taken by the present invention to adapt the protocol. More detailed explanation of the request message 500 and the reply message 600 now follows.
  • Figure 6 shows a preferred format of the request message packet 500, which includes an Ethernet packet header 501, an IP header 502, a UDP header 503, a UDP payload 504, and an Ethernet trailer 505. These are all formatted according to existing protocols.
  • the UDP payload 504 is divided to form a request message header section 510 and a request message data section 520.
  • the header section 510 comprises a sequence number 511 and a time stamp 512, and suitably a command identity 513, a data size 514, and a licensing field 515.
  • the sequence number 511 allows the request message 500 to be uniquely identified and distinguished from other request messages.
  • the sequence number 511 is generated upon creation of the request message 500 within the client 12, suitably as an incremental value circling between 0 and 65535.
  • each client-side socket exists only for the duration of a request-reply cycle and hence each request is assigned a different port value by the host process within, in this example, the client 12.
  • the sequence number 511 allows a reply to be matched up with an originating request message 500.
  • the time stamp 512 enables calculation of timeouts.
  • the client 12 originating the request message 500 waits a predetermined length of time for a reply message 600, and then re-tries for a predetermined number of times.
  • the timeout is increased after each resend, with an exponential back off (e.g. 2, 4 and then 8 seconds for a maximum retry count of 3).
  • sequence number 511 and the time stamp 512 together provide excellent reliability, whilst adding only minimal overhead.
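  • The wrapping sequence number and the exponential back off can be sketched as below. The function names and the default initial timeout are illustrative assumptions; only the 0 to 65535 range and the 2, 4, 8 second example come from the text.

```python
import itertools
from typing import Iterator, List


def sequence_numbers(start: int = 0) -> Iterator[int]:
    """Yield incremental sequence numbers circling between 0 and 65535."""
    n = start & 0xFFFF
    while True:
        yield n
        n = (n + 1) & 0xFFFF  # wrap back to 0 after 65535


def timeout_schedule(initial: float = 2.0, max_retries: int = 3) -> List[float]:
    """Timeouts in seconds for each attempt, doubling after every resend."""
    return [initial * (2 ** attempt) for attempt in range(max_retries)]


# Wrap-around near the top of the range.
wrapped = list(itertools.islice(sequence_numbers(65534), 4))
```

  • With the defaults, `timeout_schedule()` reproduces the 2, 4 and 8 second example for a maximum retry count of 3.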
  • the command ID field 513 allows the request message to perform different command functions. In most cases, the command ID is set to "1" in order to request categorisation of a URL. Also, the request message uses a command ID of "2" to request that the categorisation server 40 provide a current list of categories, or a command ID of "3" to confirm a current list version and determine whether an update is required. Other commands can be defined as appropriate. Hence, the command ID field 513 brings increased flexibility and allows the system to perform additional functions.
  • the data section 520 contains data representing a specified URL 200.
  • the URL data 520 includes a host portion 202 and, where appropriate, a URL path portion 204.
  • the request data 520 is encrypted, preferably with a secret-key block encryption algorithm such as RC2 which is described in detail at RFC2268. Encryption of the data section 520 improves security and privacy. However, encrypting only the data section 520 minimises both encryption workload and transmission overhead.
  • the size of the encrypted data section 520 is stored as the data size field 514 in the request header 510
  • the licensing field 515 optionally transmits a licence identity relevant to the originator of the request message 500.
  • the licence identity is suitably associated with the client 12 or optionally the user machine 10.
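  • A sketch of how such a UDP payload might be assembled. The text names the header fields but not their widths or byte order, so the layout below (network byte order, 16-bit sequence number, 32-bit time stamp, 8-bit command ID, 16-bit data size, 32-bit licence identity) is purely an assumption, as are the names `REQUEST_HEADER` and `build_request`. Encryption of the data section is left out; `data` is taken to be already-encrypted bytes.

```python
import struct

# Assumed field layout; the source does not specify sizes or ordering.
REQUEST_HEADER = struct.Struct("!HIBHI")

# Command IDs as given in the text.
CMD_CATEGORISE = 1      # request categorisation of a URL
CMD_CATEGORY_LIST = 2   # request the current list of categories
CMD_LIST_VERSION = 3    # confirm the current list version


def build_request(seq: int, timestamp: int, command: int,
                  data: bytes, licence: int = 0) -> bytes:
    """Return the UDP payload: request header followed by the data section."""
    header = REQUEST_HEADER.pack(seq, timestamp, command, len(data), licence)
    return header + data


payload = build_request(7, 1000, CMD_CATEGORISE, b"www.host.com/")
```

  • Storing the data size in the header lets the receiver locate the end of the data section without parsing it.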
  • FIG. 7 is a schematic representation of a reply message 600 as generated by the categorisation server 40 and sent to the client 12.
  • the reply message 600 includes a UDP payload comprising a response header 610 and a response data section 620.
  • the response header 610 comprises a sequence number 611 and a time stamp 612, preferably with a command ID 613, all copied from a corresponding received categorisation request message 500.
  • a data size 614 gives a size of the following response data section 620.
  • a status code 615 denotes a status. This is usually simply "success", but occasionally relates to one of a predetermined set of error statuses.
  • the response data 620 is formatted according to the relevant command ID 613 and is preferably encrypted, such as with RC2.
  • the response data 620 comprises a category 621, a match length 622, and an exact flag 623.
  • the category 621 identifies one amongst a predetermined set of categories for the URL sent in the request data 520, suitably as a numerical value (e.g. category "27" might denote sports-related web pages).
  • the exact flag 623 indicates whether the requested URL 520 was matched exactly. If only a partial match was obtained, such as a match with only the host portion 202 or only part of the URL path 204, then a match length is given in the match length field 622.
  • the match length gives the number of characters of the specified URL 520 which were matched with a stored URL at the server 40.
  • the character count is taken along the host portion 202 or the path portion 204, or both. In the preferred embodiment, the count is taken along the path portion 204 only.
  • a match on the root page "/" counts as one character.
  • the response data 620 contains other data such as a category list specifying a predetermined list of categories, or a version identity which identifies a current version of the category list being used by the categorisation server 40.
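  • The client-side handling of a categorisation reply can be sketched as follows. Field widths, byte order and the numeric value for "success" are assumptions not stated in the text, and decryption of the response data is omitted. The names `parse_reply`, `REPLY_HEADER` and `REPLY_DATA` are hypothetical.

```python
import struct
from typing import Optional

# Assumed layouts: header = sequence, time stamp, command ID, data size, status;
# data = category, match length, exact flag.
REPLY_HEADER = struct.Struct("!HIBHB")
REPLY_DATA = struct.Struct("!HHB")
STATUS_SUCCESS = 0  # assumed encoding of the "success" status


def parse_reply(payload: bytes, expected_seq: int) -> Optional[dict]:
    """Parse a reply payload; return None if it does not answer our request."""
    seq, _timestamp, _command, size, status = REPLY_HEADER.unpack_from(payload, 0)
    if seq != expected_seq:
        return None  # stale or duplicated reply for another request
    if status != STATUS_SUCCESS:
        raise ValueError("categorisation server returned error status %d" % status)
    start = REPLY_HEADER.size
    category, match_length, exact = REPLY_DATA.unpack(payload[start:start + size])
    return {"category": category, "match_length": match_length, "exact": bool(exact)}
```

  • Matching on the copied sequence number is what lets a connectionless transport discard late or duplicated replies.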
  • the request message 500 and reply message 600 each use the payload section of a UDP packet, which usually has a maximum size of 65 KB. However, as defined by the MTU (Maximum Transmission Unit) of the network, an Ethernet physical layer packet has a maximum size of just 1500 bytes. Even so, in the present invention almost all of the request and reply messages 500, 600 for categorisation of URLs fit within the very limited size constraints of a single Ethernet packet, thus avoiding fragmentation.
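  • As a rough check of this size constraint, the sketch below assumes a 1500-byte MTU, a 20-byte IPv4 header without options and the 8-byte UDP header, which leaves 1472 bytes for the message. The helper name `fits_single_frame` is hypothetical.

```python
ETHERNET_MTU = 1500   # bytes available to the IP datagram
IPV4_HEADER = 20      # IPv4 header without options (assumed)
UDP_HEADER = 8        # fixed UDP header size

MAX_UDP_PAYLOAD = ETHERNET_MTU - IPV4_HEADER - UDP_HEADER


def fits_single_frame(payload: bytes) -> bool:
    """True if the message can be sent without IP fragmentation."""
    return len(payload) <= MAX_UDP_PAYLOAD
```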
  • FIG 8 shows the client 12 in more detail, including an interface module 121, a communication module 122, a protocol module 123 and an encryption module 124.
  • the interface module 121 presents the URL categorisation function to a client application, such as to a web browser or an HTTP function (not shown).
  • the interface is suitably an API (application programming interface) to the client software.
  • the interface module 121 is passed a URL from the client software, and returns a categorisation code 621, preferably with a match length 622 and an exact flag 623.
  • the communication module 122 sends outgoing data to the categorisation server 40 and receives and buffers incoming data, including making retransmission requests as necessary.
  • the protocol module 123 interprets the incoming and outgoing data according to the protocol discussed above with reference to Figures 5, 6 & 7 and makes encryption/decryption calls to the encryption module 124.
  • the encryption module 124 encrypts and decrypts data.
  • the communication module 122 calculates a retransmission timeout for every sent request. To be effective, it is desired that the timeout interval take account of vastly varying network conditions, and adapt accordingly. This helps to eliminate both unnecessary retransmissions and unrealistically high timeout periods.
  • the number of retries is configurable such as through a user interface.
  • the preferred method for calculating the retransmission timeout "rto" includes (a) measuring the round-trip time "mt" for each request, (b) maintaining an estimate of the smoothed round-trip time "srtt", and (c) maintaining an estimate of the smoothed mean deviation "smd".
  • the estimates are calculated as:
  • timeout value is calculated as:
  • this formula is quickly calculated using fixed-point arithmetic and bit shifts.
  • next timeout is exponentially increased by:
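  • The formulas themselves are not reproduced in this text. As a hedged illustration only, the sketch below uses the gains familiar from the standard TCP retransmission timer (1/8 for the smoothed round-trip time, 1/4 for the smoothed mean deviation, and a timeout of srtt plus four deviations); these constants, the doubling on retry and the class name `RtoEstimator` are assumptions, not the patent's stated method. Integer milliseconds and bit shifts stand in for the fixed-point arithmetic mentioned above.

```python
class RtoEstimator:
    """Integer (millisecond) estimator using shifts instead of division."""

    def __init__(self, first_mt_ms: int) -> None:
        self.srtt = first_mt_ms        # smoothed round-trip time
        self.smd = first_mt_ms >> 1    # smoothed mean deviation, seeded at mt/2

    def update(self, mt_ms: int) -> None:
        """Fold a newly measured round-trip time "mt" into the estimates."""
        err = mt_ms - self.srtt
        self.srtt += err >> 3                   # srtt += (mt - srtt) / 8
        self.smd += (abs(err) - self.smd) >> 2  # smd += (|err| - smd) / 4

    @property
    def rto(self) -> int:
        return self.srtt + (self.smd << 2)      # rto = srtt + 4 * smd


def backed_off(rto_ms: int, retry: int) -> int:
    """Assumed back off: double the timeout for each retry already made."""
    return rto_ms << retry
```

  • Because every operation is an add, subtract or shift, an estimator of this kind is cheap enough for a gateway with limited processor resources.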
  • the preferred embodiment of the present invention has many advantages, including in particular minimising overhead when requesting categorisation of URL requests and minimising workload at the gateway appliance 12.
  • the preferred embodiment employs UDP for speed and simplicity, whilst adding a sequence number and time stamp to improve reliability.
  • Figure 8 shows that the client 12 preferably comprises a category cache 125.
  • the category cache 125 stores URL categories by storing response data 620 from each categorisation request 500. Since users often navigate to a limited set of favourite web pages time and again, the category cache 125 significantly reduces traffic over the Internet 20 by avoiding duplication of requests for categorisation of the same URL or a child page from the same host or directory.
  • Figure 9 is a logical representation showing a preferred structure of the category cache 125.
  • the cache is structured for both lookups of stored URLs, and also for aging of the cache to ensure that the cache remains within a predetermined maximum memory size. These two functions, namely lookup and aging, are combined so that both share the same nodes in the cache structure, which reduces cache size requirements.
  • the cache 125 is compact and so occupies only a relatively small footprint within the memory of the client 12, whilst still recording valuable data in a manner that is readily searchable and updateable.
  • the method of the present invention preferably includes the step 405 of adding the determined URL category to the category cache 125.
  • the cache structure comprises a hash array 810, and combined host trees and age list 820.
  • the host portion 202 of each URL is hashed to produce an index 811 in the hash array 810.
  • Many hosts may produce the same hash index 811, and each array element is a pointer to a root tree node of a host tree 820.
  • Hosts with the same hash are searched through the host tree 820, which is preferably a balanced red-black tree where each node has a red/black bit to colour the node red or black.
  • Each node 821 comprises a host string 822 holding a host portion 202, and optionally an array of pages 823 for the specified host 822.
  • Left and right pointers 825, 826 are used for searching the tree 820.
  • Each node also includes next and previous pointers 827,828 which refer to a next (older) node and a previous (newer) node, respectively, for aging.
  • each node includes a parent node pointer 824 to allow for fast node deletions.
  • next and previous node pointers 827,828 allow the nodes to be arranged in order by age. New nodes are added to the head of the age list, and old nodes are removed from the tail. When the cache is full and has reached a predetermined maximum size, the oldest node is removed to make room for a new URL to be added in a new host node. Conveniently, the age list is refreshed, in order to keep the most recently accessed nodes at the head of the age list.
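  • The aging behaviour can be illustrated with a much simpler structure than the one described. The sketch below swaps in Python's `OrderedDict` for the hash array, red-black host trees and next/previous pointers, and limits the cache by entry count rather than by bytes; both substitutions are simplifications made for illustration, and the class name `AgedCategoryCache` is hypothetical.

```python
from collections import OrderedDict
from typing import List, Optional


class AgedCategoryCache:
    """Host -> category cache whose ordering doubles as the age list."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._nodes: "OrderedDict[str, int]" = OrderedDict()  # head = newest

    def get(self, host: str) -> Optional[int]:
        if host not in self._nodes:
            return None
        # Refresh: a recently accessed node moves to the head of the age list.
        self._nodes.move_to_end(host, last=False)
        return self._nodes[host]

    def put(self, host: str, category: int) -> None:
        if host not in self._nodes and len(self._nodes) >= self.max_entries:
            self._nodes.popitem(last=True)  # remove the oldest node from the tail
        self._nodes[host] = category
        self._nodes.move_to_end(host, last=False)  # new nodes join at the head

    def hosts_by_age(self) -> List[str]:
        """Hosts ordered from newest to oldest."""
        return list(self._nodes)
```

  • As in the described structure, lookup and aging share the same entries, so no second index is needed to find the oldest node.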
  • the memory footprint of the category cache 125 is configured in bytes, in order to determine the maximum size occupied by the hash array 810 and tree list 820.
  • the size may be configured in use through a control panel, or determined automatically according to needs of the client and thereby balance available resources amongst neighbouring functions.
  • the hash array 810 has a predetermined length, which is ideally a prime number for better hash distribution.
  • the hash array length is suitably dynamically configurable, such as by being a variable which is input from a control panel during use. A longer hash array yields faster categorisations, but uses more memory.
  • the hashing algorithm is suitably MD4 or MD5.
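As an illustration of how a host portion might be mapped to an index in a prime-length hash array, the following sketch uses MD5 from the Python standard library. The function name `hash_index`, the default length 1009 and the choice to reduce the whole digest modulo the array length are assumptions, not details taken from the text.

```python
import hashlib


def hash_index(host: str, array_len: int = 1009) -> int:
    """Map a host string to an index in a hash array of array_len slots.

    1009 is prime; a prime length tends to spread indices more evenly.
    """
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Interpret the 16-byte digest as an integer and reduce it (assumption).
    return int.from_bytes(digest, "big") % array_len
```

Different hosts can still share an index, which is why each array element points to a tree of hosts rather than to a single host.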
  • a URL host portion 202 and a URL path 204 are extracted from a URL request 11 within HTTP or equivalent.
  • the host portion 202 is hashed to determine an index 811 in the hash array 810, and the respective host tree 820 is searched to locate a node 821 matching the host portion 202.
  • the URL path portion 204 is then searched against the page array 823.
  • Figure 10 shows example data held in the host string 822 and the page array 823.
  • the host string 822 includes the host portion 902.
  • a category code 906 and a children flag 908 are provided for the host, or else these can be presented in a root page.
  • the page array includes, for the or each page, a page string 904, a category code 906 for that page or directory, and a children flag 908.
  • the host is "www.host.com" and a searched URL path is "/directory_1/page_1".
  • the entry for the page string 904 "/directory_1" has a children flag 908 of "yes", which shows that specific category codes are available for children of this path.
  • the cache shows that "/directory_1/page_9" has already been cached, but there is currently no entry for the searched page string "/directory_1/page_1".
  • the cache 125 has therefore failed to provide a category for the requested URL.
  • a request message 500 is generated to determine the code for the specified URL, i.e. for host "www.host.com" and the path "/directory_1/page_1".
  • alternatively, the children flag 908 for the page "/directory_1" is set to "no", which allows a cache result to be returned with confidence for the searched page based on a partial match. For example, if the children flag for "/directory_1" is set to "no", then a confident category code is returned for the requested "/directory_1/page_1" based on a partial match with "/directory_1" as a parent of the requested child page.
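The page lookup with the children flag can be sketched as below. This is an illustrative reading of the example, not a definitive implementation: the page array is modelled as a dictionary from page string to a `(category_code, has_children)` pair, the category codes 12 and 7 are invented, and `lookup_category` is a hypothetical name.

```python
def lookup_category(pages, path):
    """Return a category code for path, or None on a cache miss.

    pages maps a page string to (category_code, has_children).
    """
    if path in pages:
        return pages[path][0]          # exact match
    parent = path
    while "/" in parent:
        parent = parent.rsplit("/", 1)[0]   # step up to the parent path
        if parent in pages:
            category, has_children = pages[parent]
            # A parent whose children flag is "no" answers for all children;
            # a "yes" flag means a child may differ, so report a miss.
            return None if has_children else category
    return None


pages_yes = {"/directory_1": (12, True), "/directory_1/page_9": (7, False)}
pages_no = {"/directory_1": (12, False)}
```

With `pages_yes`, a lookup of "/directory_1/page_1" misses and would trigger a request message, whereas with `pages_no` the same lookup returns the parent's category from the partial match.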
  • the cache 125 is suitably built by storing data from request messages 500 and reply messages 600.
  • the request message 500 identifies the specified URL with the host portion 202 and the page portion 204 conveniently provided as a delimited character string.
  • the host portion 202 forms the host string 902.
  • the exact flag 623 determines the children flag 908.
  • the match length field 622 determines a truncation point for the specified URL as a number of characters.
  • the category code field 621 provides the category code 906.
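A rough sketch of how a cache entry might be derived from a reply message follows. Several points are assumptions made only for illustration: the match length is taken to count characters of the host followed by the path, the exact flag is copied directly into the children flag, and the function name and tuple layout are hypothetical.

```python
def cache_entry_from_reply(host, path, category_code, match_length, exact):
    """Build (host string, page string, category code, children flag).

    match_length truncates the URL; it is assumed to count characters
    of host + path. Copying `exact` into the children flag is an assumption.
    """
    url = host + path
    matched = url[:match_length]
    # Whatever remains after the host is the page string; "/" if nothing does.
    page = matched[len(host):] or "/"
    return host, page, category_code, bool(exact)
```

For example, a reply with match length 24 for "www.host.com" plus "/directory_1/page_1" would store the category against "/directory_1" rather than against the full page.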
  • the gateway appliance 12 preferably further includes a custom cache 126 alongside the category cache 125.
  • the custom cache 126 records a customised list of categorisations.
  • the custom cache 126 is used to override other categorisations, or to add supplementary URLs.
  • the custom cache 126 is structured identically to the category cache 125. Searches are preferably conducted in order through the custom cache 126, then if necessary the category cache 125, and finally if necessary by generating a request message 500 to the categorisation server 40.
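The search order can be sketched as a simple fall-through. In this sketch both caches are plain dictionaries keyed by URL, and `ask_server` is a hypothetical callable standing in for the request and reply exchange with the categorisation server; storing the server's answer in the category cache follows the earlier description of adding determined categories to that cache.

```python
def categorise(url, custom_cache, category_cache, ask_server):
    """Resolve a category: custom cache, then category cache, then server."""
    for cache in (custom_cache, category_cache):
        if url in cache:
            return cache[url]
    category = ask_server(url)       # stands in for request/reply messages
    category_cache[url] = category   # later requests are served locally
    return category
```

Because the custom cache is consulted first, an entry there overrides whatever the category cache or the server would return for the same URL.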
  • the custom cache 126 does not perform any URL aging, so that a user has full control over the size and content of the custom cache 126. In this case, the previous and next pointers 827,828 are not required or are left unused.
  • the category cache 125 and/or the custom cache 126 can be cleared completely and then rebuilt with fresh data, such as after a reset operation.
  • each cache 125,126 may also be given a partial clear out, such as deleting all hosts 822 or pages 823 with a specified category code.
  • the cache structure described with reference to Figures 8 and 9 enables convenient cache management, whilst being efficient to operate.
  • Figure 11 is a schematic view of the categorisation server 40 including a main module 410, a communication module 420, a protocol module 430 and an encryption module 440.
  • the main module 410 initialises the categorisation service and creates worker threads.
  • the communication module 420 receives and buffers data and responds to categorisation requests including generation of reply messages 600.
  • the protocol module 430 unmarshals incoming data into a comprehensible command format and marshals outgoing data into a transmittable format, and makes encryption/decryption calls to the encryption unit 440 where required.
  • the encryption unit 440 encrypts and decrypts data, preferably according to the RC2 algorithm.
  • the categorisation service 400 running on the categorisation server 40 performs a licensing process.
  • This licensing process controls access to the categorisation service, such as for security and to enable paid-for subscription based implementations.
  • the licensing process employed in the preferred embodiments of the present invention is highly flexible and is readily integrated with other existing licensing mechanisms.
  • each request message 500 preferably includes a licensing field 515 which carries data such as a licence key.
  • the licensing field 515 is subdivided into a partner ID field 516 and a client ID field 517.
  • the partner ID field 516 allows a plurality of different licensing schemes to exist in parallel, each having different requirements or validation processes.
  • the categorisation service 400 comprises a licensing module 450 associated with the main module 410, which performs validation of the supplied licensing field 515.
  • the licensing module 450 receives the licensing field 515 and returns a "licence valid" or "licence invalid" status which controls whether or not the categorisation server 40 will respond to a categorisation request message 500.
  • the licensing module 450 runs as a dynamically linked library (DLL).
  • the categorisation service 400 includes a plurality of licensing DLLs 450, one of which is called to validate the licensing field 515 according to the partner ID field 516. This allows different licensing schemes to be applied for different clients.
  • the partner ID field 516 is 4 bytes long, giving up to 65535 licensing partner identities.
  • the client ID field 517 is suitably up to 60 printable characters long, allowing room for any appropriate secure licensing mechanism.
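Splitting the licensing field into its two parts might look like the sketch below. The byte layout is an assumption: the first 4 bytes are treated as an opaque partner ID and the remainder as an ASCII client ID of at most 60 printable characters. The function name `split_licensing_field` is hypothetical.

```python
def split_licensing_field(field: bytes):
    """Split a licensing field into (partner_id bytes, client_id string)."""
    if len(field) < 4:
        raise ValueError("licensing field shorter than the partner ID")
    partner_id = field[:4]                  # kept as raw bytes (assumption)
    client_id = field[4:].decode("ascii")   # raises ValueError if not ASCII
    if len(client_id) > 60:
        raise ValueError("client ID longer than 60 characters")
    if not all(ch.isprintable() for ch in client_id):
        raise ValueError("client ID contains non-printable characters")
    return partner_id, client_id
```

The partner ID would then select which licensing scheme validates the client ID.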
  • the categorisation server 40 preferably comprises a license cache 455 to store recently encountered license fields 515.
  • the licensing process comprises first checking whether the received licensing field 515 is stored in the licensing cache 455 and, if it is not, calling the licensing validation DLL 450. Suitably, the result of each licensing call is then added to the licensing cache 455 and is then available for subsequent requests from that client 12. Since clients tend to access the Internet in short bursts of activity, it is likely that one categorisation request 500 will be followed by another soon after.
  • the license cache 455 significantly improves response speed for second and subsequent requests.
  • Figure 12 is a schematic overview of the structure of the licensing cache 455. The structure is similar to that of the category cache 125 as discussed above with reference to Figure 9.
  • the licensing cache 455 comprises a hash array 1210 and one or more combined license trees and age list 1220.
  • the hash array 1210 comprises index elements 1211 as a hash of license keys from the licensing field 515, each of which is a pointer to a licence tree list 1220.
  • Each tree node 1221 comprises a license string 1222 holding a license key and a corresponding license result
  • the cache can hold solely valid keys, solely invalid keys, or, as in this example, a mixture of both, according to the circumstances of a particular implementation.
  • each tree node 1221 comprises parent, left and right pointers 1223,1224,1225 defining the tree structure.
  • This example shows a balanced red/black tree using a red/black flag 1228.
  • the license trees 1220 also function as an age list to list each of the tree nodes 1221 by age.
  • the age list comprises, within each tree node 1221, a next pointer 1226 and a previous pointer 1227 which refer to a next older tree node and a previous newer tree node, respectively.
  • the license cache 455 is actively managed to reside within a predetermined memory size.
  • Older tree nodes 1221 are deleted from a tail of the age list by referring to the next and previous pointers 1226,1227, whilst new nodes are added to the head of the age list.
  • the age list is updated after each access to keep recently accessed nodes at the head of the list.
  • the license cache is preferably flushed, in whole or in part, such as at scheduled regular timed intervals or following triggering events such as a reset.
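The behaviour of the licence cache (check the cache first, validate on a miss, keep recent keys at the head, drop the oldest from the tail, and allow a flush) can be approximated with an ordered dictionary. This is a sketch under stated assumptions: `collections.OrderedDict` replaces the hash array and red-black trees, the limit is a number of entries rather than a memory size, and `LicenceCache` and `validate` are hypothetical names.

```python
from collections import OrderedDict


class LicenceCache:
    def __init__(self, validate, max_entries=1000):
        self._validate = validate        # callable: licence key -> bool
        self._entries = OrderedDict()    # first item = head (newest)
        self._max = max_entries

    def is_valid(self, key):
        """Return the cached result, or validate and cache it on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key, last=False)   # refresh to head
            return self._entries[key]
        result = self._validate(key)
        self._entries[key] = result
        self._entries.move_to_end(key, last=False)
        while len(self._entries) > self._max:
            self._entries.popitem(last=True)             # drop from tail
        return result

    def flush(self):
        """Clear the whole cache, e.g. on a timer or after a reset."""
        self._entries.clear()
```

Both valid and invalid results are stored here, matching the mixed-cache example; a deployment that caches only one kind would filter before storing.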
  • Figure 13 shows example licensing schemes in more detail.
  • the categorisation service 400 makes calls to a license interface DLL 1350, which in turn makes calls to one of a plurality of partner licence DLLs 1360.
  • the license interface DLL 1350 optionally includes the license cache 455.
  • the licence interface DLL first consults the licence cache 455 and then, if necessary, requests licence validation by one of the partner licence DLLs 1360.
  • the licence interface DLL 1350 resolves the partner ID field 516 by referring to a partner map database 1352, which links the partner ID
  • the partner licence DLLs 1360 include a no license DLL 1361 which simply indicates that any licence key is valid. This allows the system to run a default "no problem" licence mode prior to implementation of licence schemes which actively validate licence keys.
  • a no database DLL 1362 performs a mathematical, algorithmic or cryptographic validation of the licence key.
  • a hosted licensing DLL 1364 is provided which forwards licensing requests to a remote licensing server 1370 for validation.
  • the licensing requests are sent over a local area network (LAN) or to a SOAP-based web service.
  • a database licensing DLL 1366 connects directly into an ODBC database 1380 using a stored procedure to validate the licence key.
  • a licence management interface 1382 is provided to manage the content of the licence database 1380.
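The dispatch from partner ID to a partner-specific validator can be sketched with a lookup table of functions. Everything specific here is hypothetical: the partner numbers, the toy checksum rule in `checksum_licence`, and the choice to treat an unknown partner as invalid are illustrative assumptions, and ordinary Python functions stand in for the DLLs.

```python
def no_licence(client_id):
    """Default mode: every licence key is accepted."""
    return True


def checksum_licence(client_id):
    """Toy algorithmic check (hypothetical): key is '<body>-<checksum>'."""
    body, _, check = client_id.rpartition("-")
    return body != "" and check == str(sum(body.encode("ascii")) % 97)


# Stand-in for the partner map: partner ID -> validator (assumed values).
PARTNER_MAP = {1: no_licence, 2: checksum_licence}


def validate(partner_id, client_id):
    """Route the client ID to the validator registered for the partner."""
    validator = PARTNER_MAP.get(partner_id)
    if validator is None:
        return False        # unknown partner treated as invalid (assumption)
    return validator(client_id)
```

A hosted or database-backed scheme would be registered in the same table, with the function body forwarding the key to the remote server or stored procedure.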
  • This aspect of the present invention has many advantages, as discussed above. Licensing is very useful in the context of controlling or monitoring Internet access by categorisation of URLs, and opens up many useful commercial and technical implementations of this technology. Further, the use of a licensing cache reduces time and resources for each validation and increases throughput. The cache is structured to be compact and is easily managed. The use of a partner ID field allows great flexibility and convenience to choose between available licensing schemes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to an apparatus, a method and a system for categorising Uniform Resource Locators (URLs) when accessing the Internet (20) from a client (12). A request message (500) is generated to request categorisation of a specified URL, and a reply message (600) states a category. The specified URL and the category are added to a category cache (125). On a second or subsequent instance of a URL request designating the specified URL, the category is determined from the category cache (125). This reduces communication traffic in a network such as the Internet (20).
PCT/GB2005/003465 2004-09-09 2005-09-09 System, method and apparatus for use in monitoring or controlling internet access WO2006027590A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA002577259A CA2577259A1 (fr) 2004-09-09 2005-09-09 System, method and apparatus for use in monitoring or controlling internet access

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0420024A GB2418999A (en) 2004-09-09 2004-09-09 Categorizing uniform resource locators
GB0420024.2 2004-09-09
US10/952,626 2004-09-28
US10/952,626 US7590716B2 (en) 2004-09-09 2004-09-28 System, method and apparatus for use in monitoring or controlling internet access

Publications (1)

Publication Number Publication Date
WO2006027590A1 true WO2006027590A1 (fr) 2006-03-16

Family

ID=35395784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2005/003465 WO2006027590A1 (fr) 2004-09-09 2005-09-09 System, method and apparatus for use in monitoring or controlling internet access

Country Status (2)

Country Link
CA (1) CA2577259A1 (fr)
WO (1) WO2006027590A1 (fr)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007136665A2 (fr) 2006-05-19 2007-11-29 Cisco Ironport Systems Llc Method and apparatus for controlling access to network resources based on reputation
WO2008069945A2 (fr) * 2006-12-01 2008-06-12 Websense, Inc. System and method of analyzing web addresses
US7924425B2 (en) 2005-06-27 2011-04-12 The United States Of America As Represented By The Department Of Health And Human Services Spatially selective fixed-optics multicolor fluorescence detection system for a multichannel microfluidic device, and method for detection
US8881277B2 (en) 2007-01-09 2014-11-04 Websense Hosted R&D Limited Method and systems for collecting addresses for remotely accessible information sources
US8938773B2 (en) 2007-02-02 2015-01-20 Websense, Inc. System and method for adding context to prevent data leakage over a computer network
US8959634B2 (en) 2008-03-19 2015-02-17 Websense, Inc. Method and system for protection against information stealing software
US8978140B2 (en) 2006-07-10 2015-03-10 Websense, Inc. System and method of analyzing web content
US9003524B2 (en) 2006-07-10 2015-04-07 Websense, Inc. System and method for analyzing web content
US9015842B2 (en) 2008-03-19 2015-04-21 Websense, Inc. Method and system for protection against information stealing software
US9117054B2 (en) 2012-12-21 2015-08-25 Websense, Inc. Method and aparatus for presence based resource management
US9130972B2 (en) 2009-05-26 2015-09-08 Websense, Inc. Systems and methods for efficient detection of fingerprinted data and information
US9130986B2 (en) 2008-03-19 2015-09-08 Websense, Inc. Method and system for protection against information stealing software
US9378282B2 (en) 2008-06-30 2016-06-28 Raytheon Company System and method for dynamic and real-time categorization of webpages
US9473439B2 (en) 2007-05-18 2016-10-18 Forcepoint Uk Limited Method and apparatus for electronic mail filtering
US9503423B2 (en) 2001-12-07 2016-11-22 Websense, Llc System and method for adapting an internet filter
US9565235B2 (en) 2000-01-28 2017-02-07 Websense, Llc System and method for controlling access to internet sites
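CN112738148A (zh) * 2019-10-28 2021-04-30 中兴通讯股份有限公司 Method, apparatus, device and readable storage medium for batch deletion of cached content
CN112804373A (zh) * 2020-12-30 2021-05-14 微医云(杭州)控股有限公司 Interface domain name determination method, apparatus, electronic device and storage medium
CN113810471A (zh) * 2021-08-18 2021-12-17 深圳市元征科技股份有限公司 Data transmission method, sending device and receiving device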

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120754A1 (en) * 2001-02-28 2002-08-29 Anderson Todd J. Category name service
US20030105863A1 (en) * 2001-12-05 2003-06-05 Hegli Ronald Bjorn Filtering techniques for managing access to internet sites or other software applications
US20040006621A1 (en) * 2002-06-27 2004-01-08 Bellinson Craig Adam Content filtering for web browsing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GREENFIELD P ET AL: "Access Prevention techniques for Internet Content Filtering", CSIRO, December 1999 (1999-12-01), XP002265027 *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9565235B2 (en) 2000-01-28 2017-02-07 Websense, Llc System and method for controlling access to internet sites
US9503423B2 (en) 2001-12-07 2016-11-22 Websense, Llc System and method for adapting an internet filter
US7924425B2 (en) 2005-06-27 2011-04-12 The United States Of America As Represented By The Department Of Health And Human Services Spatially selective fixed-optics multicolor fluorescence detection system for a multichannel microfluidic device, and method for detection
WO2007136665A2 (fr) 2006-05-19 2007-11-29 Cisco Ironport Systems Llc Method and apparatus for controlling access to network resources based on reputation
EP2033108A2 (fr) * 2006-05-19 2009-03-11 Cisco Ironport Systems LLC Method and apparatus for controlling access to network resources based on reputation
EP2033108A4 (fr) * 2006-05-19 2014-07-23 Cisco Ironport Systems Llc Method and apparatus for controlling access to network resources based on reputation
US8978140B2 (en) 2006-07-10 2015-03-10 Websense, Inc. System and method of analyzing web content
US9723018B2 (en) 2006-07-10 2017-08-01 Websense, Llc System and method of analyzing web content
US9680866B2 (en) 2006-07-10 2017-06-13 Websense, Llc System and method for analyzing web content
US9003524B2 (en) 2006-07-10 2015-04-07 Websense, Inc. System and method for analyzing web content
US9654495B2 (en) 2006-12-01 2017-05-16 Websense, Llc System and method of analyzing web addresses
WO2008069945A3 (fr) * 2006-12-01 2008-09-04 Websense Inc System and method of analyzing web addresses
WO2008069945A2 (fr) * 2006-12-01 2008-06-12 Websense, Inc. System and method of analyzing web addresses
US8881277B2 (en) 2007-01-09 2014-11-04 Websense Hosted R&D Limited Method and systems for collecting addresses for remotely accessible information sources
US9609001B2 (en) 2007-02-02 2017-03-28 Websense, Llc System and method for adding context to prevent data leakage over a computer network
US8938773B2 (en) 2007-02-02 2015-01-20 Websense, Inc. System and method for adding context to prevent data leakage over a computer network
US9473439B2 (en) 2007-05-18 2016-10-18 Forcepoint Uk Limited Method and apparatus for electronic mail filtering
US8959634B2 (en) 2008-03-19 2015-02-17 Websense, Inc. Method and system for protection against information stealing software
US9455981B2 (en) 2008-03-19 2016-09-27 Forcepoint, LLC Method and system for protection against information stealing software
US9130986B2 (en) 2008-03-19 2015-09-08 Websense, Inc. Method and system for protection against information stealing software
US9495539B2 (en) 2008-03-19 2016-11-15 Websense, Llc Method and system for protection against information stealing software
US9015842B2 (en) 2008-03-19 2015-04-21 Websense, Inc. Method and system for protection against information stealing software
US9378282B2 (en) 2008-06-30 2016-06-28 Raytheon Company System and method for dynamic and real-time categorization of webpages
US9130972B2 (en) 2009-05-26 2015-09-08 Websense, Inc. Systems and methods for efficient detection of fingerprinted data and information
US9117054B2 (en) 2012-12-21 2015-08-25 Websense, Inc. Method and aparatus for presence based resource management
US10044715B2 (en) 2012-12-21 2018-08-07 Forcepoint Llc Method and apparatus for presence based resource management
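CN112738148A (zh) * 2019-10-28 2021-04-30 中兴通讯股份有限公司 Method, apparatus, device and readable storage medium for batch deletion of cached content
CN112738148B (zh) * 2019-10-28 2024-05-14 中兴通讯股份有限公司 Method, apparatus, device and readable storage medium for batch deletion of cached content
CN112804373A (zh) * 2020-12-30 2021-05-14 微医云(杭州)控股有限公司 Interface domain name determination method, apparatus, electronic device and storage medium
CN113810471A (zh) * 2021-08-18 2021-12-17 深圳市元征科技股份有限公司 Data transmission method, sending device and receiving device
CN113810471B (zh) * 2021-08-18 2024-05-14 深圳市元征科技股份有限公司 Data transmission method, sending device and receiving device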

Also Published As

Publication number Publication date
CA2577259A1 (fr) 2006-03-16

Similar Documents

Publication Publication Date Title
US7590716B2 (en) System, method and apparatus for use in monitoring or controlling internet access
US8024471B2 (en) System, method and apparatus for use in monitoring or controlling internet access
US8141147B2 (en) System, method and apparatus for use in monitoring or controlling internet access
WO2006027590A1 (fr) System, method and apparatus for use in monitoring or controlling internet access
WO2018107784A1 (fr) Method and device for detecting webshell
US7506055B2 (en) System and method for filtering of web-based content stored on a proxy cache server
US9692725B2 (en) Systems and methods for using an HTTP-aware client agent
EP1405224B1 (fr) System and method for loading data from an information source into a mobile communication device with transcoding of the data
US6912591B2 (en) System and method for patch enabled data transmissions
US9514243B2 (en) Intelligent caching for requests with query strings
US8856279B2 (en) Method and system for object prediction
KR100293373B1 (ko) Method and system for creating and using a common cache for internetworks
US20040098493A1 (en) Web page access
WO2006027589A1 (fr) System, method and apparatus for use in monitoring or controlling internet access
JP4988307B2 (ja) Context-based navigation
KR20190053170A (ko) System and method for suppressing DNS requests
WO2006027600A1 (fr) System, method and apparatus for use in monitoring or controlling internet access
Chandranmenon et al. Reducing web latency using reference point caching
WO2003083612A2 (fr) System and method for optimizing internet applications
EP2141891A2 (fr) Single point of entry server solution for web annotation services with reduced latency
Berners-Lee What W3 needs from WAIS and x. 500
Bergner IMPROVING NETWORK

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2577259

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase