GB2528030A - Internet Domain categorization - Google Patents

Internet Domain categorization

Info

Publication number
GB2528030A
GB2528030A GB1408662.3A GB201408662A
Authority
GB
United Kingdom
Prior art keywords
network
user
domain
domains
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1408662.3A
Other versions
GB201408662D0 (en)
Inventor
Austin Elias Leirvik
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AFFECTV Ltd
Original Assignee
AFFECTV Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AFFECTV Ltd filed Critical AFFECTV Ltd
Priority to GB1408662.3A priority Critical patent/GB2528030A/en
Publication of GB201408662D0 publication Critical patent/GB201408662D0/en
Publication of GB2528030A publication Critical patent/GB2528030A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement

Abstract

Categorizing a plurality of domain identifiers of network domains using a neural network having a hidden layer connected to an output layer. A plurality of sequences of training data is accessed, each representing a respective user's browsing history and constituting a sequence of domain identifiers of network domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed. The neural network is trained to model relationships between different domain identifiers based on their positions in the sequences of training data relative to one another, the step of training comprising modifying parameters of the hidden layer. Semantic vectors are assigned to the plurality of domain identifiers based on the modified parameters, and categories are assigned to the plurality of domain identifiers by matching their assigned semantic vectors. The neural network may be trained to model the association based only on the conveyed order of the domain identifiers rather than on any content obtained from the identified domains. Website categorization can therefore be achieved without looking at the websites themselves, but rather by looking at patterns of website visits.

Description

INTERNET DOMAIN CATEGORIZATION
Technical Field
The present invention is in the field of machine-learning.
Background
Classification in the present context refers to the task of assigning a set of objects (e.g. sets of data) to one or more categories, whereby different objects that exhibit a degree of similarity are assigned to the same category or categories. Each object may be represented as a set of n numerical features in the form of a feature vector defining a point in an n-dimensional feature space (object point), and categories assigned by matching feature vectors of different objects such that objects having geometrically similar feature vectors are assigned to the same category or categories.
One example of an algorithm that may be used to this end is a K-means clustering algorithm, whereby K different categories are represented by K points in the feature space (category points) and each object is assigned to the category having a category point closest to that object's object point in the feature space. The category points are initially disposed at random locations in the feature space, and initial assignments made accordingly such that a cluster of objects is assigned to each category point. The category points are then moved to the geometric center of their respective clusters, and the assignments of objects updated accordingly based on the new category point locations. The algorithm repeats until substantially no further reassignments are necessary.
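By way of illustration only, the following is a minimal sketch of the assign-and-update loop just described, written in Python/NumPy; the random object points, the value of K and the convergence test are illustrative assumptions rather than details taken from this disclosure:
    import numpy as np

    def kmeans(points, k, max_iters=100, seed=0):
        """Minimal K-means: assign each object point to its nearest category point,
        then move each category point to the centre of its cluster, and repeat."""
        rng = np.random.default_rng(seed)
        # Category points are initially disposed at random locations in the feature space.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        assignments = None
        for _ in range(max_iters):
            # Assign each object to the category whose category point is closest.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_assignments = distances.argmin(axis=1)
            if assignments is not None and np.array_equal(new_assignments, assignments):
                break  # substantially no further reassignments are necessary
            assignments = new_assignments
            # Move each category point to the geometric centre of its cluster.
            for j in range(k):
                members = points[assignments == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return assignments, centroids

    # Illustrative use: 100 random object points in a 5-dimensional feature space.
    objects = np.random.default_rng(1).random((100, 5))
    labels, centres = kmeans(objects, k=3)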
For example, each object may represent webpage content (e.g. textual content) of a particular respective website. Web page content of different websites may be hosted at different network domains of a network identified by associated domain names (an example of a domain identifier). The content may be "scraped" from each domain by crawling the web using known web crawling/scraping techniques.
Each of the n components of the feature vector representing a particular website may represent a particular keyword occurring in the scraped content of that website and have a numerical value which conveys an observed probability of that keyword occurring in that content. The n keywords may for instance be the most frequently occurring words across all the considered documents. Websites that use similar words will thus have similar feature vectors (nearby object points) and are therefore likely to be assigned to the same category or categories.
Summary
The inventors have recognized a number of drawbacks of existing internet domain categorization techniques that categorize websites based on their content. Firstly, such techniques are typically slow, computationally expensive and limited by feature extraction algorithms, i.e. algorithms that transform raw HTML code returned by the scraper to a set of features that can be represented in a vector.
Browsing logs may contain hundreds of millions of URLs, so it is expensive, both in terms of computational time and resources, to maintain the servers and software necessary to crawl all those pages. The content extracted from every single page must then go through feature extraction and classification, which adds more time and complexity: e.g. typical steps include extracting human-readable text from the HTML code, removing words with low semantic value (e.g. "the", "an", "of"), reducing each word to a representative token (e.g. ran, runs, running -> run), and applying an algorithm like a TF-IDF ("term frequency-inverse document frequency") algorithm to determine the most representative keywords for each page. Particularly for a large number of domains, this incurs significant computational cost.
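By way of illustration, the keyword-extraction stage of such a conventional content-based pipeline might look as follows; this is a hedged sketch using scikit-learn's TfidfVectorizer, and the toy "scraped" documents, the stop-word handling and the omission of a stemming step are assumptions made purely for illustration:
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Raw text would normally be scraped from each domain and stripped of HTML;
    # here two toy documents stand in for the scraped page content.
    scraped_content = {
        "knitting.example": "knitting wool patterns yarn needles knitting",
        "cycling.example":  "cycling bikes road racing wheels cycling",
    }

    # TF-IDF turns each document into a keyword feature vector; common low-value
    # words ("the", "an", "of") are dropped via the built-in English stop list.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
    feature_matrix = vectorizer.fit_transform(scraped_content.values())

    # Each row is one website's object point in the n-dimensional feature space.
    print(feature_matrix.shape, vectorizer.get_feature_names_out()[:10])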
Secondly, there is no guarantee that websites which relate to the same or similar topics will in fact have much, if any, textual content in common. For instance, two websites which relate to the same topic but which are written in different languages may share substantially none of the same words, thus will have dissimilar feature vectors and are unlikely to be placed in the same category. That is, the known categorization algorithms are unlikely to recognize the inherent, conceptual similarity between these sites as that similarity is not captured in the keyword-based feature vectors.
The inventors have also realized that user-level browser logs provide an additional source of information about similarities between content hosted in different network domains (e.g. website content) for the following reasons. In practice, users are likely to focus their browsing activity at times on specific topics, and are thus likely at some point to access a number of domains hosting conceptually similar content (e.g. relating to the same topic) substantially one-after-the-other (that is, to access at most a small number of websites relating to different topics in between). For example, a web user having an interest in cycling is likely at some point to visit a number of different websites relating in some respect to cycling substantially one-after-the-other.
According to a first aspect, the present invention is directed to a computer implemented method of categorizing a plurality of domain identifiers of network domains using a neural network having a hidden layer connected to an output layer, the method comprising: accessing a plurality of sequences of training data, each representing a respective user's historical browsing history and constituting a sequence of domain identifiers of network domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed; training the neural network to model relationships between different domain identifiers based on their positions in the sequences of training data relative to one another, the step of training comprising modifying parameters of the hidden layer; assigning semantic vectors to the plurality of domain identifiers based on the modified parameters of the hidden layer of the trained neural network; and assigning categories to the plurality of domain identifiers by matching their assigned semantic vectors. It has been observed that semantic vectors assigned in this way accurately capture conceptual similarities between different domain identifiers that are reflected in the historical browsing histories. More specifically, it has been observed that, when conceptual similarities between different domain names are captured in the relative ordering of those domains in the user-level browsing history training sequences, those conceptual similarities are realized as geometric similarities between their semantic feature vectors in their semantic feature space, or at least in a restricted subspace thereof. As will be appreciated, these semantic vectors are thus highly suitable for use in categorization.
The inventors have recognized that the relative ordering of accessed domains in users' browser logs is, in itself, enough information to make meaningful and accurate categorizations of those domains. No consideration need be given to any content hosted at those domains, and moreover the relative ordering alone captures sufficient information for the neural network to be able to associate conceptually similar domain identifiers; for instance, no consideration need be given to the identity of the users (so anonymized browser logs can be used), the actual times or dates at which those domains were visited etc. Further, the present invention advantageously does not require the use of crawling or scraping algorithms, which are typically slow and require both computational resources and network bandwidth to implement.
Overall, the present invention results in significant savings in terms of computational complexity and processing time as compared with known website categorization techniques. Because URLs can be reduced to domains and because the present invention does not have the overhead of web scraping, feature extraction, or training and deploying a standard text classifier, it is possible, as an example, to process an entire month's worth of browse logs from users at the same time on a single machine. This contrasts with the several hours of processing time every day on a network of machines required using the traditional method.
Moreover, a further advantage is that, in contrast to keyword-based categorization techniques, the present invention is capable of recognizing similarities between internet domains even when they host content that does not share textual similarities, e.g. because it is written in different languages or for some other reason, or even when they do not host textual content at all (and only host e.g. audio, video and/or image content).
This is because the present invention does not require consideration of the content itself; rather, network domains are categorized by looking at the users who access that content.
In embodiments, the step of training may comprise optimizing a modelled likelihood of the sequences of training data, the step of optimising comprising modifying the parameters of the hidden layer. The likelihood that is optimized may be a modelled log-likelihood of the training data.
The neural network may be a skip-gram neural network, the output layer being configured to compute modelled skip-gram probabilities for different skip-gram pairs of domain identifiers, and the likelihood that is optimized is a modelled likelihood of a set of observed skip-grams observed in the training sequences.
The output layer may be operable to determine substantially softmax probabilities based on the hidden layer parameters. The output layer may be configured to determine hierarchical softmax probabilities.
The hidden layer may comprise between 50 and 500 nodes, the semantic vectors thereby having substantially that number of dimensions. The step of categorizing may comprise performing a clustering algorithm on the semantic vectors.
Each of the domain identifiers may be a domain name.
The method may further comprise, for at least one category, processing domain identifiers assigned to that category to detect a shared attribute exhibited by at least some of those domain identifiers, and automatically assigning a category label to that category based on the shared attribute. The shared attribute may be a sequence of characters that is present in each of the at least some domains.
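One possible way of detecting such a shared character sequence and deriving a label from it is sketched below; the longest-common-substring approach and the example domain names are illustrative assumptions and are not prescribed by this disclosure:
    def longest_common_substring(names):
        """Return the longest sequence of characters present in every given name."""
        shortest = min(names, key=len)
        best = ""
        for i in range(len(shortest)):
            for j in range(i + len(best) + 1, len(shortest) + 1):
                candidate = shortest[i:j]
                if all(candidate in name for name in names):
                    best = candidate
                else:
                    break
        return best

    # Illustrative: members of one category that share the character sequence "knit".
    category_members = ["knittingdaily.com", "knitpatterns.co.uk", "learntoknit.net"]
    print(longest_common_substring(category_members))  # -> "knit"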
The users' historical browsing histories may be anonymized historical browsing histories.
The neural network may be trained to model said association based only on the domain identifiers and their conveyed order and not based on any content obtained from the identified domains.
The hidden layer may be a projection layer.
The hidden layer may be implemented as a lookup table configured to map each of the plurality of domains to its respective semantic vector representation.
In embodiments, there is provided a method of delivering targeted content to a current user of a network comprising a plurality of network domains identified by a plurality of domain identifiers, the method comprising: in a training phase, performing any of the categorization methods disclosed herein to assign categories to the plurality of domain identifiers; in a live phase: detecting current browsing activity in the network by the current user, the current browsing activity at a user device associated with the current user and comprising the user accessing at least one of the identified network domains; identifying at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; selecting content for delivery to the current user based on the identified category; and transmitting the selected content to the user device for outputting to the current user.
In embodiments, there is provided a method of delivering targeted content to a user of a network comprising a plurality of network domains identified by a plurality of domain identifiers, the method comprising: in a training phase, performing any of the categorization methods disclosed herein to assign categories to the plurality of domain identifiers; in a live phase: accessing a browsing history of the user, the browsing history identifying at least one network domain that has been accessed by that user; identifying at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; assigning the user to at least one interest group based on the identified category; selecting content for delivery to the user based on the assigned interest group; and transmitting the selected content to a user device associated with the user for outputting to the user.
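A highly simplified sketch of the live-phase selection step described in the two delivery methods above follows; the lookup tables, domain names and content identifiers are purely illustrative assumptions:
    # Assumed outputs of the training/categorization phase (illustrative values only).
    domain_to_category = {"knitting.com": "textile crafts", "cycling.example": "cycling"}
    category_to_content = {"textile crafts": "yarn_ad.html", "cycling": "bike_ad.html"}

    def select_content_for_visit(visited_domain):
        """Live phase: map the domain the current user is accessing to its assigned
        category and pick content appropriate to that category (None if unknown)."""
        category = domain_to_category.get(visited_domain)
        return category_to_content.get(category)

    print(select_content_for_visit("knitting.com"))  # -> yarn_ad.html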
According to a second aspect, the present invention is directed to a computer readable medium storing code configured, when executed, to implement any of the methods disclosed herein.
According to a third aspect, the present invention is directed to a computer system comprising: computer storage holding a plurality of sequences of training data, each representing a respective user's historical browsing history and constituting a sequence of domain identifiers of network domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed; one or more processors configured in a categorization phase to train a neural network, having a hidden layer connected to an output layer, to model relationships between different domain identifiers based on their positions in the sequences of training data relative to one another, the step of training comprising modifying parameters of the hidden layer; to assign semantic vectors to the plurality of domain identifiers based on the modified parameters of the hidden layer of the trained neural network; and to assign categories to the plurality of domain identifiers by matching their assigned semantic vectors.
In embodiments, the computer system may further comprise a network interface for connecting to a network comprising the plurality of network domains; wherein the processors are configured in a live phase to detect via the network interface a current browsing activity in the network by a current user of the network, the current browsing activity at a user device associated with the current user and comprising the user accessing at least one of the identified network domains; to identify a category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; and to select content for delivery to the current user based on the identified category; and wherein the network interface is configured to transmit the selected content to the user device of the current user.
In embodiments, the computer system may further comprise a network interface for connecting to a network comprising the plurality of network domains; wherein the processors are configured in a live phase to access a browsing history of a user, the browsing history identifying at least one network domain that has been accessed by that user; to identify at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; to assign the user to at least one interest group based on the identified category; and to select content for delivery to the user based on the assigned interest group; and wherein the computer system comprises a network interface configured to transmit the selected content to a user device associated with the user for outputting to the user.
Brief Description of Figures
For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made to the following drawings in which:
Figure 1 is a schematic illustration of a computer system;
Figure 2 is a schematic block diagram of a server;
Figure 3A is a functional block diagram of a neuron for use in a neural network;
Figure 3 is a functional block diagram of a neural network;
Figure 4 is a schematic illustration of a vocabulary of internet domains;
Figure 5 is a schematic illustration of user-level browsing histories for different web users;
Figure 6A is a flow chart for a method of categorizing internet domains using a neural network;
Figure 6B is a schematic illustration of some of the steps of the method of figure 6A;
Figure 7 is a flow chart for a clustering algorithm;
Figure 8 is a schematic illustration of an exemplary application of the clustering algorithm of figure 7;
Figure 9 is a flow chart for a method of delivering targeted content to a user of a network.
Detailed Description
Embodiments will now be described by way of example only.
Figure 1 is a schematic illustration of an exemplary computer system comprising a control server 20, and first and second web servers 22a, 22b. The servers 20, 22a, 22b form part of a packet-based computer network 14 such as the Internet. The system also comprises a plurality of user devices 18a, 18b connected to the network 14. The network 14 implements a naming system, such as the hierarchical Domain Name System (DNS), whereby different parts of the network (referred to herein as "network domains" or simply "domains") are identified by corresponding domain names e.g. in the form "www.example.com". A domain name is one example of a domain identifier. Within the network 14, each domain name is mapped to the corresponding network domain at one or more domain name servers (not shown).
Specifically, each domain name is mapped to one or more network addresses, such as Internet Protocol (IP) addresses, that provide access to the corresponding network domain by way of requests directed to its network address(es).
In the example of figure 1, within the network 14, a first domain name of a first network domain 24a, comprising the first web server 22a, is mapped to a first network address of the first web server 22a; a second domain name of a second network domain 24b, comprising the second web server 22b, is mapped to a second network address of the second web server 22b. In response to a request message comprising the first (resp. second) domain name, the network returns the first (resp. second) network address which can be used by the requestor to access first (resp. second) content (e.g. website content) of the first (resp. second) domain from the first (resp. second) web server. The first (resp. second) domain is said to host that first (resp. second) content, and the first (resp. second) domain name can be said to be mapped to and/or associated with that first (resp. second) content.
Each user device 18a, 18b is associated with and accessible to a respective user 16a, 16b ("user A", "user B"), thereby allowing the users 16a, 16b to access the network 14. In particular, each user device 18a, 18b executes a respective browser application 19a, 19b for accessing internet domains. Each browser application has a user interface into which a user can input a domain name to be supplied to the network 14 in a request message, responsive to which the network 14 returns a network address of the identified domain which the browser then uses to access that domain e.g. to download content.
The control server 20 collects historical browsing history data for the plurality of users 16a, 16b. The browsing history data identifies which domains (e.g. 22a, 22b) have been visited by which users (e.g. 16a, 16b) at various past points in time. For each set of identified domains visited by a particular user, the browsing history data conveys the relative order in which those domains were visited by that user. For instance, the browsing history data may identify times at which each of those domains was visited, from which it is possible to determine the order in which they were visited. The control server 20 is configured to process the collected browsing history to compile, for each user u (e.g. a, b) under consideration, a sequence of domain names having positions in that sequence which reflect the order in which the identified domains were accessed.
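A minimal sketch of this compilation step is given below (Python); the log-record format, the field names and the use of a per-user timestamp sort are assumptions made for illustration and are not taken from this disclosure:
    from collections import defaultdict
    from urllib.parse import urlparse

    # Each collected log record identifies a user, a visited URL and a visit time.
    log_records = [
        {"user": "a", "url": "https://knitting.example/patterns", "timestamp": 1400000000},
        {"user": "a", "url": "https://wool.example/shop",         "timestamp": 1400000060},
        {"user": "b", "url": "https://wool.example/",             "timestamp": 1400000100},
        {"user": "b", "url": "https://news.example/sport",        "timestamp": 1400000200},
        {"user": "b", "url": "https://knitting.example/",         "timestamp": 1400000300},
    ]

    # Reduce each URL to its domain name and order each user's visits by time,
    # giving one sequence of domain names per user (the training sequences w(u)).
    sequences = defaultdict(list)
    for record in sorted(log_records, key=lambda r: (r["user"], r["timestamp"])):
        sequences[record["user"]].append(urlparse(record["url"]).netloc)

    print(dict(sequences))
    # {'a': ['knitting.example', 'wool.example'],
    #  'b': ['wool.example', 'news.example', 'knitting.example']}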
The control server also provides website content (e.g. advertising content or "ads") embedded in web pages e.g. hosted by the web servers 22a, 22b etc. Browsing data can be collected using browser cookies. A cookie is a small piece of data sent from a server and stored in the user's browser while the user is browsing a website. The cookie is first placed on a user's browser when s/he first visits a particular website, and remains stored on the browser until the user actively clears his/her browser cookies. When the user returns to the site that placed the cookie, the website server will read its cookie from the user's browser; that is, in effect the server fetches the cookie from the browser. In this case, when the control server provides embedded website content, an associated cookie is sent from (or sent back to) the control server 20. Thus whenever an e.g. ad is served on a website, the associated cookie that exists on the user's browser lets the control server 20 keep track of which websites this user has accessed before. Third parties (e.g. operators of the domains 24a, 24b etc.) may also provide anonymized browsing data from their own cookies.
Figure 1 shows only a single control server, two users at two user devices and two web servers in two network domains for the sake of clarity. However, it will be appreciated that the system 1 will typically comprise many more users and user devices, and many more web servers in many more network domains. Further, whilst the control server and web servers are depicted as single entities, it will be appreciated that their respective functionality may be distributed across multiple servers.
Moreover, one web server may serve (form part of) more than one network domain, i.e. hold content of different domains and be accessible using different domain names.
Different network domains may host different content (e.g. website content) that has conceptual similarities, e.g. multiple domains may host respective content that relates to one or more common topics (e.g. news, sport, food, social media etc.).
Thus, domain names can be meaningfully categorized to reflect those conceptual similarities, e.g. by assigning domain names mapped to respective content relating to a common topic to a category representing that topic. As indicated, previous solutions have focused on categorization methods that require consideration of that content itself, such as keyword-based categorization methods.
In contrast, the present disclosure recognizes that such meaningful conceptual similarities between domain names can be identified without having to consider the content itself at all. This is because these conceptual similarities can be inferred from the behaviour of users (e.g. 16a, 16b) who access that content, that is, users who "visit" (access) the corresponding domains as part of their browsing activity. In particular, the disclosure recognizes that on average users will, at times, visit groups of related websites relating to similar topics (or exhibiting some other conceptual similarities) one-after-the-other, or at least with substantially few unrelated websites being visited in between. For instance, two different domains may host two different websites which are both popular knitting-related websites; if user-level historical browsing history data is collected for a large number of users, that browsing history data can be expected to show a statistically significant number of those users as having visited those two websites in quick succession. That is, if each of the users' browsing histories is represented as a sequence (ordered list) of domain names, ordered in the order in which the corresponding domains were accessed, it can be expected that at various places in those sequences (enough to be statistically significant), the domain names of the two popular knitting websites will appear at adjacent or nearly-adjacent positions in those sequences.
In other words, the disclosure recognizes that meaningful (e.g.) website categorization can be achieved without looking at the websites themselves, but rather by looking at the people looking at those websites (i.e. by looking at historical user-level browsing histories). Thus, user-level browse logs are used to infer website similarities from patterns of website visits.
Figure 5 shows examples of historical user-level browsing histories for users A and B 16a, 16b, represented as sequences w(a) = w1(a), ..., wi(a), wi+1(a), ... and w(b) = w1(b), ..., wi(b), wi+1(b), ... of domain names visited by user A and user B respectively, ordered to convey an observed order in which the identified domains were visited (as observed from the collected historical browsing history data). In other words, a relative time at which each domain was visited by that user is conveyed by its position in that sequence, and that position can be considered to represent a relative time at which that domain was visited.
For example, consider the browsing history of user A. A second domain name wi+1(a) appearing at position "i+1" in the sequence w(a) (e.g. "wool.fr") represents a second domain that was accessed (at relative time "i+1") immediately following a first domain identified by a first domain name wi(a) (e.g. "knitting.com") appearing at position "i" in that sequence w(a) (which was accessed at relative time "i").
In the example of figure 5, it can be seen that user B also visited "wool.fr" and "knitting.com", at relative times "i" and "i+2" respectively (i.e. they visited "wool.fr", then another website, then "knitting.com"). Thus, even by considering only the orders in which users A and B visited various domains, it can be inferred that "knitting.com" and "wool.fr" might be conceptually similar on the basis that they appear near to each other in more than one sequence.
When extended to large numbers of users, observations of particular domain names frequently occurring at substantially contiguous positions in a large number of browsing history sequences can be used to infer conceptual similarities between those domain names with a high level of statistical certainty.
Two domain names wi = v(k) and wi+j = v(k') appearing at positions "i" and "i+j" in a browsing history sequence are said to have a relative separation "j". For the reasons discussed above, where that relative separation is small, this indicates a possible conceptual similarity (association) between v(k) and v(k'); where those domain names v(k) and v(k') frequently appear in browsing sequences having small relative separations, this indicates a conceptual similarity (association) between those domain names with a high level of statistical certainty.
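The notion of relative separation within a limited window can be made concrete as follows; this is a hedged Python sketch in which the window half-size C = 2 and the example sequence are illustrative assumptions:
    def skipgram_pairs(sequence, C=2):
        """Yield (domain, context_domain, j) for every pair of domains in the sequence
        whose relative separation j satisfies -C <= j <= C, j != 0."""
        for i, centre in enumerate(sequence):
            for j in range(-C, C + 1):
                if j != 0 and 0 <= i + j < len(sequence):
                    yield centre, sequence[i + j], j

    # Illustrative browsing-history sequence for one user.
    history = ["knitting.com", "wool.fr", "news.example", "knitting.com"]
    for centre, context, j in skipgram_pairs(history, C=2):
        print(f"{centre} -> {context} (separation {j})")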
Figure 2 is a schematic block diagram of the control server 20 which constitutes a computer system. As shown in figure 2, the control server 20 comprises a processor 22, e.g. in the form of one or more central processing units (CPUs), a memory 24 in the form of a non-transitory computer readable storage medium such as a magnetic or electronic memory, and a network interface 28. The memory 24 and the network interface 28 are connected to the processor 22. Other control servers may comprise multiple processors and/or the functionality of the control server may be distributed across multiple servers that constitute the computer system in this case.
The network interface is for establishing a network connection by which the server 20 can connect to the network 14. In connecting to the network 14, the server 20 becomes part of the network 14.
The memory 24 holds the collected historical browsing history data mentioned above (e.g. represented as the sequences w(a), w(b) of figure 5) and also holds code 26 for execution on the processor 22. Among other things, the code 26 is operable when executed (in a first, categorization phase) to process the collected browsing history data. In particular, the code 26 is configured when executed to process the user browsing history data in order to assign categories to different network domains (e.g. 24a, 24b) identified as having been visited by users (e.g. 16a, 16b) in the browsing history data. This is described in detail below.
The memory 24 also holds content, such as audio, video, image and/or text content, that can be transmitted to the user devices 18a, 18b for delivery to the users 16a, 16b. The code 26 is also operable when executed (in a second, live phase) to detect current browsing activity by a current user (e.g. 16a, 16b or some other user) occurring at a user device at a current point in time (e.g. 18a, 18b or some other user device) connected to the network 14, and to select content for transmission to that user based on both the detected current browsing activity and the historical browsing history data collected for multiple users. In particular, the code 26 can detect when the current user accesses a network domain (or domains) that has (have) already been categorized using the historical browsing history data, and selects content which is appropriate to the category (or categories) assigned to that domain (or those domains) for transmission to the current user. For example, where the user accesses a domain that has already been assigned a category identified as relating to knitting based on the historical browsing history data, knitting-related content (e.g. advertising content) may be selected for transmission to the user on that basis.
In practice, an ad server may only connect to a so-called "demand-side platform", which is a real-time exchange platform for online advertising impressions. Every time any user visits a webpage which is signed up to the exchange, an auction is held in real-time among advertisers who bid on this particular impression opportunity. If the advertiser wins the auction, the user will be shown their ad and that advertiser can log the site the user is currently seen to be on.
The mechanisms by which the categorization phase and live phase operations are achieved will now be described in detail.
In the categorization phase, network domains are categorized in part using an artificial neural network (equivalently referred to herein as a "neural network" or "neural net"). As is known in the art, neural networks are computation models capable of machine learning.
Neural networks are typically represented as a system of interconnected neurons, each neuron having one or more inputs and an output and representing a function that is performed on those input(s) to generate that output. Input neurons typically receive an input and supply an output to other neurons in the network. Output neurons have inputs coupled to outputs of other neurons in the network and supply their generated outputs as the output of the neural network itself. Hidden neurons have both inputs and outputs which are coupled to other neurons in the neural network.
Neural networks may be arranged in layers, each layer typically comprising multiple neurons. The input neurons constitute an input layer and the output neurons constitute an output layer. In general, the input layer is passive in that it does not modify its inputs and simply supplies them to another layer in the neural network as they are received (i.e. each input neuron can be considered as implementing the identity function). In contrast, the output layer does typically perform non-identity (and possibly nonlinear) functions on its inputs.
Any hidden neurons having inputs coupled to the outputs of the input layer constitute a first hidden layer, any hidden neurons having inputs coupled to the outputs of the first hidden layer constitute a second hidden layer, etc. A "hidden layer" is any layer between the input and output layers. Hidden layers may perform nonlinear functions on their inputs, but not necessarily.
An example of an individual neuron n is shown in figure 3A. As shown, the neuron n has K inputs and one output. Each of the inputs receives a respective component x1, ..., xk, ..., xK of a K-dimensional vector x; that is, the neuron n receives K different values x1, ..., xK as inputs. The output is a single value z. Conventionally, the output of the neuron n is equal to a function f (linear or non-linear) of a weighted sum of its inputs, z = f(an · x), where an = (an,1, ..., an,K) is a K-dimensional weight vector associated with that neuron and an · x is the sum of the components of x weighted by an, i.e. Σk xk an,k.
For output and hidden neurons in a neural network, K is typically greater than 1 for at least some of the neurons. For an input neuron, K is typically 1 and f is the identity as discussed. In this case, it will be appreciated that whilst mathematically the output of an input neuron can be defined in terms of a weighted sum of a one-dimensional input with unitary weighting factor, in practice it is not necessary to actually compute a summation. In some situations, this also holds for non-input neurons in a neural network, including those described below.
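A minimal sketch of this single-neuron computation follows (NumPy); the choice of tanh as the function f and the numerical values are illustrative assumptions:
    import numpy as np

    def neuron_output(x, a, f=np.tanh):
        """Output of a single neuron n: z = f(a . x), i.e. a weighted sum of the
        K inputs x passed through a (possibly nonlinear) function f."""
        return f(np.dot(a, x))

    x = np.array([0.2, 0.5, 0.1])   # K = 3 input values
    a = np.array([0.4, -0.3, 0.9])  # the neuron's K-dimensional weight vector
    z = neuron_output(x, a)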
A neural network "teams" by way of a training procedure based on training data. A neural network can typically receive one of a number of possible Inputs at the input layer from which it generates a respective output at the output layer by performing operations on that input. Each input may represent some data element that occurs in the training data (e.g. a keyword in a vocabulary) the operations performed by the neural network are modified during the training until the outputs matches statistical observations of some other part(s) of the training data (e.g. probabIlity distributions relating to the Input keyword). This is typically achieved by estimating a cost function defined over the neural network outputs which encodes observed properties of the training data.
For instance, where the input encodes a word "W" that might appear in a sequence of words used to train the neural network (training sequence), the neural network may be trained until the output matches a probability distribution for the word immediately following "W". An extremely simplified example would be if the word "W" appeared 30 times in total at various positions in the training sequences, and is immediately followed by the word "X" 20 of those times and by the word "Y" the remaining 10 of those times; in that case, the neural network may be trained until the output matches a probability distribution whereby the probability of the word immediately following "W" being "X" is (at least approximately) twice as high as the probability of the word immediately following "W" being "Y", and the probability of the word immediately following "W" being anything but "X" or "Y" is substantially lower than either of those probabilities (e.g. substantially zero).
The operations performed by the neural network are defined by the collection of weight vectors (weight matrices) of the output neurons as well as those of any hidden neurons.
The disclosure recognizes that, in order to realize the above-mentioned categorization of domain names based on user-level browsing history in practice in an efficient and accurate manner, a technique which, to date, has only ever been used in the context of Natural Language Processing (NLP) can be adapted to this end; specifically, a "neural net language model" technique. A neural network language model is a language model based on neural networks. A language model is a function that captures characteristics of sequences of words in a natural language.
To date, language models have been exploited e.g. in the context of speech recognition and automated translation; the inventors have recognized that such language modelling techniques can also be used in a different context, namely domain name categorization, wherein these techniques are applied to user-level browsing histories (as opposed to e.g. works of literature as they are in other contexts).
Specifically, the disclosure draws upon a neural network having a "continuous skip-gram architecture" (see below) described in Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean; "Efficient Estimation of Word Representations in Vector Space"; In Proceedings of Workshop at ICLR, 2013 ("Mikolov" hereafter). The skip-gram neural network operates to assign so-called "continuous vectors" to each object in a vocabulary. These are assigned by training the neural network to model associations between those objects based on various data sequences in which those objects occur. Specifically, associations between objects are modelled based on the positions in the sequences at which those objects occur relative to one another. As described in more detail below, once the neural network has been trained, it can be used to assign continuous vectors to each object in the vocabulary. A "word2vec" software tool is currently available (https://code.google.com/p/word2vec/) that implements this paper in C code.
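The same continuous skip-gram model is also available through other implementations; the sketch below uses the gensim Python library (version 4.x API) purely to illustrate how domain-name sequences could be fed to such a model in place of natural-language sentences. The parameter values and the tiny training corpus are assumptions and are not taken from this disclosure:
    from gensim.models import Word2Vec

    # Each "sentence" is one user's browsing history: an ordered list of domain names.
    training_sequences = [
        ["knitting.com", "wool.fr", "yarnshop.example", "knitting.com"],
        ["news.example", "wool.fr", "cycling.example", "knitting.com"],
        ["cycling.example", "bikeparts.example", "news.example"],
    ]

    # sg=1 selects the continuous skip-gram architecture, hs=1 hierarchical softmax;
    # window corresponds to the context-window half-size C, vector_size to |D|.
    model = Word2Vec(
        sentences=training_sequences,
        vector_size=100,   # dimensionality of the semantic (continuous) vectors
        window=2,          # context-window half-size C
        sg=1,              # skip-gram rather than CBOW
        hs=1,              # hierarchical softmax output layer
        min_count=1,
    )

    # The learned projection-layer weights give one semantic vector per domain name.
    vector = model.wv["knitting.com"]
    similar = model.wv.most_similar("knitting.com", topn=3)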
Notably, these continuous vectors capture semantic relationships between different domains when those relationships are reflected in the browsing history training sequences, e.g. with objects that frequently occur in the sequences at adjacent or almost-adjacent positions being assigned similar vectors. It has been observed that, once the skip-gram neural network has been trained and continuous vectors assigned to the domain identifiers in the vocabulary accordingly, conceptually similar domain identifiers have geometrically similar continuous vectors (that is, vectors which are to some extent nearby in the vector space of semantic vectors). Continuous vectors are equivalently referred to herein as "semantic vectors" in a semantic feature space, which is the vector space of the semantic vectors.
Increasing the dimensionality of the semantic feature space allows multiple degrees of similarity to be considered. For example, first and second websites relating respectively to French food and Indian food are conceptually similar in that they both relate to food; however, the latter is also conceptually similar to a third website relating to travel in India (as they both relate to India), whereas the former lacks this particular conceptual similarity with the third website. In this case, it has been observed that, where such similarities are captured in the user-level browsing history, the first and second websites can be expected to have semantic vectors that are nearby in some dimensions of the feature space but not in others, and the second and third websites to have semantic vectors that are also nearby, but in different dimensions of the feature space (and not in others).
Thus, "matching of semantic vectors as used herein does not necessarily mean a match across all dimensions of the vector space which they Inhabit (the semantic feature space) -two semantic vectors can be said to match if their components In a restricted subspace of the semantic feature space are sufficiently nearby, thereby Indicating one degree of conceptual similarity.
In other words, whenever conceptual similarities between different domain names are captured in the relative ordering of those domains in the user-level browsing history training sequences, those conceptual similarities are realized as geometric similarities between their semantic feature vectors in their semantic feature space, or at least in a restricted subspace thereof. The disclosure exploits this by using the semantic vectors to categorize the domain identifiers, e.g. by running a clustering algorithm on the semantic vectors to assign domain identifiers with similar semantic vectors to one or more common categories.
Figure 3 is a functional block diagram representing a neural network 1 used in various embodiments of the present invention.
The neural network 1 has a so-called "continuous skip-gram" architecture as defined in the above-mentioned Mikolov paper and described in detail below. The continuous skip-gram neural network contains three layers:
- An input layer 2 of the neural network 1 is a 1-of-V vector representation of a particular network domain v(k) in a vocabulary V of size |V|.
- A projection layer 4 is a lower-dimensional (lower than the 1-of-V) representation of this domain, having dimension |D| which is less than |V|.
Formally, the lower-dimensional representation for a domain v(k) corresponds to the weights between input neuron k in the |V|-dimensional input vector and the projection layer. The neurons in the input layer do not apply any other function to the inputs they receive; they merely pass them through as output.
For this reason, the projection layer itself can be thought of as the domain's projected representation (i.e. a projected representation of each domain in the vocabulary V). Initially this projection layer representation is initialised randomly for each domain, and is updated iteratively as the neural network 1 is trained (see below).
- An output layer 6 contains one set of |V| neurons for every position in a "context window" (the size of the context window is decided beforehand). The output layer comprises a probability distribution over domains for each position in a window of size 2C around domain wi (e.g. respective probability distributions over wi-2, wi-1, wi+1, wi+2). The "correct" probability distributions (obtained by adjusting parameters of the neural network) are defined by a training set (generated from the browser logs). Each set of |V| neurons applies a softmax function to its inputs in order to model a probability distribution over the |V| possible outputs (predicted probability distributions). In this way, the output layer functions as a collection of log-linear classifiers for predicting the domain at each position in a context window, given a "target" domain at the input layer.
The "projection laye( is so caUed because it receives input with only one active neuron (i.e. the 1ofV vector) and returns an output with multiple active neurons the single active input is thus projected to multiple outputs.
The neural network 1 depicted in figure 3 represents functionality implemented by the code 26 when executed on the processor 22 of the control server 20. The neural network 1 comprises input neurons 3, projection neurons 5, and output neurons 7.
Each of the neurons 3, 5 and 7 can be viewed as a functional block representing a particular function that can be implemented by the code 26.
The neural network 1 comprises |V| input neurons 3(1), ..., 3(k), ..., 3(|V|) (where |V| is an integer number discussed below), each having a respective input and a respective output. The input neurons 3 constitute the input layer 2 of the neural network 1.
The neural network 1 also comprises |D| projection neurons 5(1), ..., 5(m), ..., 5(|D|) (where |D| is another integer number also discussed below). Each of the projection neurons 5 has a respective output and |V| respective inputs, each of those |V| inputs of that projection neuron coupled to the output of a different input neuron 3. The projection neurons 5 constitute the projection layer 4 of the neural network 1 (one example of a hidden layer).
The neural network 1 also comprises |V|*2C output neurons 7(1,j), ..., 7(k',j), ..., 7(|V|,j), ..., 7(1,j'), ..., 7(k',j'), ..., 7(|V|,j') (where C is yet another integer number discussed below). Each of the output neurons has a respective output and |D| respective inputs, each of those |D| inputs of that output neuron coupled to the output of a different projection neuron 5. The output neurons 7 constitute the output layer 6 of the neural network 1. The output layer is divided into 2C subsets 6(j), 6(j'), ... of the output layer, each subset comprising output neurons 7(1,j) to 7(|V|,j), 7(1,j') to 7(|V|,j') etc.
The input layer 2 is for receiving inputs to the neural network 1. In this case, an input to the neural network is an individual domain name encoded as a "discrete vector".
The set of all possible domains that can be input to the input layer 2 makes up the vocabulary V of the neural network 1.
Parts of an exemplary vocabulary V are shown in figure 4. The vocabulary V has a size equal to the number of possible domains that can be input to the neural network 1, and comprises individual domains v(k) ∈ V where k = 1, ..., |V|. Each domain name in the vocabulary V is encoded for inputting to the input layer 2 of the neural network 1 using 1-of-V encoding, wherein each domain v(k) ∈ V is encoded as a respective discrete, |V|-dimensional vector yk where each component yk,l is the Kronecker delta δk,l; that is, as a discrete vector yk = (0, ..., 1, ..., 0) wherein each but the kth component of that vector is 0 and the kth component is 1.
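As an illustration of this 1-of-V encoding, the following is a minimal NumPy sketch; the toy vocabulary is an assumption made purely for illustration:
    import numpy as np

    # Toy vocabulary of |V| = 4 domain names, indexed k = 0..3.
    vocabulary = ["knitting.com", "wool.fr", "news.example", "cycling.example"]

    def one_of_v(domain, vocab):
        """Encode a domain as a discrete |V|-dimensional vector: all zeros except
        a 1 at the position of that domain in the vocabulary (Kronecker delta)."""
        y = np.zeros(len(vocab))
        y[vocab.index(domain)] = 1.0
        return y

    print(one_of_v("wool.fr", vocabulary))  # -> [0. 1. 0. 0.]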
A domain v(k) is input to the input layer 2 of the neural network 1 encoded as an input vector yk by supplying each component of the input vector yk to a different one of the input neurons 3 (e.g. the first component yk,1 = 0 to a first input neuron 3(1), the kth component yk,k = 1 to a kth input neuron 3(k), the |V|th component yk,|V| = 0 to a |V|th input neuron 3(|V|)). Each of the input neurons 3 is passive in that it supplies, as its output, its input in the form in which that input is received (that is, unmodified) to each of the |D| projection neurons to which its output is coupled.
Due to the nature of the 1-of-V encoding, it will be appreciated that, for any given input yk to the input layer 2 (that is, for all k = 1, ..., |V|), only the kth input neuron is active; that is, only the kth neuron outputs a 1 whereas every other input neuron outputs a 0. Thus, for any given input yk, each of the |D| projection neurons 5(1), ..., 5(|D|) has only one active input (i.e. equal to 1), that being the kth input of that projection neuron (i.e. the input coupled to the output of the kth input neuron); the remaining |V|-1 inputs are inactive (equal to 0).
Each of the projection neurons 5 in the projection layer 4 outputs a value which is dependent on which of its inputs is active. That is, activating the kth input of each projection neuron 5(m) causes that projection neuron 5(m) to output a value Pk,m (equivalently denoted dk,m herein) to the output layer 6. The set of all possible outputs of each of the |D| projection neurons constitutes a |V|x|D| projection matrix P associated with the projection layer 4.
Parts of an exemplary projection matrix P are illustrated in figure 4. Each row of the projection matrix P is associated with a respective domain name in that, when that domain name is input as an input vector yk, the output of the projection layer 4 (formed of the individual outputs of the projection neurons 5) is the kth row of the projection matrix P, which can be considered a |D|-dimensional vector dk = (Pk,1, ..., Pk,|D|). The components of the projection matrix P (equivalent to the set of |D|-dimensional vectors [dk | k = 1, ..., |V|]) are parameters of the projection layer 4.
The values of the projection matrix P may be stored in memory 24 in a manner that associates each row of the projection matrix P with the corresponding domain name, e.g. as a lookup table.
As will be appreciated, each column of the projection matrix P constitutes a weight vector of a respective projection neuron 5(m). The output of that projection neuron 5(m) is equal to the weighted sum of its inputs, weighted by that column. However, as will also be appreciated, because only one of these inputs is ever active at a given time (due to the 1-of-V encoding), there is never a need to actually compute this summation and e.g. a computationally cheap lookup-type operation that maps the active input to the corresponding output will suffice in practice.
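The following sketch illustrates why a lookup suffices: multiplying a 1-of-V input vector by the projection matrix P simply selects the corresponding row. The dimensions and the randomly initialised values are illustrative assumptions:
    import numpy as np

    V, D = 4, 3                      # |V| domains, |D|-dimensional projection layer
    rng = np.random.default_rng(0)
    P = rng.random((V, D))           # projection matrix, initialised randomly

    k = 1                            # index of the input domain v(k)
    y_k = np.zeros(V)
    y_k[k] = 1.0                     # 1-of-V input vector

    # Full matrix product versus a cheap row lookup: both give d_k, the kth row of P.
    d_k_by_product = y_k @ P
    d_k_by_lookup = P[k]
    assert np.allclose(d_k_by_product, d_k_by_lookup)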
An input vector yk supplied to the input layer causes each of the output neurons 7 to receive as their respective inputs the components of the |D|-dimensional vector dk = (Pk,1, ..., Pk,|D|) generated at the projection layer.
Returning to figure 3, as mentioned, there are 2C subsets 6(j), ..., 6(j'), ... of the output layer 6, each comprising |V| respective output neurons. In this embodiment, the 2C subsets of the output layer pertain to relative separations j = -C, ..., -1, 1, ..., C, where C is a positive-valued integer. C is said to define a "context window" for reasons that will become apparent; each j = -C, ..., -1, 1, ..., C defines a position in that context window and each subset 6(j), 6(j') pertains to a different position j, j' in the context window.
Each output neuron 7(k',j) in each subset 6(j) of the output layer computes a so-called "softmax" probability as a function of its inputs (that are supplied from the projection layer 4). The softmax function as applied to a given input dk by each output neuron 7(k',j) is defined as:
softmaxk',j(dk) = exp(dk · ck',j) / Σk''=1,...,|V| exp(dk · ck'',j) = p(v(k'); j | v(k); θ)    (1)
where ck',j is an output weight vector of dimension |D| of that output neuron 7(k',j).
That is, each output neuron 7(k',j) in the output layer 6 is associated with a respective weight vector ck',j of dimension |D|; these weight vectors constitute parameters of the output layer 6.
The term θ denotes the set of parameters of the neural network 1 represented as a vector of parameters; in this case, the vector θ of the neural network parameters comprises the parameters of the output layer 6 and the parameters of the projection layer 4, i.e. θ = (d1, ..., d|V|, c1,-C, ..., c|V|,C). As is known in the art, the softmax function of equation (1) satisfies certain conditions, such as Σk'=1,...,|V| softmaxk',j(dk) = Σk'=1,...,|V| p(v(k'); j | v(k); θ) = 1 for any given dk, that make it suitable for modelling conditional probabilities.
Here, the output layer softmax functions of equation (1) are used to model probability distributions of observations determined from a collection of user-level browsing histories [w(u) | u ∈ U] for each user u in a set of considered users U, the browsing history w(u) for that user u ∈ U (e.g. 16a, 16b in figure 1) constituting a sequence of training data (training sequence) w(u) = w1(u), ..., wi(u), ... of domain names of network domains visited by that user, having positions in the sequence w(u) that convey the order in which those domains were accessed by that user. Specifically, the softmax function softmaxk',j(dk) is used to model a conditional probability p(v(k'); j | v(k)) of the domain at a position j away from some arbitrary position i in a training sequence being v(k') ∈ V, given that the domain at that arbitrary position i in that training sequence is itself v(k) ∈ V. That is, the probability that a domain in a given sequence of domains, having relative separation "j" from another domain in that given sequence, is v(k') given that said other domain is v(k). This is the reason for adopting the notation p(v(k'); j | v(k); θ) = softmaxk',j(dk), as p(v(k'); j | v(k); θ) represents a model of the 'true' conditional probability distribution p(v(k'); j | v(k)). It is assumed that the conditional probability being modelled does not depend on i but only on the relative separation j of neighbouring domains in the training sequence (i.e. that it only depends on where v(k') and v(k) appear relative to one another).
The output of each of the output layer subsets 6(j), generated from a particular input domain v(k), is thus a modelled conditional probability distribution which can be expressed as an output probability vector P(·; j | v(k); θ) = (p(v(1); j | v(k); θ), ..., p(v(k'); j | v(k); θ), ..., p(v(|V|); j | v(k); θ)), each component defined by the softmax function of equation (1). Thus, the output of the output layer 6 is a collection of 2C conditional probability distributions [P(·; j | v(k); θ) | |j| = 1, ..., C] over domains in V, one distribution for each position j in the context window 2C.
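A minimal numerical sketch of the output-layer computation of equation (1) follows (NumPy); the dimensions and the randomly initialised weights are illustrative assumptions:
    import numpy as np

    V, D = 4, 3                  # |V| domains, |D|-dimensional projection layer
    rng = np.random.default_rng(0)
    d_k = rng.random(D)          # projection-layer output for the input domain v(k)
    c_j = rng.random((V, D))     # output weight vectors ck',j for one window position j

    # Equation (1): softmax over all |V| candidate domains v(k') at window position j.
    scores = c_j @ d_k           # dk . ck',j for every k'
    probs = np.exp(scores) / np.exp(scores).sum()

    # probs models p(v(k'); j | v(k); theta) for k' = 1..|V|; it sums to 1.
    assert np.isclose(probs.sum(), 1.0)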
Because θ is a vector of adjustable parameters, the probability distributions P(·; j | v(k); θ) can be considered as defining a family of probability distributions, each probability distribution in the family being parameterised by a different vector value of θ.
As will be described in detail below, in order to train the neural network 1, the vector components of θ are adjusted until the outputs of the output layer 6 (i.e. the modelled conditional probability distributions for each position in the context window 2C) match the training sequences [w(u) | u ∈ U]. More specifically, a set of "observations" on the training sequences [w(u) | u ∈ U] is defined, and θ is adapted until the conditional probability distributions of the output layer match those observations. In practice, this is achieved by optimizing a suitable cost function (loss function), defined below.
Each domain in a sequence appears in the context of neighbouring domains in that sequence. For instance, for a training sequence of domain names w(u) = w1(u), ..., wi(u), ..., the fact that the domain name at a position "j" away from position "i" is a domain name wi+j(u) ∈ V represents one definable context in which the domain at position "i" (i.e. wi(u)) appears. The term "context window" defined by 2C mentioned above is adopted because C limits the extent to which a domain (e.g. wi+j(u)) in a sequence is considered to define a context in which a neighbouring domain (e.g. wi(u)) appears, i.e. only domains which have a relative separation of no more than C and no less than -C are considered to define contexts for one another.
It should be noted that, in practice, the size |V| of the domain name vocabulary V may be very large (e.g. |V| may be in the order of a million). In terms of computational efficiency, although the input layer is of size |V|, this is not an issue because, as discussed, only one input neuron is ever active at a given time, which effectively reduces the dimensionality of the input layer in practice. However, this is not true of the output layer, which may have massive dimensionality. To reduce this dimensionality, in practice, the softmax computation of the output layer may be simplified by representing the output as a Huffman binary tree and calculating approximate "hierarchical" softmax probabilities instead. This technique is known in the art and is mentioned, for instance, in the Mikolov reference mentioned above.
The idea is to represent the set of possible domains V as a binary tree (e.g. a network of connected nodes in which each node has two children nodes), such that more common domains exist closer to the root node. Huffman binary trees are described e.g. in Huffman, David A., "A method for the construction of minimum redundancy codes", Proceedings of the IRE 40.9 (1952): 1098-1101. Then, rather than encoding the probability of a particular domain, each output neuron encodes the probability of travelling left or right at a given level in order to reach the correct domain from the root node. Since it takes log2(|V|) levels to create a binary tree with |V| items, the number of output units per position is reduced from |V| to ~log2(|V|).
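By way of illustration only, such a Huffman tree over domains can be built from domain visit counts as in the following Python sketch; the function name and the toy counts are assumptions made for the example rather than part of the described method.

import heapq
import itertools

def build_huffman_codes(domain_counts):
    """Build a Huffman binary tree over domains keyed by visit frequency and
    return, for each domain, its root-to-leaf path as a string of 0s and 1s.
    More frequent domains receive shorter paths (they sit nearer the root)."""
    tie = itertools.count()  # tie-breaker so heap entries stay comparable
    # Each heap entry: (frequency, tie-breaker, {domain: partial code})
    heap = [(freq, next(tie), {dom: ""}) for dom, freq in domain_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {d: "0" + c for d, c in left.items()}
        merged.update({d: "1" + c for d, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Toy usage (assumed counts): the most visited domain gets the shortest code.
print(build_huffman_codes({"example1.com": 50, "example2.com": 20,
                           "example3.com": 20, "example4.com": 10}))

Each domain's code length is then approximately log2(|V|), so a prediction requires of the order of log2(|V|) output evaluations rather than |V|.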
Categorization Phase
A method of categorizing domain names v^(k) in a vocabulary V of domain names will now be described with reference to figures 6A and 6B. The former is a flow chart for the method, and the latter a graphical illustration of some of the method steps.
As mentioned, initially the projection layer representation of the vocabulary V (i.e. the projection matrix P) is initialised randomly for each domain, and is updated iteratively as the neural network 1 is trained (see below).
There are |D| weights to be updated at each training iteration between the input and projection layers (not |V|*|D| weights, because the 1-of-V input layer will only ever have one neuron activated at any one time). These weights correspond to the randomly initialized values of the projection matrix P. There are |D|*(2C*|V|) weights to be updated between the projection and output layers.
As indicated, the neural network 1 is trained using sequences of training data (representing user-level browser logs), each sequence representing a respective user's browsing history and constituting a sequence of domain identifiers of internet domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed.
At step S2, anonymized browser logs for different users 16a, 16b etc. are accessed in memory 24 for use in training the neural network 1. As discussed, each browser log for a respective user 16a (resp. 16b) is in the form of a sequence w^(a) = w_1^(a), ..., w_(L(a))^(a) (resp. w^(b) = w_1^(b), ..., w_(L(b))^(b)) etc. of domain names identifying network domains accessed by that user, and constitutes a respective training sequence of training data. That is, the training data (trainingData) used to train the neural network 1 comprises, for each user u in a set of users U for whom browsing history data is available (for each u ∈ U), a sequence of training data (training sequence) w^(u) representing that user's (u) browsing history, i.e. trainingData = [w^(u) | u ∈ U], where w^(u) = w_1^(u), ..., w_(L(u))^(u) is a sequence (ordered list) of domain names having length L(u).
At step S4, the domain name vocabulary V, made up of domain names that appear in the various training sequences, is generated based on the training sequences.
For instance, the vocabulary may comprise all unique domain names that appear in the training sequences w^(a), w^(b) etc. Alternatively, the vocabulary may not comprise all domains that appear in the training sequences, e.g. the vocabulary V may have a predetermined size and the vocabulary may be generated to comprise only the most frequently occurring domain names. Alternatively, the vocabulary V may have a variable size |V|, but be generated to include only domain names that appear more than a threshold number of times in the sequences.
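For illustration only, a minimal Python sketch of step S4 under the "most frequently occurring" option might look as follows; the cap on the vocabulary size, the minimum count and the function names are assumptions made for the example.

from collections import Counter

def build_vocabulary(training_sequences, max_size=1_000_000, min_count=1):
    """Collect domain names from the user-level training sequences and keep
    only the most frequently occurring ones (step S4)."""
    counts = Counter(domain for seq in training_sequences for domain in seq)
    kept = [dom for dom, n in counts.most_common(max_size) if n >= min_count]
    return {dom: idx for idx, dom in enumerate(kept)}  # domain name -> index k

# Toy usage with assumed sequences:
vocab = build_vocabulary([["example5.com", "example3.com", "example9.com"],
                          ["example3.com", "example5.com"]])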
As discussed, the vocabulary V is encoded using 1-of-V encoding, whereby each domain name v^(k) ∈ V is encoded as a respective discrete input vector y_k (see above), which together form a set of discrete input vectors Y. At step S6, the projection matrix P of the neural network projection layer is randomly initialized, as are the weight vectors C_(k';j) of the output layer. That is, the set of parameters θ is initialized with random values.
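Purely as an illustrative sketch of step S6 (|D| = 200 and C = 5 follow values discussed elsewhere in this description, while the toy vocabulary size and the use of numpy are assumptions made for the example):

import numpy as np

V_SIZE = 10_000   # toy |V| for the sketch; in practice |V| may be of the order of a million
D = 200           # |D|, dimensionality of the semantic vectors
C = 5             # context window half-width, giving a window of size 2C

rng = np.random.default_rng(0)
# Projection matrix P: one |D|-dimensional row d_k per domain v^(k).
P = rng.normal(scale=0.01, size=(V_SIZE, D))
# Output layer weight vectors C_(k';j): one |V| x |D| block per relative
# separation j in {-C, ..., -1, 1, ..., C}.
output_weights = {j: rng.normal(scale=0.01, size=(V_SIZE, D))
                  for j in range(-C, C + 1) if j != 0}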
At step S8, a set of observations to be used as a training set (trainingSet) in training the neural network 1 is generated from the training data. The set of observations comprises all observed "skip-gram" pairs within the context window in the training sequences. As is known in the art, a k-skip-n-gram is a length-n subsequence of a sequence having components that occur at distance at most k from each other in that sequence. The set of observations (trainingSet) is defined as a set of observed C-skip-2-grams in the training sequences [w^(u)] for each considered relative separation j: trainingSet = [trainingSet_j : |j| = 1 ... C] and trainingSet_j = [(w_i^(u), w_(i+j)^(u)) : i = 1 ... L(u), u ∈ U], where each skip-gram pair (w_i^(u), w_(i+j)^(u)) is an observation on the training data, and represents a length-2 subsequence of the sequence w^(u).
Specifically, each observation (v^(k), v^(k')) in trainingSet_j represents the fact that the domain v^(k) has been observed at a relative separation j from the domain v^(k') at some location in a training data sequence. The set of observations (trainingSet) is thus a set of all domain pairings observed in the training data (trainingData) for each of the possible relative separations j = -C, ..., -1, 1, ..., C. Where a particular pair (v^(k), v^(k')) occurs N times in trainingSet_j, this represents the fact that the domain v^(k) has been observed at a relative separation j from the domain v^(k') at N occurrences in the various training sequences.
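As an illustration of step S8 only, the skip-gram observations could be gathered as in the following sketch; the function and variable names are assumptions made for the example.

from collections import defaultdict

def build_training_set(training_sequences, C=5):
    """Collect all observed skip-gram pairs (w_i, w_(i+j)) within the context
    window, one collection of observations per relative separation j (step S8)."""
    training_set = defaultdict(list)  # j -> list of (input domain, context domain)
    for seq in training_sequences:
        for i, center in enumerate(seq):
            for j in range(-C, C + 1):
                if j == 0 or not (0 <= i + j < len(seq)):
                    continue
                training_set[j].append((center, seq[i + j]))
    return training_set

# Toy usage with C=1 on the example sequence discussed below:
obs = build_training_set([["example5.com", "example3.com", "example9.com"]], C=1)
# obs[+1] contains ("example3.com", "example9.com") and obs[-1] contains
# ("example3.com", "example5.com"), among others.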
Steps S10-S12 constitute iterative training steps. As explained in more detail below, during training (S10-S12), a cost function in the form of the negative log-likelihood of the training data, namely the user-level browse logs, is minimized (more specifically, the negative log-likelihood of the set of observations (trainingSet) generated from the training data (trainingData) is minimized). The following model parameters are updated during training:
- Weights between the output layer and projection layer (parameters of the output layer).
- Weights between the projection layer and the input layer (parameters of the projection layer).
In this manner, the neural network 1 is trained to model relationships between different domain identifiers based on their positions in the sequences w^(u) of training data relative to one another, so that the probability distribution outputs at the output layer substantially match the sequences of training data. That is, the neural network is trained to model associations between different domain identifiers based on their relative separations in the training sequences. These relative separations are captured in the set of observations (trainingSet), as each observation is a skip-gram domain pair that results because that pair of domains appears somewhere in a training sequence at sufficiently proximate locations (as defined by the window size C); optimizing the negative log-likelihood of the set of observations has the effect of matching the probability distributions at the output layer to the set of observations itself, i.e. so that the modelled (i.e. predicted) probability of observations that occur more times in the set of observations is higher than the modelled (i.e. predicted) probability of observations that occur fewer times in the set of observations.
Typically, the context window size C may be around 5-10. Limiting the context window size C effectively places a limit (e.g. of between about 5 and 10) on the relative separation that can be used to infer an association between different domains, i.e. when two domains appear in a training sequence but at positions more than C apart, this limit prevents an association from being inferred between those domains on that basis.
Advantageously, the inventors have recognized that the actual identity of users is unimportant in this context: it is sufficient for the neural network 1 to be simply 'told' that identified network domains have been visited in particular orders. As indicated, this enables the use of anonymized browser logs, which is advantageous from a privacy point of view. Moreover, the order in which domains were accessed encodes sufficient information for the neural network to recognize conceptually similar domains. That is, the neural network does not have to be told anything other than some information about the relative ordering of domain names in the sequences to be able to model the conceptual similarities. Thus, even if a training sequence is compiled to have an order based on identified times at which domains were accessed, this information is not supplied to the neural network, nor does it need to be retained once that sequence has been compiled.
Thus, the training sequences can be stored in a compact and efficient manner, e.g. as a simple text document consisting of a list of domain identifiers (e.g. domain names, or e.g. numerical identifiers uniquely identifying domain names) and substantially no other information.
The neural network is trained using a maximum likelihood technique. Recall that the softmax function of equation (1) defines a family of probability distributions p(·; j | v^(k); θ) parameterized by θ, which are the outputs of the output layer 6.
The intention of the training is to find a probability distribution in the family, parameterized by some particular set of values of θ, that maximizes the probability of the set of observations (trainingSet). That is, the intention of the training is to maximize a likelihood L(θ | trainingSet) of the set of observations (trainingSet) with respect to θ, where L(θ | trainingSet) = p̂(trainingSet | θ) and p̂(trainingSet | θ) is a softmax-modelled probability of the set of observations (trainingSet) parameterized by the neural network parameters θ. This can be interpreted as, given a family of model probability distributions, finding the probability distribution in that family which predicts the set of observations (trainingSet) with the highest probability.
It is assumed that the observations (v^(k), v^(k')) in the set of observations (in each trainingSet_j) are independent and that the likelihood of the set of observations L(θ | trainingSet) = p̂(trainingSet | θ) can thus be expressed as

L(θ | trainingSet) = p̂(trainingSet | θ) = Π_(u ∈ U) Π_(w_i^(u) ∈ w^(u)) Π_(|j| = 1 ... C) p̂(w_(i+j)^(u); j | w_i^(u); θ)

where w_i^(u) and w_(i+j)^(u) are the domains that occur at positions "i" and "i+j" in the sequence w^(u) representing user u's browsing history; these will be some domains v^(k) and v^(k') in the vocabulary V, and p̂(v^(k'); j | v^(k); θ) will be given by the softmax function of equation (1), i.e. p̂(v^(k'); j | v^(k); θ) = softmax_(k';j)(d_k).
The presence of each term p̂(v^(k); j | v^(k'); θ) in the above effectively encodes the fact that a skip-gram pairing (v^(k); j | v^(k')) has been observed somewhere in the data; the presence of a total of N of these terms (i.e. the presence of (p̂(v^(k); j | v^(k'); θ))^N) effectively encodes the fact that N such skip-gram pairings in total have been observed in the training data as a whole. In other words, the likelihood effectively encodes which domains have been observed sufficiently near to one another to indicate a possible conceptual similarity therebetween, and also how often those domains have been so observed.
In performing an optimization procedure equivalent to maximising the likelihood L(θ | trainingSet), the information encoded in L(θ | trainingSet) is effectively conveyed to the neural network (i.e. the neural network is effectively taught that information). To aid understanding, consider, as a simplified example, a context window C=1 (i.e. only pairs of domains that are immediately adjacent are considered) and an example training sequence "example5.com", "example3.com", "example9.com". In this case, the fact that the domain name "example3.com" occurs in the sequence with the domains "example5.com" and "example9.com" either side results in the observations ("example3.com", "example5.com") and ("example3.com", "example9.com") being included in the set of observations (trainingSet) on the training data (trainingData) comprising the sequences [w^(u)]. This means that the likelihood L(θ | trainingSet) will include the following terms in the overall product:

p̂("example3.com"; (-1) | "example5.com"; θ) * p̂("example3.com"; (+1) | "example9.com"; θ)

among others.
Considered across all training sequences, suppose "example3.com" is immediately preceded by "example5.com" at M different places in the various sequences and that "example3.com" is immediately followed by "example9.com" at N different places in the various sequences [w^(u)]. This will result in the terms

p̂("example3.com"; (-1) | "example5.com"; θ)^M * p̂("example3.com"; (+1) | "example9.com"; θ)^N

in the overall product, among others.
Maximizing the likelihood L(θ | trainingSet) of the set of observations is mathematically equivalent to minimizing a "cost function" defined as the negative log-likelihood ℒ(θ | trainingSet) of the set of observations, which is

ℒ(θ | trainingSet) = -log L(θ | trainingSet) = - Σ_(u ∈ U) Σ_(w_i^(u) ∈ w^(u)) Σ_(|j| = 1 ... C) log p̂(w_(i+j)^(u); j | w_i^(u); θ)    (2)

where p̂(w_(i+j)^(u); j | w_i^(u); θ) is given by the softmax function of equation (1). That is, equation (2) represents a summation of the negative logarithms of outputs from output neurons 7 of the output layer 6. Whilst these are mathematically equivalent, minimizing the negative log-likelihood is computationally cheaper and is thus favoured.
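Purely as an illustrative sketch (not the claimed implementation), the full softmax of equation (1) and the cost function of equation (2) could be evaluated with numpy as follows, assuming the parameter layout of the earlier initialization sketch, the vocabulary mapping of the step S4 sketch and the observation structure of the step S8 sketch.

import numpy as np

def neg_log_softmax(P, output_weights, k, j, k_prime):
    """-log p̂(v^(k'); j | v^(k); θ) for one observation, using the full
    (non-hierarchical) softmax of equation (1)."""
    scores = output_weights[j] @ P[k]   # dot products C_(k';j) . d_k for all k'
    scores -= scores.max()              # numerical stabilisation
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[k_prime]

def negative_log_likelihood(P, output_weights, training_set, vocab):
    """Cost function of equation (2): sum of -log p̂ over all observations."""
    return sum(neg_log_softmax(P, output_weights, vocab[w_i], j, vocab[w_ij])
               for j, pairs in training_set.items()
               for (w_i, w_ij) in pairs)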
At step S10, the parameters of the projection layer 4 (i.e. the projection matrix P) and the parameters of the output layer 6 (i.e. the weight vectors C_(k';j)) are updated to try and match the outputs of the output layer 6 to the set of observations (trainingSet) generated from the training data.
The gradient of the cost function ℒ(θ | trainingSet) is ∇ℒ(θ | trainingSet), where (∇ℒ(θ | trainingSet))_q = ∂ℒ(θ | trainingSet)/∂θ_q for each component θ_q of θ (and θ_q will be either P_km for some values of k, m, i.e. a component of d_k for that value of k, or θ_q will be some component of C_(k';j) for some values of k', j, i.e. a component of the projection layer projection matrix P or a component of an output layer weight vector).
As will be appreciated, the cost function ℒ(θ | trainingSet) is substantially minimized when ∇ℒ(θ | trainingSet) ≈ 0, at which point the outputs of the output layer generated from each input at the input layer (i.e. for each v^(k) ∈ V) will substantially match the set of observations (trainingSet) on the training data (trainingData).
Step S10 can, for instance, be implemented using a gradient descent algorithm, which is a first-order optimization algorithm, as follows: the parameter vector θ (whose components are the components of the projection matrix P and the components of the output layer weight vectors) has a current parameter vector value θ_0 at this point, which is updated to a new parameter vector value θ_1 by an amount Δ that is (at least approximately) proportional to the current value of the negative gradient -∇ℒ(θ | trainingSet) as evaluated for the current parameter vector value, i.e.:

θ_1 = θ_0 + Δ where Δ ∝ (-∇ℒ(θ | trainingSet))|θ_0

Evaluating the current negative gradient -∇ℒ(θ | trainingSet) will involve evaluating the different softmax functions p̂(w_(i+j)^(u); j | w_i^(u); θ) generated at the output layer based on the current parameter vector θ_0.
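A minimal stochastic-gradient sketch of the S10 update, again using the full softmax rather than the hierarchical approximation and with an assumed learning rate, might be:

import numpy as np

def sgd_pass(P, output_weights, vocab, training_set, learning_rate=0.025):
    """One pass of gradient descent over the observed skip-gram pairs,
    updating the projection matrix P (rows d_k) and the output weight
    vectors C_(k';j) so as to reduce the negative log-likelihood (2)."""
    for j, pairs in training_set.items():
        W = output_weights[j]                        # shape (|V|, |D|)
        for w_i, w_ij in pairs:
            k, k_prime = vocab[w_i], vocab[w_ij]
            d_k = P[k]
            scores = W @ d_k
            scores -= scores.max()
            probs = np.exp(scores)
            probs /= probs.sum()                     # softmax over all k'
            err = probs.copy()
            err[k_prime] -= 1.0                      # d(-log p̂)/d(scores)
            grad_d = W.T @ err                       # gradient w.r.t. d_k
            grad_W = np.outer(err, d_k)              # gradient w.r.t. C_(k';j)
            P[k] -= learning_rate * grad_d
            W -= learning_rate * grad_W

In practice the hierarchical softmax of the Huffman tree described above would replace the full softmax here, and passes would be repeated until the gradient is substantially zero (step S12).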
At step S12, it is determined whether or not the cost function is substantially minimized by (i.e. the gradient substantially zero for) the new vector value θ_1 of θ, e.g. it is determined whether a new gradient vector value (-∇ℒ(θ | trainingSet))|θ_1, evaluated for θ_1, is substantially 0. If so, the training is complete and the method proceeds to step S14. If not, the method returns to S10 where the parameters θ (i.e. the projection matrix P and the output weight vectors C_(k';j)) are once again adjusted and new outputs once again generated in a further attempt to minimize the cost function.
At step S14, once the training (S10-S12) has been completed, a respective continuous (semantic) vector is assigned to each domain name v^(k) in the vocabulary V. The respective semantic vector that is assigned to each domain v^(k) is the corresponding projection layer vector d_k, which corresponds to the kth row of the projection matrix P once the training has been completed (i.e. following steps S10-S12). That is, at the end of training, the weights between the projection layer 4 and the input layer are assigned as the learned continuous vector representations d_k for each domain v^(k) ∈ V. This is illustrated in figures 4 and 6B.
As mentioned above, it is these semantic vectors d_k that have been observed to capture conceptual similarities between different domain names, provided those conceptual similarities are reflected in the relative ordering of those domain names as they appear in the user-level browsing history training sequences. That is, whenever conceptual similarities between different domain names are captured in the relative ordering of those domains in the user-level browsing history training sequences, those conceptual similarities are realized as geometric similarities between their semantic feature vectors d_k = (P_km) in their semantic feature space, or at least in a restricted subspace thereof.
The semantic vectors d_k have dimension |D|, which may be substantially less than the size |V| of the vocabulary V. It has been observed that a size |D| of around 200 is sufficient for the purposes of domain categorization (see below). This represents a significant dimensionality reduction with respect to the 1-of-V encoding, which is also advantageous in terms of computational efficiency. The inventors have found that |D| = 200 is a fair compromise between training speed and representational power. Needless to say, networks with larger |D| take longer to train but can encode more complex or subtle semantic relationships than those with smaller |D|. A reasonable range of values of |D| is between about 50 and 500.
The semantic vectors d_k inhabit a |D|-dimensional vector space or "feature space", and each defines a domain point in that feature space corresponding to the domain name v^(k). Thus, in accordance with the method, each domain v^(k) ∈ V has two vector representations:
- an initial high-dimensional vector representation (|V| dimensions) as a 'discrete' vector y_k of the form (0, 0, ..., 1, ..., 0), with the 1 in a different column for each domain;
- a (possibly much) lower dimensional vector (|D| dimensions) as a 'continuous' vector d_k of the form e.g. (0.12, -0.04, ...).
At step S18, the assigned semantic vectors d_k are matched in order to categorize the domain names in V, such that different domain names with similar semantic vectors (that is, nearby in at least a restricted subspace of the semantic vector space) are assigned to the same domain category or domain categories.
For example, in embodiments, the domain names v^(k) may be categorized by performing a clustering procedure on the semantic vectors d_k.
A domain name may be assigned to a category by storing that domain name in association with a category identifier (e.g. a numerical or textual identifier) of that category in memory 24. Each domain name may be stored in association with more than one category identifier if it is assigned to more than one category.
Having a 1-of-V input layer is equivalent to selecting the relevant continuous vector to provide to the projection layer. Thus, in practice the input layer can be replaced with a lookup to find the correct continuous vector. That is, the projection layer may be implemented as a lookup table that maps domains v^(k) ∈ V in the vocabulary to their continuous vector representations d_k (which are updated in the lookup table at each training iteration).
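As a small illustrative sketch of this equivalence (toy sizes and names assumed):

import numpy as np

P = np.random.default_rng(0).normal(size=(5, 3))  # toy projection matrix, |V|=5, |D|=3
y_k = np.array([0, 0, 1, 0, 0])                   # 1-of-V input vector for domain k=2

# Multiplying the 1-of-V vector into P simply selects row k of P...
d_k_via_matmul = y_k @ P
# ...so in practice a lookup table (domain index -> row of P) can replace the input layer.
lookup = {k: P[k] for k in range(P.shape[0])}
assert np.array_equal(d_k_via_matmul, lookup[2])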
Figure 7 is a flow chart for a K-means clustering algorithm that may be used to categorize the domains in V. Initially, at S22, K category points, each representing a different category, are disposed randomly in the |D|-dimensional feature space inhabited by the semantic vectors d_k.
At S24, each domain v^(k) ∈ V (or at least each domain in some subset of V) is assigned to the category having a category point closest to that domain's domain point defined by d_k.
At S26, each category point is moved such that the average vector distance from that category point to the domain points assigned to that category is zero. At this stage, in so moving the category points, it is likely that some of the domain points will now be closest to a different category point than the one to which they were originally assigned.
Thus, at S28, it is determined whether or not any category reassignments are necessary. If so, the algorithm returns to S24 so that each domain point is once again assigned to the closest category point, and, following that reassignment, the category points are once again moved so that the average vector distance from each category point to the domain points of the domains assigned to that category is zero (S26).
If at S28 it is determined that no reassignments are necessary, the algorithm ends with the domain names having been successfully categorized.
One run of steps S22-S28 constitutes an iteration I of the clustering algorithm.
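An illustrative numpy sketch of steps S22-S28 is given below, taking the semantic vectors d_k stacked as the rows of an array; K, the random seed and the stopping test are assumptions made for the example, and a library clustering routine could equally be used.

import numpy as np

def kmeans_categorize(semantic_vectors, K=100, seed=0):
    """Assign each |D|-dimensional semantic vector d_k to one of K categories
    by iterating steps S24 (assignment) and S26 (moving the category points)."""
    rng = np.random.default_rng(seed)
    n, d = semantic_vectors.shape
    # S22: dispose K category points randomly in the feature space.
    category_points = rng.normal(size=(K, d))
    assignments = np.full(n, -1)
    while True:
        # S24: assign each domain point to its closest category point.
        dists = np.linalg.norm(
            semantic_vectors[:, None, :] - category_points[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # S28: stop when no reassignments are necessary.
        if np.array_equal(new_assignments, assignments):
            return assignments
        assignments = new_assignments
        # S26: move each category point to the mean of its assigned domain points.
        for c in range(K):
            members = semantic_vectors[assignments == c]
            if len(members) > 0:
                category_points[c] = members.mean(axis=0)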
An example of this is shown in figure 6. Figure 6 shows a plurality of domain points 102 representing different domains in an exemplary two-dimensional feature space F of two-dimensional semantic vectors d_k. Each of the points 102 is initially unassigned but is intended to be assigned to one of two categories in this example, category 1 (104) and category 2 (106). A first iteration I0 of the algorithm commences by disposing category points 104, 106 for each of the two categories at random positions in the feature space F. Each of the domain points 108 closest to the category 1 point is then assigned to category 1, and each of the domain points 110 closest to the category 2 point is assigned to category 2 (the dotted lines shown in the figure denote a line equidistant between the two category points 104, 106 in the feature space F). To complete the iteration I0, each of the category points 104, 106 is moved to a new location in F such that the average vector distance from 104 (resp. 106) to the domain points assigned to category 1 (resp. category 2) is substantially zero. A second iteration I1 of the algorithm commences by updating the category assignments to reflect those new locations of the category points. As a result, in this example, two domain points 112 are reassigned from category 2 to category 1 and another two domain points 114 are reassigned from category 1 to category 2. To complete the second iteration I1, the category points 104, 106 are once again moved to new locations in F such that the average vector distance from 104 (resp. 106) to the domain points assigned to category 1 (resp. category 2) is once again substantially zero. As this does not necessitate any further reassignments in this example, the algorithm terminates accordingly.
Once the domain names v^(k) have been successfully categorized, human readable labels may be assigned to each of the categories.
In some embodiments, this is a manual process, e.g. with a meaningful label being assigned by manually looking at, say, a subset of domains in each category and assigning a meaningful label accordingly.
In other embodiments, the labelling process may be automatic. For instance, given a category of multiple domain names, those domain names could be processed to recognize a word that occurs frequently in at least some of those domains, or to recognize some other shared attribute exhibited by those domains (e.g. any common sequence of characters, e.g. text characters). For example, in the case of a category of domain names pointing to food-related websites, a significant number of these may have the word "food" somewhere in the domain names themselves. This can be recognized automatically, and the label "food" applied automatically to the relevant category.
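By way of a sketch only, such automatic labelling could be approximated as follows; the way the domain names are broken into character sequences and the minimum share threshold are assumptions made for the example.

from collections import Counter

def auto_label(category_domains, min_len=4, min_share=0.5):
    """Pick a label for a category of domain names by finding a character
    sequence (e.g. the word "food") shared by a large share of the names."""
    def substrings(name):
        s = name.lower().split(".")[0]  # drop the TLD part, e.g. ".com"
        return {s[i:i + n] for n in range(min_len, len(s) + 1)
                           for i in range(len(s) - n + 1)}
    counts = Counter()
    for name in category_domains:
        counts.update(substrings(name))
    share_needed = min_share * len(category_domains)
    # Prefer the longest sufficiently common character sequence, then the most common.
    candidates = [(len(sub), cnt, sub) for sub, cnt in counts.items() if cnt >= share_needed]
    if not candidates:
        return None  # fall back to manual labelling
    return max(candidates)[2]

# Toy usage with assumed domain names:
print(auto_label(["bestfoodrecipes.com", "foodlovers.net", "dailycooking.org"]))  # -> "food"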
to It should be noted that, whilst domains that end up in similar categories may have textuafly similar domain names, this is not the criteria on which semantic vectors are as&gned, nor is it the criteria on which domains are categorized. Rather, as discussed. domains are categorized based on historical user browsing actMty, which is reaUzed by way of the assignment and subsequent matching of semantic vectors generated by training the neural network 1.
Live Phase
A method of delivering targeted content to users will now be described with reference to figure 9. The method utilizes the category assignments determined in the categorization phase (that is, in performing the categorization method of figures 6A and 6B).
At step S42, a current user's browsing activity is detected. The current user may or may not be a user from whom browsing history data was collected and used in the categorization method. The browsing activity is detected via the network 14 at a user device of the current user which is connected to the network 14. The current browsing activity comprises the current user accessing at least one domain identified by a domain name v^(i) that has been assigned to a particular domain category in the categorization phase. For example, a notification of the current browsing activity may be received at the server 20 via the network 14 from the user device of the current user, that notification comprising the domain name of the accessed network domain.
At step S44, the detected current browsing activity is matched to at least one domain category. Specifically, it is determined that the at least one domain accessed by the current user is identified by a domain name that has been previously assigned to the particular domain category.
At S46, content appropriate to the particular domain category is selected in memory 24. For example, where the particular category is a category of domains relating to a particular topic (e.g. knitting), content comprising information relating to that topic (e.g. comprising knitting-related information) may be selected. The content may, for instance, be advertising content.
Pieces of content may be stored in memory 24, each stored in association with a respective domain category identifier of a category to which that content is relevant.
A piece of content may be selected by matching a category identifier assigned to the accessed domain (assigned in the training phase, and accessed in the live phase) to a category identifier associated with that piece of content.
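Purely as an illustration of steps S44-S46, the category lookup and content selection could be sketched as follows; the dictionaries standing in for the associations held in memory 24 and the example identifiers are assumptions.

# Assumed stand-ins for the associations held in memory 24.
domain_to_categories = {"knittingworld.example": {"cat_knitting"},
                        "travelnow.example": {"cat_travel"}}
content_by_category = {"cat_knitting": ["ad_wool_offer"],
                       "cat_travel": ["ad_city_break"]}

def select_content(accessed_domain):
    """S44: match the accessed domain to its previously assigned categories;
    S46: select content stored against a matching category identifier."""
    selected = []
    for category_id in domain_to_categories.get(accessed_domain, set()):
        selected.extend(content_by_category.get(category_id, []))
    return selected

print(select_content("knittingworld.example"))  # -> ['ad_wool_offer']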
At step S48, the selected content is transmitted via the network 14 to the user device of the current user for delivery (e.g. output via a display screen and/or loudspeaker(s) of that user device) to the current user.
In this way, content (e.g. advertising content or "ad") that is targeted to a current user's current browsing activity, and therefore likely to be of relevance to that current user, is automatically selected in a robust and reliable manner due to the robust and reliable categorizations determined in the categorization phase.
In practice, each user (e.g. 16a, 16b etc.) may be assigned to one or more "interest groups" based on that user's general browsing activity. For instance, if the user has browsed one or more websites in a particular category (e.g. identified as relating to a particular topic, e.g. travel), they may be assigned to an interest group associated with that category (e.g. an interest group associated with that particular topic, e.g. a travel interest group). That way, e.g. a travel-related ad can be displayed to a user who is believed to be interested in travel regardless of the site the user is on at the moment (even if it is not related to travel). Users are typically identified within the network 14 by user identifiers, such as network (e.g. IP) addresses of their devices 18a, 18b, or some anonymized versions thereof, and users may be assigned to interest group(s) by storing an association between their user identifiers and that/those interest group(s).
It should be noted that the term "set" as used in the above may not refer to a set in the strict mathematical, set-theoretic sense, as importantly at least some of these sets, such as the set of observations (trainingSet), can have a defined concept of the number of times an element appears in those sets (in contrast to a strict mathematical set, for which {a, a} = {a}).
Whilst in the above a negative log-likelihood is minimized to train the neural network, in alternative embodiments a positive log-likelihood may be optimized, e.g. using a gradient ascent procedure. As used herein, the term "log-likelihood" is intended to cover both positive and negative log-likelihoods, which may be optimized by maximization or minimization as appropriate.
It will be appreciated that the above embodiments have been described only by way of example, and other variants or applications may be apparent to a person skilled in the art given the disclosure herein. The scope is not limited by the described examples but only by the following claims.

Claims (21)

1. Claims: 1. A computer-implemented method of categorizing a plurality of domain identifiers of network domains using a neural network having a hidden layer connected to an output layer, the method comprising: accessing a plurality of sequences of training data, each representing a respective user's historical browsing history and constituting a sequence of domain identifiers of network domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed; training the neural network to model relationships between different domain identifiers based on their positions in the sequences of training data relative to one another, the step of training comprising modifying parameters of the hidden layer; assigning semantic vectors to the plurality of domain identifiers based on the modified parameters of the hidden layer of the trained neural network; and assigning categories to the plurality of domain identifiers by matching their assigned semantic vectors.
  2. 2. The method of claim 1 wherein the step of training comprises optimizing a modelled likelihood of the sequences of training data, the step of optimising comprising modifying the parameters of the hidden layer.
  3. 3. The method of claim 2 wherein the likelihood that is optimized is a modelled log-likelihood of the training data.
  4. 4. The method of claim 2 or 3 wherein the neural network is a skip-gram neural network, the output layer being configured to compute modelled skip-gram probabilities for different skip-gram pairs of domain identifiers, and the likelihood that is optimized is a modelled likelihood of a set of observed skip-grams observed in the training sequences.
  5. 5. The method of any preceding claim wherein the output layer is operable to compute substantially softmax probabilities based on the hidden layer parameters.
  6. 6. The method of claim 5 wherein the output layer is configured to compute hierarchical softmax probabilities.
  7. 7. The method of any preceding claim wherein the hidden layer comprises between 50 and 500 nodes, the semantic vectors thereby having substantially that number of dimensions.
  8. 8. The method of any preceding claim wherein the step of categorizing comprises performing a clustering algorithm on the semantic vectors.
  9. 9. The method of any preceding claim wherein each of the domain identifiers is a domain name.
  10. 10. The method of any preceding claim further comprising, for at least one category, processing domain identifiers assigned to that category to detect a shared attribute exhibited by at least some of those domain identifiers, and automatically assigning a category label to that category based on the shared attribute.
  11. 11. The method of claim 10 wherein the shared attribute is a sequence of characters that is present in each of the at least some domains.
  12. 12. The method of any preceding claim wherein the users' historical browsing histories are anonymized historical browsing histories.
  13. 13. The method of any preceding claim wherein the neural network is trained to model said association based only on the domain identifiers and their conveyed order and not based on any content obtained from the identified domains.
  14. 14. The method of any preceding claim wherein the hidden layer is a projection layer.
  15. 15. The method of any preceding claim wherein the hidden layer is implemented as a lookup table configured to map each of the plurality of domains to its respective semantic vector representation.
  16. 16. A method of delivering targeted content to a current user of a network comprising a plurality of network domains identified by a plurality of domain identifiers, the method comprising: in a training phase, performing the method of any preceding claim to assign categories to the plurality of domain identifiers; in a live phase: detecting current browsing activity in the network by the current user, the current browsing activity at a user device associated with the current user and comprising the user accessing at least one of the identified network domains; identifying at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; selecting content for delivery to the current user based on the identified category; and transmitting the selected content to the user device for outputting to the current user.
  17. 17. A method of delivering targeted content to a user of a network comprising a plurality of network domains identified by a plurality of domain identifiers, the method comprising: in a training phase, performing the method of any of claims 1 to 15 to assign categories to the plurality of domain identifiers; in a live phase: accessing a browsing history of the user, the browsing history identifying at least one network domain that has been accessed by that user; identifying at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; assigning the user to at least one interest group based on the identified category; selecting content for delivery to the user based on the assigned interest group; and transmitting the selected content to a user device associated with the user for outputting to the user.
  18. 18. A computer readable medium storing code configured, when executed, to implement the method of any preceding claim.
  19. 19. A computer system comprising: computer storage holding a plurality of sequences of training data, each representing a respective user's historical browsing history and constituting a sequence of domain identifiers of network domains accessed by that user having positions in that sequence which convey the order in which those domains were accessed; one or more processors configured in a categorization phase to train a neural network, having a hidden layer connected to an output layer, to model relationships between different domain identifiers based on their positions in the sequences of training data relative to one another, the step of training comprising modifying parameters of the hidden layer; to assign semantic vectors to the plurality of domain identifiers based on the modified parameters of the hidden layer of the trained neural network; and to assign categories to the plurality of domain identifiers by matching their assigned semantic vectors.
  20. 20. The computer system of claim 19 further comprising a network interface for connecting to a network comprising the plurality of network domains; wherein the processors are configured in a live phase to detect via the network interface a current browsing activity in the network by a current user of the network, the current browsing activity at a user device associated with the current user and comprising the user accessing at least one of the identified network domains; to identify a category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; and to select content for delivery to the current user based on the identified category; and wherein the network interface is configured to transmit the selected content to the user device of the current user.
  21. 21. The computer system of claim 19 further comprising a network interface for connecting to a network comprising the plurality of network domains; wherein the processors are configured in a live phase to access a browsing history of a user, the browsing history identifying at least one network domain that has been accessed by that user; to identify at least one category that has been assigned to the domain identifier of the accessed network domain in the categorization phase; to assign the user to at least one interest group based on the identified category; and to select content for delivery to the user based on the assigned interest group; and wherein the computer system comprises a network interface configured to transmit the selected content to a user device associated with the user for outputting to the user.
GB1408662.3A 2014-05-15 2014-05-15 Internet Domain categorization Withdrawn GB2528030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1408662.3A GB2528030A (en) 2014-05-15 2014-05-15 Internet Domain categorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1408662.3A GB2528030A (en) 2014-05-15 2014-05-15 Internet Domain categorization

Publications (2)

Publication Number Publication Date
GB201408662D0 GB201408662D0 (en) 2014-07-02
GB2528030A true GB2528030A (en) 2016-01-13

Family

ID=51134933

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1408662.3A Withdrawn GB2528030A (en) 2014-05-15 2014-05-15 Internet Domain categorization

Country Status (1)

Country Link
GB (1) GB2528030A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831949B (en) * 2019-04-22 2023-09-15 百度在线网络技术(北京)有限公司 Quick classification method, classification system and classification device for class identification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110191344A1 (en) * 2010-02-03 2011-08-04 Jing Jin Automatic organization of browsing histories
US20140006399A1 (en) * 2012-06-29 2014-01-02 Yahoo! Inc. Method and system for recommending websites
CN103744981A (en) * 2014-01-14 2014-04-23 南京汇吉递特网络科技有限公司 System for automatic classification analysis for website based on website content

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11023516B2 (en) 2016-09-22 2021-06-01 International Business Machines Corporation Discovering media content using natural language processing and machine learning
US20220014952A1 (en) * 2018-10-26 2022-01-13 Eureka Analytics Pte Ltd User Affinity Labeling from Telecommunications Network User Data
US20220253502A1 (en) * 2021-02-05 2022-08-11 Microsoft Technology Licensing, Llc Inferring information about a webpage based upon a uniform resource locator of the webpage
US11727077B2 (en) * 2021-02-05 2023-08-15 Microsoft Technology Licensing, Llc Inferring information about a webpage based upon a uniform resource locator of the webpage

Also Published As

Publication number Publication date
GB201408662D0 (en) 2014-07-02


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)