GB2452555A - Identification of insecure network nodes, such as spammers, using decoy addresses - Google Patents

Identification of insecure network nodes, such as spammers, using decoy addresses

Info

Publication number
GB2452555A
Authority
GB
United Kingdom
Prior art keywords
nodes
decoy
insecure
subset
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0717523A
Other versions
GB2452555B (en)
GB0717523D0 (en)
Inventor
Georgios Kalogridis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB0717523.5A
Publication of GB0717523D0
Publication of GB2452555A
Application granted
Publication of GB2452555B
Status: Expired - Fee Related
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H04L12/585
    • H04L29/06877
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/48 Message addressing, e.g. address format or anonymous messages, aliases
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0407 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the identity of one or more communicating identities is hidden
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1491 Countermeasures against malicious traffic using deception as countermeasure, e.g. honeypots, honeynets, decoys or entrapment

Abstract

A method of identifying one or more insecure nodes amongst a set of nodes is disclosed. Each of a plurality of subsets of nodes is provided with a respective decoy address. For example, a first decoy address could be supplied to nodes 4, 6 and 8; a second decoy address to nodes 4, 10 and 12; and a third decoy address to nodes 4, 14 and 16 etc., as indicated in table 1. Receipt of messages at the decoy addresses is monitored and the combination of decoy addresses which have received at least one message is identified. An example is given on page 18 in which messages are received at the first five decoy addresses but not at the sixth or seventh decoy addresses. Knowing the combination of decoy addresses which received messages, it is possible to deduce which of the nodes are insecure. In the example of page 18 the first and second nodes 4, 6 are identified as insecure because neither was tested using the sixth or seventh decoy addresses and these were the only two addresses not to receive messages. This combinatorial approach lures insecure nodes into thinking that their behaviour is not being monitored. Non-adaptive group testing algorithms may be used to assign nodes to subsets for testing. The invention may be used to identify e-commerce web sites which collaborate in issuing spam mail to customers.

Description

Identification of insecure network nodes

The present invention relates to data privacy and network security evaluation, for example the identification of insecure hosts or nodes present within a network.
Data privacy is an important security subject and is increasingly heavily regulated.
However, in an era of global, dynamic, heterogeneous networks, in which private data are collected and exploited in a plethora of large-scale databases, regulations and security policies are not easy to enforce.
Spamming is one direct consequence of the infringement of data privacy, and the present invention may, in one aspect, be used to identify insecure hosts that leak data which is subsequently used by spammers.
The problems posed by spam have grown from simple annoyances to significant security issues. In addition to the time wasted viewing and deleting spam, spam also poses security risks including: identity theft; phishing and other spam-related scams; and viruses, worms, and malware.
Spammers may obtain user data, including user e-mail addresses, from hosts to which the data is supplied by a user, either wittingly or unwittingly. Such hosts may include, for instance, e-commerce websites or other websites where a user is requested to supply information, for instance via on-line forms. Spammers may obtain information from such hosts via unauthorised access to the records of such hosts, or spammers may set up their own dummy hosts specifically in order to gather user data.
Various anti-spam solutions have been proposed and a few have been implemented.
Current anti-spam solutions fall into four primary categories: spam filters, reverse lookups, challenges, and cryptography based solutions. Each of those categories is now considered in turn.
Spam filters attempt to identify spam and thus limit a recipient's exposure to spam.
However, filters do not prevent spam. There are many different types of spam filter system, including: word lists; black lists and white lists (based on IP addresses); hash tables (for detecting bulk mailing); and artificial intelligence-based and probabilistic systems that operate by self-learning. These filter-based anti-spam approaches have various significant limitations, including: the possibility of bypassing the filter; the problems of false positives and false negatives, in which messages are incorrectly identified as spam or not spam; and the necessity for spam mailbox management, requiring significant user input.
Reverse-lookup systems attempt to address problems of forgery of a sender's address in spam mails. For example, DNS is a global network service that is used to match IP addresses with hostnames and vice versa, enabling identification of messages that originate from an IP address that is inconsistent with the apparent hostname. It is also possible to maintain mail exchanger (MX) records, in the same way that DNS maintains domain names. When delivering email, a mail server may determine where to pass the message based on the MX record associated with the recipient's domain name. While reverse lookups are viable in closed environments, such as a corporate internal network, the solutions are not general enough for worldwide acceptance. They are also unsuitable for host-less and vanity domains, or for mobile computing.
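By way of illustration only (this prior-art reverse lookup is background, not part of the claimed method), such a consistency check can be sketched using the Python standard library; the sample IP address and domain below are placeholders:

    import socket

    def reverse_lookup_matches(ip, claimed_domain):
        # Crude consistency check: does the PTR hostname of `ip` end with
        # the domain the message claims to have come from?
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        except (socket.herror, socket.gaierror):
            return False  # no reverse record available
        return hostname.lower().rstrip(".").endswith(claimed_domain.lower())

    # e.g. reverse_lookup_matches("192.0.2.1", "example.com")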
Challenge-based techniques attempt to impede bulk-senders by slowing the bulk-mailing process. Users that send a few emails at a time should not be significantly impacted by such challenge-based techniques. There are two main types of challenge: challenge-response and proposed computational challenges. Challenge-Response (CR) systems maintain a list of permitted senders. E-mail from a new sender is temporarily held without delivery. The new sender is sent an email that provides a challenge (usually a click on a URL or a reply email). CR systems have a number of limitations, including: CR deadlock (if two people both use CR systems, then they will not be able to communicate with each other); automated systems cannot respond to challenges (such emails are unexpected and unsolicited, but not necessarily undesirable); and interpretation challenges (character/pattern recognition). Computational Challenge (CC) systems attempt to add a "cost" to sending email. Most CC systems use complex algorithms that are intended to take time. The currently proposed computational challenges are unlikely to be widely adopted; they do not appear to mitigate the spam problem and do appear to inconvenience legitimate mailers.
A few solutions have been proposed that use cryptography to validate the spam sender.
Essentially, these systems use certificates to perform the authentication. Unfortunately, these cryptographic solutions are unlikely to stop spam as they do not validate that an email address is real; they only validate that some sender has signed the email with some key.
Furthermore, current cryptographic art is unable to thwart passive attacks on mobile objects (e.g. the security agents discussed below). As such, any malicious host may access an email address and private data contained in an agent and use it for unsolicited spamming purposes. As email addresses are typically exposed to a large number of hosts, tracing the guilty host or hosts is non-trivial.
It has also been suggested to maintain reputation lists for hosts. E-mail servers could then use these dynamically updated lists in order to classify spam emails. However, no viable solutions for providing reliable reputation lists have been suggested.
In addition to the systems mentioned above, it has also been suggested to use decoy e-mails to detect spam e-mail in Korean Patent No. KR 20040038173, in which a decoy mail address writer writes a single decoy mail address to a website, and a detector detects mail received at the single decoy mail address. The evaluation process is simple as the website is tested on a one-to-one basis.
One problem with the system mentioned in KR 20040038173 is that the lure for the potentially insecure hosts may be limited. This means that a malicious host may not always behave maliciously if it suspects that it is being evaluated by a decoy entity.
Furthermore, for testing a large number of hosts, such a decoy system would be difficult to implement in practice, as it requires at least as many decoy entities to be maintained, transmitted and monitored as there are hosts to be tested.
Security evaluation of hosts has also been studied in the mobile agent security research area, for evaluating trust in remote environments. Spy agent security systems have been described in GB2415580 and GB2428315, both in the name of Toshiba Research Europe Limited.
Spy agent security can be classified as a pre-emptive security evaluation technique for protecting agents against malicious hosts. For instance, a security assessment can be facilitated by agent migration to unknown hosts, by assuming that target hosts provide the agents with the security information they request.
However, one problem with the use of such security agents is that a corrupted remote host could detect incoming security agents and selectively behave well in order to escape detection. The same host could behave inappropriately when it has the opportunity to cheat without being detected.
The present invention aims to provide an alternative method and apparatus to those mentioned above. None of the known techniques, such as those described above, provide a technical solution to the problem of identifying insecure hosts, whose actions may lead to the generation of spam e-mail, amongst a set of hosts. The known techniques are generally used in combination, and require ongoing user input and management.
It is an aim of aspects of the present invention to provide a technical solution to the problem of identifying the origin of a security breach leading to the receipt of spam e-mail or to other unwanted network activity.
In a first, independent aspect of the invention there is provided a method of identifying one or more insecure nodes among a set of nodes, comprising:-providing each of a plurality of decoy entities to a respective subset of the set of nodes, each decoy entity comprising a respective decoy address; monitoring the decoy addresses; and processing the results of the monitoring to determine the combination of decoy addresses that have received at least one message and to identify a node or nodes as being insecure consistent with that combination.
By providing a plurality of decoy entities, and by identifying a node or nodes as being insecure consistent with the combination of decoy addresses that have received at least one message, the claimed method provides an alternative, combinatorial approach to the identification of insecure nodes, and may provide improved efficiency, reliability and security.
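By way of illustration only, the decode step implied by this aspect can be sketched in a few lines of Python; the subset assignments below follow the example in the abstract (decoy addresses supplied to nodes {4, 6, 8}, {4, 10, 12} and {4, 14, 16}), and all names are illustrative rather than part of the disclosure:

    # A minimal sketch of the combinatorial decode step, assuming a
    # suitable (disjunct) design; design[i] is the subset of nodes that
    # was given decoy address i.
    design = [
        {4, 6, 8},    # decoy address 0
        {4, 10, 12},  # decoy address 1
        {4, 14, 16},  # decoy address 2
    ]

    def identify_insecure(design, received):
        # received[i] is True if decoy address i received at least one
        # message. A node is consistent with being insecure only if it
        # appears in no subset whose decoy address stayed clean
        # (a negative test).
        all_nodes = set().union(*design)
        cleared = set()
        for subset, hit in zip(design, received):
            if not hit:
                cleared |= subset
        return all_nodes - cleared

    # e.g. identify_insecure(design, [True, True, False]) clears nodes
    # 4, 14 and 16 and leaves {6, 8, 10, 12} as insecure candidates.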
Preferably each node comprises a respective host.
The term host as used herein refers to a node, also referred to as a network location, that hosts a particular application or service. Such applications or services may include webpages or websites. Each node has a respective, unique address. Hosts and nodes may be implemented in hardware and/or software. In the following discussion, reference is made to both hosts and nodes, and for the purposes of that discussion it should be considered that each host is in a one-to-one relationship with a respective node.
The number of decoy entities and the subsets of nodes selected for those decoy entities, considered together, ultimately determine what information it may be possible to extract by performance of the method. The number of decoy entities and the collection of subsets of nodes of the decoy entities may be referred to as the test design, or simply the design.
The method may have particular application in relation to e-commerce security, and each node may comprise an e-commerce site.
The method may, ultimately, deter nodes or hosts from exploiting private data for unsolicited purposes as the nodes or hosts may be wary that there is a risk that such unsolicited exploitation may be detected if the method is applied to their activities.
The method may be used in conjunction with, and enhance, the functionality of known related security mechanisms dealing with spamming, phishing, or violation of privacy.
The method may complement existing reputation-based security evaluations, prevention-based private data protection systems, fraud detection systems, and anti-spam security systems.
For example, the method may be used:
- to identify e-commerce nodes that do not treat sensitive information in accordance with their published security policies (data privacy infringers);
- to maintain a reputation table for nodes;
- to deter such nodes from misusing personal data;
- to aid spam filters in maintaining better black lists; or
- to aid web browsers in maintaining reputation tables for remote nodes.
The method may be used to evaluate trust in remote environments, and preferably remote nodes should be unaware that their trustworthiness is being tested. The structure of the decoy entities is preferably the same as the structure of a generic e-commerce application entity.
The method may be of particular use to, for instance, laptops, PC equipment, mobile or handheld phones, internet security providers, certificate authorities, or government security agencies.
At least one of the subsets of nodes may comprise a plurality of nodes, and preferably each of the subsets of nodes comprises a respective plurality of nodes.
By ensuring that a plurality of nodes is included in each of the subsets, it is possible to test nodes in groups instead of one by one, so that fewer decoy entities and less testing time are needed. That is of particular importance when routinely performing large-scale tests.
Furthermore, having larger subsets for each decoy entity makes the decoy entity more luring for any potentially insecure node and may yield more credible results. The more nodes are included in the subset of nodes of a decoy entity, the less an insecure node may suspect a spying scenario, the more it may evaluate that it can cheat without being detected, and the more it may be enticed to misbehave, thus revealing its insecure character.
The processing may comprise determining for each subset of nodes whether that subset of nodes contains at least one insecure node, and comparing the outcome of the determination for each of the subsets of nodes.
Preferably a collection of decoy private data is exposed to combinations of nodes in a way that detects potential spammers or other insecure nodes while maximising the decoy entities' lure.
Each decoy entity may comprise data, such as an email address or addresses, that enables each of the nodes in the subset of nodes for that decoy entity to become aware of the identity of the other nodes in the subset of nodes.
If such data is included in a decoy entity, then any insecure nodes may be more easily lured to behave insecurely with regard to the entity as they would be aware of the other nodes to which the entity is exposed. They may therefore consider that any insecure behaviour on their part may not be easily attributable to them.
Alternatively or additionally each node may be able to collude with each other node in each subset of which it is a part so as to obtain information concerning the decoy entities encountered by said each other node. An e-mail message, or other data sent under a traditional client-server interaction may be used, as a software agent may not be suitable in that case. If e-mail messages are used as the decoy entities then each such e-mail message may be sent to a sub-group of recipients in order for all said recipients within each subgroup to be aware that they all receive this email message. Alternatively, if other data is used as the decoy entity then other data may be sent separately to each of the nodes in its subset. The nodes may still be aware of the other nodes that had received the same e-mail or other data due to the possibility of collusion between the nodes (satisfying certain requirements). In these cases a potentially insecure node again may be induced to act insecurely in light of the described awareness.
The method may comprise selecting the respective subset of nodes for each decoy entity so as to ensure that the insecure nodes are identified correctly and unambiguously.
The respective subset of nodes for each decoy entity may be selected in accordance with a group testing algorithm, and the group testing algorithm is preferably a non-adaptive group testing (NGT) algorithm.
Group testing algorithms are algorithms that can be applied to a set of parameters relating to a group of items containing a number of defective items in order to obtain a suitable test design which identifies the defectives. Group testing algorithms have been found to be particularly useful in determining the number of decoy entities and the particular subsets of nodes for those decoy entities that enable the correct identification of insecure nodes for any given set of nodes, while maximising the reliability and the security parameters of the decoys.
A design that represents a group testing algorithm may be referred to as a group testing design. Preferably the number of decoy entities and the respective subset of nodes for each decoy entity comprise a group testing design, preferably a non-adaptive group testing (NGT) design.
The use of group testing algorithms may enable large-scale tests to be performed efficiently (i.e. in a reduced number of steps or reduced time). The use of group testing algorithms or group testing designs is suitable both for commercial/consumer products and for large security agencies.
Preferably the processing is carried out in accordance with the group testing algorithm.
An NGT design may be represented by a disjunct incidence matrix. Preferably a disjunct incidence matrix may be constructed from a t-(u,k,λ) design (or simply t-design), which is a combinatorial block design.
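By way of illustration only, the disjunctness property (a matrix is d-disjunct if no column is contained in the union of any d other columns, as defined later in this description) can be checked by brute force; the following Python sketch is a hypothetical aid, exponential in d and viable only for small designs:

    from itertools import combinations

    def is_d_disjunct(matrix, d):
        # matrix: list of 0/1 rows. Returns True if no column is contained
        # in the union of any d other columns (the d-disjunct property).
        n_cols = len(matrix[0])
        cols = [{r for r, row in enumerate(matrix) if row[c]}
                for c in range(n_cols)]
        for c in range(n_cols):
            others = [j for j in range(n_cols) if j != c]
            for combo in combinations(others, d):
                if cols[c] <= set().union(*(cols[j] for j in combo)):
                    return False
        return True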
In normal circumstances, it may be assumed that if the decoy entities obtained from a particular NGT design have an adequate lure, then when performing the method based upon that design, all insecure nodes being tested will deterministically misuse information exposed by the decoy entities in a detectable way, as discussed, and will thus reveal their insecure nature.
If the assumption set out in the preceding paragraph is incorrect for a particular set of nodes and if one or more insecure nodes of the set acts stochastically (for instance, they randomly misuse only a certain number of email addresses), then an NGT algorithm may be selected that is robust to such errors, providing that the errors are not too many.
D.J. Balding and D.C. Torney, Journal of Combinatorial Theory, Series A, 74(1), 131-140, 1996, and references therein may be referred to for further information concerning such robust NGT algorithms.
The method may further comprise selecting the group testing algorithm in dependence upon at least one characteristic of the set of nodes.
The group testing algorithm may be selected, for instance, in dependence on the number of nodes in the set of nodes and/or the maximum number of insecure nodes that are expected to be present within the set. The maximum number of insecure nodes that are expected to be present may be estimated using known security evaluation techniques.
Alternatively, the size and make-up of the set of nodes may be selected so as to be suitable for a particular group testing algorithm that is to be used.
The subsets of nodes may be such that the method is able to identify correctly up to a pre-determined number of insecure nodes, and preferably the method further comprises generating a signal if more than the pre-determined number of insecure nodes is identified.
The method may comprise selecting the respective subset for each of the plurality of decoy entities so as to maximise the number of nodes in each subset and/or so as to minimise the number of nodes in common between any pair of the decoy entities and/or so as to maximise the number of insecure nodes that can be identified correctly.
Application of one or more of those conditions has been found to provide optimal, or at least improved, efficiency and/or accuracy.
Preferably the subsets of nodes are such that the respective subset of nodes for each of the decoy entities has no more than one node in common with any other of the subsets of nodes.
The monitoring may be performed for a pre-determined period of time, and the processing is preferably performed on or after expiry of the pre-determined period of time.
Alternatively, the processing may be performed whilst the monitoring is still going on.
The processing may be repeated periodically during the performance of the monitoring.
Alternatively the processing may be triggered by activity at the decoy addresses, for instance the receipt of a message or a predetermined number of messages. The method may comprise halting the monitoring once more than the or a pre-determined number of insecure nodes is identified. Thus, the method may be halted once it has become clear that a correct outcome may not be obtained.
Each entity may comprise a software agent or a data message or an e-mail message, and may include unique decoy data, such as a decoy email address.
Unlike malicious entities such as viruses, decoy software agents are legitimate entities in the sense that they interact with nodes in the way expected by the visited nodes. As such they are analogous to honeypots.
The structure of decoy software agents may include a pre-coded routing scheme, which nodes are usually able to access in order to facilitate migration/routing, a personal email address and other related information. All these elements may be digitally signed and time-stamped in order to enable authentication, non-repudiation and reliability.
In a further independent aspect, there is provided apparatus for identifying one or more insecure nodes amongst a set of nodes, comprising:-means for providing each of a plurality of decoy entities to a respective subset of the set of nodes, each decoy entity comprising a respective decoy address; means for monitoring the decoy addresses; and processing means configured to determine the combination of decoy addresses that have received at least one message and to identify a node or nodes as being insecure consistent with that combination.
Preferably each node comprises a respective host.
At least one of the subsets of nodes may comprise a plurality of nodes, and preferably each of the subsets of nodes comprises a respective plurality of nodes.
Each decoy entity may comprise data that enables each of the nodes in the subset of nodes for that decoy entity to become aware of the identity of the other nodes in the subset of nodes.
The apparatus may further comprise means for selecting the respective subset of nodes for each decoy entity according to a group testing algorithm, and the group testing algorithm is preferably a non-adaptive group testing (NGT) algorithm.
Preferably the apparatus further comprises means for selecting the group testing algorithm in dependence upon at least one characteristic of the set of nodes.
The apparatus may further comprise means for generating a signal if the apparatus identifies more than the pre-determined number of nodes as being insecure.
The apparatus may further comprise means for selecting the respective subset for each of the plurality of decoy entities so as to maximise the number of nodes in each subset and/or so as to minimise the number of nodes in common between any pair of the decoy entities and/or so as to maximise the number of insecure nodes that can be identified correctly.
The selecting means may be configured to ensure that the respective subset of nodes for each of the decoy entities has no more than one node in common with any other of the subsets of nodes.
The monitoring means may be configured to monitor the decoy addresses for a pre-determined period of time, and the processing means is configured to perform the processing on or after expiry of the pre-determined period of time.
Each entity may comprise a software agent, or a data message, or an e-mail message.
In a further, independent aspect there is provided a computer program product storing computer executable instructions operable to cause a general purpose computer communications apparatus to become configured to perform a method in accordance with any one of claims 1 to 10 or to become configured as apparatus in accordance with any one of claims 11 to 20.
In a further aspect there is provided a collection of decoy entities that expose decoy private data (including decoy email addresses) to combinations of target nodes.
Preferably there is provided a design that determines where each decoy entity exposes what, such that spammers or infringers may be lured to act insecurely. Thus, judging from potentially received spam or infected email, the spammers or infringers may be identified, while preferably maximising the decoy's lure (and hence the result's reliability).
A group of decoy spy entities that may contain pseudo-personal private data, including email addresses, may be exposed to subgroups of remote hosts, implementing a decoy honeypot system that analyses received spam emails or infected emails or fraud-based emails in order to identify which hosts are the infringers and which are not. Given a set of spying requirements, the task of designing routes (that determine where each spy's private data are exposed), such that the hosts can be classified from the detectable outcomes, may be treated as a group testing problem. Non-adaptive group testing, in particular, may provide an efficient optimal or sub-optimal combinatorial construction.
Exposing decoy entities to target hosts in a way that makes them more luring to infringers may improve security. Efficiency may be improved by reducing the number of decoy entities needed to test a certain group of target hosts. Lists of known combinatorial constructions may be used to find an optimum or near-optimum solution in specific scenarios where certain parameters are given.
There is also provided a network security evaluation system in which the integrity of a set of remote hosts with regard to spamming and data privacy is tested with a series of legitimate decoy entities, wherein each decoy entity contains a unique pseudo-personal email address and unique pseudo-personal private data, the system comprising:- means for exposing said pseudo-personal information to distinct subsets of the said remote hosts; means for making the decoy luring by allowing each remote host within each said distinct subset to be aware that the said pseudo-personal private data are exposed to all other remote hosts within the said distinct subset; means for applying an algorithm for deciding the subsets of remote hosts for each of the decoy entities; means for constructing the algorithm so that the lure is maximised, by maximising the number of target hosts in each subgroup of hosts and by exposing any two decoy entities to no more than one common target host; and means for constructing the algorithm so that all the hosts are classified as either honest or dishonest, the latter being the ones who have sent unsolicited email or have colluded with other parties to disclose private data which has been used for further spamming.
In a further aspect there is provided means for choosing the most suitable combinatorial group testing algorithm, with which optimum classifying spying routes may be constructed, each said route containing information about all the target hosts within a respective subgroup of hosts.
Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, apparatus features may be applied to method features and vice versa.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a schematic illustration of an embodiment of an apparatus for identifying an insecure host amongst a set of hosts or nodes;
Figure 2 is a schematic illustration of the sending of spam messages using data from the insecure host of Figure 1;
Figure 3 is a flow chart representing operation of the embodiment of Figure 1; and
Figure 4 is an illustration of a Fano plane.
In the discussion that follows, certain of the described embodiments relate to identification of an insecure host or hosts in a set of hosts, by way of example. Hosts are nodes that host a particular application or service, for example a webpage or website. Further embodiments relate, more generally, to any set of uniquely addressable network locations, referred to as nodes, and further embodiments are identical to those described below but with the hosts being replaced by any other type of node.
Figure 1 is a schematic illustration of an apparatus for identifying one or more insecure hosts or nodes.
The system comprises an investigating processor 2 connected, via a network, to a set of nodes, each comprising a respective host 4, 6, 8, 10, 12, 14, 16 under investigation.
Typically the network is the internet, but may be any kind of wired or wireless network.
The investigating processor 2 includes a decoy processing module 18, which is operable to generate decoy spy entities each including a respective decoy e-mail address, and a processing module 20 that is configured to monitor activity at the decoy e-mail addresses and to analyse that activity in order to identify non-secure hosts.
The decoy spy entities in the preferred embodiment are mobile software agents that pretend to be e-commerce entities originating from private users. Such mobile software agents are executable files containing software code which can be executed by a host, or node, in a network. Each spy entity contains a decoy personal e-mail address and other private data that may be (mis-)used for marketing reasons. Each spy entity also incorporates a routing plan that contains all the hosts that this spy entity intends to visit and to which it will expose its private data.
The internal structure of the agent can however be organised in any suitable manner.
The Foundation for Intelligent Physical Agents (FIPA), www.fipa.org, provides specifications for generic agent technologies.
In the preferred embodiment, the decoy e-mail address and the private data within each decoy spy entity have the following properties:
* they represent temporary pseudo-personal information that is, in reality, a trap, in the sense that it is set up solely for the purposes of the testing, without being associated with any real person;
* they are unique and different from all other private data used by any other decoy spy entity;
* they appear to be normal personal private data (e.g. JSmith@yahoo.com as the decoy email address).
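Purely as an illustrative sketch (the patent does not prescribe an implementation), decoy addresses with these three properties might be generated as follows; the name lists and the example.com domain are assumptions, not part of the disclosure:

    import secrets

    FIRST = ["john", "mary", "alex", "nina"]
    LAST = ["smith", "jones", "brown", "patel"]

    def mint_decoy_address(used):
        # Returns a fresh, plausible-looking decoy address that is unique
        # within `used` and deliberately tied to no real person.
        while True:
            addr = (secrets.choice(FIRST) + "." + secrets.choice(LAST)
                    + str(secrets.randbelow(100)) + "@example.com")
            if addr not in used:
                used.add(addr)
                return addr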
The agents are legitimate in the sense that they are intended for interacting with the hosts in a defined way, and the hosts expect to deal with such agents. In the example of Figure 1, the hosts are on-line retailer hosts, and the software agents are configured to request a price for a particular item from each of the hosts. The agents then return to the originator, the investigating processor 2, with prices from a number of different retailers.
The decoy e-mail addresses are hosted by the investigating processor 2, and are indicated schematically by the hatched boxes in Figure 1. In variants of the preferred embodiment, the decoy e-mail addresses are hosted by different processors or servers distinct from the investigating processor 2, which may provide an additional layer of security and may make it more difficult for an insecure host to detect that it may be under investigation.
The operation of the system of Figure 1 is represented in overview by the flow chart of Figure 3. In operation, each of the decoy entities is sent from the investigating processor 2 to a respective subset of the set of hosts under investigation, via a pre-defined route. The agents are forwarded from one host to another in the network using standard network transport protocols, such as TCP/IP in the case of the internet.
Each decoy entity interacts with each host of the subset of hosts on its route before being passed on to the next host on the route, and finally being returned to the investigating processor 2. Each host may either legitimately or illegitimately record the decoy e-mail address or other data from the or each decoy entity with which it interacts, and then may use that e-mail address or other data either legitimately or illegitimately.
Illegitimate use may comprise sending spam e-mail to the decoy e-mail address in question or by passing the decoy e-mail address, either knowingly or unknowingly, to a third party who subsequently sends a spam e-mail to the decoy e-mail address in question.
As each host is able to determine that the agent will be sent to other hosts in addition to itself, an insecure host may be lured to act insecurely, as it may conclude that any insecure behaviour could not be attributed to it in particular rather than to the other hosts.
It should be noted that a malicious host may not send spam itself, but may instead collude with other hosts, and contribute to such actions implicitly, or even unwillingly (for instance a host may leak data to third parties due to inappropriate storage security).
We assume that this is also an unsolicited action, and hence such collusions are considered infringements.
Following the sending of the decoy entities, the processing module 20 of the investigating processor monitors any activity at the decoy e-mail addresses occurring during a pre-determined period of time.
In the example shown in Figure 1, the first and second of the hosts 4, 6 are insecure. It can be seen from Figure 1 that the first and second of the hosts 4, 6 are included in the subsets of hosts of two of the decoy entities, whose routes are indicated by the solid and dashed lines in Figure 1.
Subsequently, the e-mail addresses of the two decoy entities are passed by the insecure host 6 to a spammer 30, as illustrated in Figure 2. The spammer 30 sends spam e-mails to the two decoy entity addresses 32, 34, as shown by solid arrows in Figure 2.
An infringement is detected when unsolicited email is received at one or more decoy e-mail addresses; such email may further contain information linked to the private data that a spy entity has been provided with or, even worse, it may carry viruses or malware. The assessment of such outcomes is performed by the processing module 20 of the investigating processor 2 in a secure environment (for securely identifying and dealing with attached viruses, malware, etc.).
The decoy entities and their routes are selected such that, if any sign of security violation is detected at a decoy e-mail address (e.g. spam email, viruses), that security violation can only be attributed to one (or more) remote host(s) within the subset of hosts to which the corresponding spy entity has been exposed. Thus, by appropriate selection of the decoy entities and their respective routes, the processing module 20 is able to determine unambiguously any insecure hosts from collective analysis of the security violations identified at the decoy e-mail addresses.
It has been found that group testing algorithms, and more especially non-adaptive group testing (NGT) algorithms, are particularly well suited for addressing the problem of selecting routes for the decoy entities so as to enable unambiguous identification of insecure hosts, as discussed in more detail below. The processing module 20 of the preferred embodiment is configured to perform analysis of the security violations detected at the decoy e-mail addresses using such a non-adaptive group testing algorithm.
The routes followed by the decoy entities of Figure 1 are represented in Table 1, which shows the incidence matrix M of the group testing algorithm used in the example of Figure 1.

          C1  C2  C3  C4  C5  C6  C7
    R1     1   1   1   0   0   0   0
    R2     1   0   0   1   1   0   0
    R3     1   0   0   0   0   1   1
    R4     0   1   0   1   0   1   0
    R5     0   1   0   0   1   0   1
    R6     0   0   1   1   0   0   1
    R7     0   0   1   0   1   1   0

Table 1: 2-(7,3,1) incidence matrix
The rows of the table, labelled R1 to R7, represent routes (tests) and the columns, labelled C1 to C7, represent the target hosts included in the routes.
It can be seen that the first three routes (R1, R2, R3) are represented schematically on Figure 1 by the solid, dashed and dashed-dotted lines, by way of example. The other routes (R4, R5, R6, R7) are not represented on Figure 1 for reasons of clarity.
The group testing algorithm used to select the routes in the example of Figure 1 is 2-disjunct, meaning that it is able to reliably detect up to two insecure hosts in the set of hosts. If three or more insecure hosts are present in the set of hosts, application of the algorithm by the processing module 20 would produce an incorrect result.
For example, if only the first two hosts are malicious, as is the case in the example of Figure 1, then the outcome vector output by the processing module 20 from analysis of the results after the predetermined period of time is A = {1,1,1,1,1,0,0}, in which each entry corresponds to a respective decoy e-mail address, a 1 indicates that activity has been detected at the decoy e-mail address (a positive test) and a 0 indicates that no activity has been detected at the decoy e-mail address (a negative test). The insecure hosts are the ones that do not appear in any negative test (R6 and R7), and it can be seen from Table 1 that they are the first two hosts 4, 6, as shown schematically in Figure 2.
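The decoding just described can be reproduced with a short Python sketch (illustrative only, not the patent's implementation), using the Table 1 matrix and the outcome vector A = {1,1,1,1,1,0,0}:

    # Rows R1..R7 of Table 1 (the 2-(7,3,1) incidence matrix).
    M = [
        [1, 1, 1, 0, 0, 0, 0],  # R1
        [1, 0, 0, 1, 1, 0, 0],  # R2
        [1, 0, 0, 0, 0, 1, 1],  # R3
        [0, 1, 0, 1, 0, 1, 0],  # R4
        [0, 1, 0, 0, 1, 0, 1],  # R5
        [0, 0, 1, 1, 0, 0, 1],  # R6
        [0, 0, 1, 0, 1, 1, 0],  # R7
    ]
    A = [1, 1, 1, 1, 1, 0, 0]  # outcome vector: R6 and R7 are negative

    # A host is flagged insecure iff it appears in no negative test.
    insecure = [c + 1 for c in range(7)
                if all(M[r][c] == 0 for r in range(7) if A[r] == 0)]
    print(insecure)  # [1, 2] -> the first two hosts (4 and 6 in Figure 1)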
However, if, say, the first three hosts are defective, then the outcome vector would be A = {1,1,1,1,1,1,1}. The decoding by the processing module 20 using the group testing algorithm would determine that all hosts are insecure, which is false. The falsity of the outcome, however, can be correctly identified by the processing module, since no more than two insecure hosts are expected to be present.
The processing module 20 is configured to output a signal if the results of the analysis seem to indicate that more than two insecure hosts are present, as in that case the group testing algorithm that has been applied is not suitable for the number of insecure hosts.
The testing can then be reconfigured in response to the signal, either by selecting a different set of hosts, expected to contain no more than two insecure hosts, or by using a different group testing algorithm suitable for detecting the presence of three or more insecure hosts.
In a variant of the preferred embodiment, the processing module is configured to process the results of the monitoring periodically, or in response to activity at the decoy addresses. The number of insecure hosts is re-evaluated each time. If the maximum number of insecure hosts is reached, then the processing module 20 is configured to indicate that the design is not adequate and to halt the monitoring and processing.
In the example shown in Figure 2, the processing module would be able to determine unambiguously which of the hosts are insecure, as there were only two insecure hosts in the group and a 2-disjunct incidence matrix was used.
The anti-spamming security evaluation is based on collective outcomes (i.e. detectable impacts), which may either be negative (if there is no sign of violation) or positive (if, for instance, e-mails are received at one or more of the decoy addresses). An important feature of the embodiments is the ability to identify which remote hosts have been responsible (either directly or indirectly) for received spam emails, given any particular set of outcomes, for instance any particular combination of decoy e-mail addresses having received e-mails.
In the preferred embodiment described in relation to Figure 1, decoy information is provided to the hosts under investigation by way of software agents. It should be noted that in variants of the preferred embodiment, traditional client-server based interactions rather than software agents may be used in order to disseminate the decoy information to the hosts under investigation.
In such variants, it is assumed that the hosts collude in order to share their information for their shared benefit, under the following condition: if there is an insecure host within the group of hosts, it will know which of the private information it has is shared with which other hosts, but it will not be able to acquire other information that it does not already have. In such a situation, as is the case when software agents are used, the insecure hosts may be lured to act insecurely, as they would be aware that the decoy address or addresses that they have received were also known by other hosts, and they may presume that any such insecure action could not be attributed to them in particular.
Another variant of the preferred embodiment assumes that email messages are sent from decoy email accounts, each email message being addressed to the email addresses belonging to a subgroup of network hosts/nodes. In such a situation, a potentially insecure host may be lured to misuse a decoy email address, given that it is aware that this email address has also been exposed to other hosts/nodes.
In the preferred embodiment described in relation to Figure 1, the use of one particular group testing algorithm has been described. However, it is a feature of the preferred embodiment that any one of a large number of group testing algorithms may be used to select the routes of the decoy entities. The particular group testing algorithm to be used is, in the preferred embodiment, selected in dependence upon criteria that have been found to provide desired security and efficiency. The selection of a suitable group testing algorithm is now described in more detail.
Selection of group testing algorithms

The decoy entities, and their respective routes, represented by the group testing algorithm, are selected using group testing theory, which addresses the problem of efficiently identifying the defectives in a (large) population of items containing a small set of defectives, as discussed in more detail below.
In order to construct anti-spam spy entity routes for optimum assessments, it is assumed that:
* A decoy is exposed to a certain subset of target platforms, in an order that is of no particular significance. A subset of target nodes to be visited defines a route.
* A target platform is aware of all the routes of all the spies that visit it.
* An untrustworthy spy agent will always perform an act which will enable the sender of the agent to detect that it has visited a malicious platform. That is, a spy agent visiting a malicious node will always yield a positive outcome.
We consider spying routes as unordered sets of nodes that spies visit, where each set returns "positive" when it contains at least one malicious node, and "negative" otherwise. The problem of identifying the malicious (defective) nodes, given a number of such sets, is identical to the classic group testing problem.
Security is a fundamental factor that drives the design of optimal sets of spy agent routes. The following spying objectives are considered:
* Target nodes should be incapable of deriving whether they are dealing with spies or not;
* Target nodes should be given motives to misbehave, by using spies as "baits".
From the above requirements, the following (uncorrelated) optimisation criteria have been derived:
(Sec-1) The assumption we have made, that it is always evident when a spy visits at least one malicious node, satisfies the general spying requirements. We argue that the more target nodes a spy agent visits, the greater the likelihood that this assumption will hold and, subsequently, the more reliable the tests are;
(Sec-2) We also argue that the reliability of an outcome for a chosen target node is improved when the number of other common nodes visited by two or more spy agents that also visit the chosen target node is minimised. Thus reliability can be maximised by having any two spies visit no more than one common target node.
The following criteria are also identified in order to provide an efficient solution:
(Eff-1) Optimality in terms of efficiency is achieved if a maximum number of target nodes is tested using a given number of spy agents;
(Eff-2) A certain group testing algorithm may be limited by the number of defectives it is able to identify successfully. An alternative optimisation strategy suggests maximising the number of malicious nodes that can be detected successfully within a given number of target nodes.
It has been found that the spy optimisation problem is a version of the classic group testing optimisation problem, modified by the inclusion of some additional security optimisation criteria.
In the preferred embodiment, non-adaptive Group Testing (NGT) is used. Non-adaptive group testing may be more suitable for use in addressing the problem of identifying insecure nodes than sequential algorithms, as the time needed for testing in order to obtain a correct, unambiguous result cannot be known exactly, and may be relatively long. For example if a target node violates the confidentiality of the spy's email and uses it for spamming or marketing purposes it may take some days or weeks before a detectable impact of such violation is obtained.
The use of a sequential/adaptive algorithm may lead to inconsistencies if the time needed for testing is underestimated. For example, if a sequential/adaptive algorithm is used and a test Ti according to the algorithm is assessed as being negative after a certain time, a further test Ti+1 is then carried out; but if the time for testing has been underestimated and spam for test Ti is then received, the following test Ti+1 and all subsequent tests will have been invalidated.
It is noted that the expectation that nodes process all spies consistently according to their genuine intentions is more likely to be valid when the time frame of the testing is relatively narrow.
A more formal analysis of the selection of decoy entities and their routes is now provided.
The following notation is used:
S: the set of all target nodes we wish to test;
n = |S|: the cardinality of S;
d: the maximum number of spammers that S may contain;
R: a set of subsets of S.
We define a spy route design, D, as a set system D = (S,R).
The route design is uniform if all routes contain the same number of nodes. The cardinality of R, |R|, represents the total number of spies (R = {Ri}). |Ri| represents the number of target platforms to which the decoy spy Ri is exposed. A spy, Ri, can be seen as a binary subset of indices of target nodes. Similarly, a target node, Cj, can be seen as a binary subset of the indices of the spies it meets.
A spy NGT scheme can be represented as an incidence matrix M = (mij), where rows are labelled by routes (tests), Ri, and columns by target nodes, Cj. Thus, mij = 1 if route i contains node j, and mij = 0 otherwise. Table 1 described above is an example of an incidence matrix.
The outcome of all tests can be represented by a vector A, where Ai = 1 if Ri contains a defective Cj, and Ai = 0 otherwise. The union (bitwise OR) of all defective columns gives the outcome vector.
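For illustration only (not part of the disclosure), the outcome vector for a hypothesised set of defective columns can be computed directly from this definition; a minimal Python sketch:

    def outcome_vector(matrix, defectives):
        # A[i] = 1 iff test i contains at least one defective column.
        return [int(any(row[j] for j in defectives)) for row in matrix]

    # With the Table 1 matrix M and defective columns C1, C2 (indices 0, 1):
    # outcome_vector(M, {0, 1}) == [1, 1, 1, 1, 1, 0, 0]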
The main spy group testing problem is then to define an incidence matrix M that will identify all d defectives, regardless of how they are distributed. In that case M is said to be a d-classifier.
Using the above notation, the spy NGT objectives set out above can be expressed as follows. Given S, d and n, find a d-classifier matrix M such that:
(Sec-1): for given other parameters, maximise |Ri|.
(Sec-2): the intersection of any two Ri and Rj contains no more than one node.
(Eff-1): for given other parameters, maximise n.
(Eff-2): for given other parameters, maximise d.
In order to present such a construction, three more definitions are required:
* M is classified as d-disjunct if the union of any d columns does not contain another column;
* a block design is an ordered pair (V,B), where the set B is a collection of |B| = b subsets (blocks) of the set V, and each element of V is contained in r different blocks;
* a t-(u,k,λ) design is a block design (V,B) where |V| = u, for any block B in B, |B| = k, and any t-subset of V occurs in exactly λ blocks.
It has been shown in W. Kautz and R. Singleton, IEEE Transactions on Information Theory, 10(4), 363-377, 1964, that a d-disjunct M is a d-classifier; in this case, the items that do not appear in any negative Ri are exactly the defectives.
There is a plethora of diverse techniques for constructing disjunct matrices known in the art; see for instance H.Q. Ngo and D.Z. Du, Discrete Mathematical Problems with Medical Applications, DIMACS Ser. Discrete Math. Theoret. Comput. Sci. 55, 171-182. Given that this is still largely an incomplete mathematical theory, we consider, for our anti-spam spies, constructions based on 2-designs, for the following reasons. The second security requirement (Sec-2) characterises 2-(u,k,1) designs (by definition).
It is shown in D.J. Balding and D.C. Torney, Journal of Combinatorial Theory, Series A, 74(1), 131-140, 1996, that optimal pooling designs are equivalent to maximum-sized collections of columns such that no column within the collection is contained in the union of d others.
F.K. Hwang and V.T. Sós, Studia Sci. Math. Hungar., 22, 257-263, 1987, showed that this requirement characterises d-complete designs. It is also shown in P. Erdős, P. Frankl and Z. Füredi, Journal of Combinatorial Theory, Series A, 33, 158-166, 1982, that for a small number of defectives, and more precisely for d ≤ 2, there are cases where 2-designs are optimal.
Even though the theory of existence of 2-designs is incomplete, there is, fortunately, a wealth of known results, as described for instance in C.J.Colbourn and J.H.Dinitz, The CRC Handbook of Combinatorial Designs, CRC Press, 1996.
Our construction is based on the following theorem: an incidence matrix, M, of a 2-(u,k,1) design in which the blocks are used as columns and the elements as rows is (k-1)-disjunct. Hence, a 2-(u,k,1) design is a d-classifier with d = k-1.
For a 2-(u,k,λ) design, the following equations apply:
b = λu(u-1)/(k(k-1))
r = λ(u-1)/(k-1)
Putting it all together, the NGT anti-spam spy routing construction may be defined using the following algorithm. Find a 2-design 2-(u,k,1) such that:
Sec-1: given other parameters, maximise (u-1)/(k-1) = r.
Eff-1: given other parameters, maximise u(u-1)/(k(k-1)) = b.
Eff-2: given other parameters, maximise k = d+1.
It should be noted that the first two requirements, Sec-1 and Eff-1, are similar. That can be understood qualitatively: a larger number of blocks (i.e. target platforms) implies a larger number of blocks appearing in an element (i.e. a larger cardinality of each route), which serves both the requirement for efficiency and the requirement for more reliable results. The third requirement, Eff-2, further demands the maximisation of d, which conflicts with the other requirements: a large number of defectives will compromise both the efficiency and the reliability.
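As a quick numerical check of the equations above (an illustrative Python sketch, not part of the disclosure): for the Fano plane of Table 1, u = 7, k = 3 and λ = 1 give b = 7 and r = 3.

    def design_params(u, k, lam=1):
        # Number of blocks b and replication number r of a 2-(u,k,lam) design.
        b = lam * u * (u - 1) // (k * (k - 1))
        r = lam * (u - 1) // (k - 1)
        return b, r

    print(design_params(7, 3))   # (7, 3)  -> the Fano plane of Table 1
    print(design_params(15, 3))  # (35, 7) -> the design used in Example 2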
Each one of these optimisation strategies should be selected according to given circumstances/parameters. The optimum group testing algorithm would then be obtained according to the selected optimisation strategy.
It should once again be noted that the theory of existence of 2-designs is not complete.
However, there is a rich collection of known such designs. In the preferred embodiment, a design to be used in selecting the decoy entities and their routes may be selected from a table of known 2-designs.
Example 1: Demonstration of a classifying design

We now present a more detailed formal discussion of the NGT construction used in the example of Figure 1, and represented in Table 1, which is a simple and fairly straightforward NGT construction.
Finite projective planes are an interesting special case of 2-designs. The smallest finite projective plane (of order two) is known as the Fano plane, a 2-(7,3,1) design with u = b = 7 and k = r = 3. The associated decoy system yields 7 nodes, 7 decoys and 3 nodes per decoy (and 3 decoys per node). The Fano plane can be seen in Figure 4. Note that each point has 3 lines on it and each line contains 3 points.
There exists only one block design corresponding to the Fano plane, which is the following:
(V,B): B = {B1, B2, B3, B4, B5, B6, B7}
B1 = {s1,s2,s3}
B2 = {s1,s4,s5}
B3 = {s1,s6,s7}
B4 = {s2,s4,s6}
B5 = {s2,s5,s7}
B6 = {s3,s4,s7}
B7 = {s3,s5,s6}
The columns of the incidence matrix are then constructed from the blocks of the Fano plane, giving the incidence matrix M shown in Table 1.
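For illustration only, the step from these blocks to the incidence matrix of Table 1 can be expressed in a few lines of Python (labels s1..s7 and B1..B7 as above):

    # Blocks B1..B7 over elements s1..s7, as listed above.
    blocks = [
        {1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6},
        {2, 5, 7}, {3, 4, 7}, {3, 5, 6},
    ]

    # m[i][j] = 1 iff element s(i+1) lies in block B(j+1): row i is the
    # route of decoy i, column j is target node j.
    M = [[int(i + 1 in Bj) for Bj in blocks] for i in range(7)]
    for row in M:
        print(row)  # reproduces the 2-(7,3,1) incidence matrix of Table 1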
Mapping this design to our routing strategy, we may easily determine which nodes (Bi) each decoy (si) should visit. Each decoy spy should carry a unique pseudo-personal email address and private data.
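To make the mapping concrete, here is a minimal Python sketch (ours; the names blocks, routes and identify_insecure are illustrative) that derives each decoy's route from the Fano blocks above and then performs the standard non-adaptive group-testing decode: a node is declared insecure exactly when every decoy sent to it has received spam.

    # Fano blocks as listed above: block Bi holds the decoys visiting node Bi.
    blocks = {
        "B1": {"s1", "s2", "s3"}, "B2": {"s1", "s4", "s5"},
        "B3": {"s1", "s6", "s7"}, "B4": {"s2", "s4", "s6"},
        "B5": {"s2", "s5", "s7"}, "B6": {"s3", "s4", "s7"},
        "B7": {"s3", "s5", "s6"},
    }

    def routes(blocks):
        """Invert the design: for each decoy, the set of nodes it visits."""
        out = {}
        for node, decoys in blocks.items():
            for s in decoys:
                out.setdefault(s, set()).add(node)
        return out

    def identify_insecure(blocks, spammed):
        """Flag a node as insecure iff all of its decoys received spam;
        correct for up to d = k-1 = 2 insecure nodes, since the
        incidence matrix of this 2-(u,k,1) design is 2-disjunct."""
        return sorted(n for n, decoys in blocks.items() if decoys <= spammed)

    print(routes(blocks)["s1"])   # decoy s1 visits {'B1', 'B2', 'B3'}
    print(identify_insecure(blocks, {"s1", "s2", "s3", "s4", "s5"}))
    # -> ['B1', 'B2']

Note that the decode is a pure function of the observed spam pattern, which is why, as discussed at the end of Example 2 below, late-arriving spam merely means re-running the decode on the updated observations, with no extra tests.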
Example 2
Suppose that we need to test a number of online retailers with which people store sensitive information such as email addresses, credit cards, product preferences, etc. We assume that we explicitly never consent to these retailers passing such private data to other third parties. We consider one of the following cases: 1) We use mobile agents to expose instances of such decoy information to combinations of retailers. As such, each retailer may see the list of other retailers to which a visiting mobile agent is bound to migrate and be exposed. We make sure that no retailer is ever visited by two decoy mobile agents that also have another retailer in common. Hence, each malicious retailer will be lured to misbehave.
2) We use traditional client-server interactions to disseminate our decoy information to the online retailers. We assume that the retailers collude in order to share their information for mutual benefit. However, in such cases a retailer does not wish to expose to any other retailer a secret that the latter does not already have. The first retailer only wishes to learn which of its private data items are common with the other retailer, without revealing anything else. This is possible and can be achieved with cryptographic protocols using privacy homomorphism (PH), such protocols being described for instance in A.C-F. Chan, INFOCOM 2004, 4, 2414-2424, 2004 (a toy sketch of such a matching step appears after this list). Hence, if there is a malicious retailer within the group, he will know which of the private information he holds is shared with which other retailers, but he will not be able to acquire other information that he does not already have. As in case 1, the malicious retailer who believes he is "covered" will be lured to misbehave.
3) We use our set of decoy email accounts to directly email subgroups of the online retailers.
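For case 2, the following is a toy Python sketch of a private matching step. It is not the privacy-homomorphism protocol of the Chan reference; it is a simplified commutative-exponentiation (Diffie-Hellman style) matching, with deliberately insecure toy parameters, shown only to make concrete the property that each party learns which items it shares with the other and nothing more. All names and values are our own assumptions.

    import hashlib
    import secrets

    P = 2**127 - 1   # toy prime modulus -- NOT a secure parameter choice
    G = 3            # toy generator

    def encode(item):
        """Map an item to the group element G**H(item) mod P."""
        h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
        return pow(G, h % (P - 1), P)

    def blind(value, key):
        """Exponentiation by a secret key; two blindings commute."""
        return pow(value, key, P)

    # Each retailer holds private decoy data and a secret key.
    a_key = secrets.randbelow(P - 3) + 2
    b_key = secrets.randbelow(P - 3) + 2
    a_items = ["decoy1@example.org", "card-1111"]
    b_items = ["decoy1@example.org", "card-2222"]

    # Each side blinds its items, exchanges them, and blinds the other
    # side's values again: G**(H(x)*a*b) coincides iff the items match.
    a_twice = {blind(blind(encode(x), a_key), b_key) for x in a_items}
    b_twice = {blind(blind(encode(x), b_key), a_key) for x in b_items}
    print(len(a_twice & b_twice))   # 1: only the shared item is revealed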
We assume that past security analyses show that the number of retailers per decoy data should be at least 5, in order to maintain a proper lure. Hence:

r = (u-1)/(k-1) ≥ 5, so u ≥ 5k-4

Assuming that there are no more than 2 malicious retailers (if there are more, the falsity of this assumption will be revealed by the results):

k = d+1 = 3, so u ≥ 11.
We may use the (15,35,7,3,1) = (u,b,r,k,λ) 2-design, which satisfies our requirements in this example. At least 80 pairwise non-isomorphic 2-designs with these parameters are known.
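As a quick arithmetic check (ours, not part of the original text), these parameters can be verified against the constraints just derived:

    u, k, lam, d = 15, 3, 1, 2   # the 2-(15,3,1) design chosen above

    r = lam * (u - 1) // (k - 1)             # retailers per decoy route
    b = lam * u * (u - 1) // (k * (k - 1))   # retailers that can be tested

    assert r >= 5            # lure requirement: r = 7
    assert u >= 5 * k - 4    # i.e. u >= 11, and u = 15
    assert k == d + 1        # classifies up to d = 2 malicious retailers
    print((u, b, r, k, lam)) # -> (15, 35, 7, 3, 1)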
In such scenarios, the detection system may need to wait for about a month before assuming that the results it obtains (from analysing the decoy email inboxes) are mature enough. However, one benefit of using non-adaptive group testing is that even if spam suddenly arrives at a 'clean' (negative) decoy email after, say, two months, no extra tests are required and the analysis can simply be carried out on the results available at that time.
It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
Each feature disclosed in the description, and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination.

Claims (21)

CLAIMS:

1. A method of identifying one or more insecure nodes amongst a set of nodes, comprising: providing each of a plurality of decoy entities to a respective subset of the set of nodes, each decoy entity comprising a respective decoy address; monitoring the decoy addresses; and processing the results of the monitoring to determine the combination of decoy addresses that have received at least one message and to identify a node or nodes as being insecure consistent with that combination.

2. A method according to Claim 1, wherein at least one of the subsets of nodes comprises a plurality of nodes, and preferably each of the subsets of nodes comprises a respective plurality of nodes.

3. A method according to any preceding claim, wherein each decoy entity comprises respective data that enables each of the nodes in the subset of nodes for that decoy entity to become aware of the identity of the other nodes in the subset of nodes.

4. A method according to any preceding claim, wherein the respective subset of nodes for each decoy entity is selected in accordance with a group testing algorithm, and the group testing algorithm is preferably a non-adaptive group testing (NGT) algorithm.

5. A method according to Claim 4, comprising selecting the group testing algorithm in dependence upon at least one characteristic of the set of nodes.

6. A method according to any preceding claim, wherein the subsets of nodes are such that the method is able to identify correctly up to a pre-determined number of insecure nodes, and preferably the method further comprises generating a signal if more than the pre-determined number of insecure nodes is identified.

7. A method according to any preceding claim, comprising selecting the respective subset for each of the plurality of decoy entities so as to maximise the number of nodes in each subset and/or so as to minimise the number of nodes in common between any pair of the decoy entities and/or so as to maximise the number of insecure nodes that can be identified correctly.

8. A method according to any preceding claim, wherein the subsets of nodes are such that the respective subset of nodes for each of the decoy entities has no more than one node in common with any other of the subsets of nodes.

9. A method according to any preceding claim, wherein the monitoring is performed for a pre-determined period of time, and the processing is performed on or after expiry of the pre-determined period of time.

10. A method according to any preceding claim, wherein each entity comprises a software agent, a data message or an e-mail message.

11. Apparatus for identifying one or more insecure nodes amongst a set of nodes, comprising: means for providing each of a plurality of decoy entities to a respective subset of the set of nodes, each decoy entity comprising a respective decoy address; means for monitoring the decoy addresses; and processing means configured to determine the combination of decoy addresses that have received at least one message and to identify a node or nodes as being insecure consistent with that combination.

12. Apparatus according to Claim 11, wherein at least one of the subsets of nodes comprises a plurality of nodes, and preferably each of the subsets of nodes comprises a respective plurality of nodes.

13. Apparatus according to Claim 11 or 12, wherein each decoy entity comprises data that enables each of the nodes in the subset of nodes for that decoy entity to become aware of the identity of the other nodes in the subset of nodes.

14. Apparatus according to any of Claims 11 to 13, further comprising means for selecting the respective subset of nodes for each decoy entity according to a group testing algorithm, and the group testing algorithm is preferably a non-adaptive group testing (NGT) algorithm.

15. Apparatus according to Claim 14, comprising means for selecting the group testing algorithm in dependence upon at least one characteristic of the set of nodes.

16. Apparatus according to any of Claims 11 to 15, further comprising means for generating a signal if the apparatus identifies more than a pre-determined number of nodes as being insecure.

17. Apparatus according to any of Claims 11 to 16, further comprising means for selecting the respective subset for each of the plurality of decoy entities so as to maximise the number of nodes in each subset and/or so as to minimise the number of nodes in common between any pair of the decoy entities and/or so as to maximise the number of insecure nodes that can be identified correctly.

18. Apparatus according to Claim 17, wherein the selecting means is configured to ensure that the respective subset of nodes for each of the decoy entities has no more than one node in common with any other of the subsets of nodes.

19. Apparatus according to any of Claims 11 to 18, wherein the monitoring means is configured to monitor the decoy addresses for a pre-determined period of time, and the processing means is configured to perform the processing on or after expiry of the pre-determined period of time.

20. Apparatus according to any of Claims 11 to 19, wherein each entity comprises a software agent, a data message or an e-mail message.

21. A computer program product storing computer executable instructions operable to cause a general purpose computer communications apparatus to become configured to perform a method in accordance with any one of Claims 1 to 10, or to become configured as apparatus in accordance with any one of Claims 11 to 20.
GB0717523.5A 2007-09-07 2007-09-07 Identification of insecure network nodes Expired - Fee Related GB2452555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0717523.5A GB2452555B (en) 2007-09-07 2007-09-07 Identification of insecure network nodes

Publications (3)

Publication Number Publication Date
GB0717523D0 GB0717523D0 (en) 2007-10-17
GB2452555A true GB2452555A (en) 2009-03-11
GB2452555B GB2452555B (en) 2012-05-02

Family

ID=38640474

Country Status (1)

Country Link
GB (1) GB2452555B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10887332B2 (en) 2015-06-15 2021-01-05 Nokia Technologies Oy Control of unwanted network traffic

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052709A (en) * 1997-12-23 2000-04-18 Bright Light Technologies, Inc. Apparatus and method for controlling delivery of unsolicited electronic mail
US20050141486A1 (en) * 2003-12-24 2005-06-30 Rod Gilchrist Method and apparatus for controlling unsolicited messaging
US20050257261A1 (en) * 2004-05-02 2005-11-17 Emarkmonitor, Inc. Online fraud solution
WO2006033936A2 (en) * 2004-09-16 2006-03-30 Red Hat, Inc. Self-tuning statistical method and system for blocking spam

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee (effective date: 20130907)