WO2010011182A2 - Method and system for tracing a source of leaked information

Info

Publication number: WO2010011182A2
Application number: PCT/SG2008/000273
Authority: WO
Inventors: Onn Chee Wong; Siew Keng Loh; Hui Yang; You Liang Wang
Original assignee: Resolvo Systems Pte Ltd
Priority date: 2008-07-25
Filing date: 2008-07-25
Publication date: 2010-01-28
Also published as: WO2010011182A3

Abstract

Embodiments of the invention provide a method and a system of tracing a source of leaked information owned by an organization after the information has been leaked on an external network. An identity-relationship database is built, wherein the identity-relationship database contains information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates. The leaked information on a site on the external network is located, and a network identity of the poster of the leaked information on the external network is determined. It is determined whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee. If the one or more links connect the network identity of the poster of the leaked information to the employee, the employee is identified as a possible source of the leaked information.

Description

Method and System for Tracing a Source of Leaked Information

Technical Field

[0001] Embodiments of the invention relate generally to a method and a system for tracing a source of leaked information owned by an organization after the information has been leaked on an external network.

Background

[0002] Information Leakage Detection and Prevention ("ILDP") is an emerging and fast-growing area in the field of information security. The business drivers for preventing information leakage have long existed. However, due to limited technological options in the past, organizations have relied on measures of limited effectiveness. With information going digital and Internet access becoming increasingly prevalent, the risk of sensitive corporate information and intellectual assets being leaked poses a problem.

[0003] One common shortcoming of existing ILDP solutions is that they aim to protect every single piece of valuable information, which leads to lengthy and laborious attempts to understand how every employee uses potentially sensitive information. Some ILDP solutions, especially those with client-side agents, require complex and time-consuming installation and configuration. Other conventional solutions require users to copy sensitive information to centralized locations, resulting in interruptions to business users.

[0004] In addition, organizations generally do not know the data context and hence are not able to create the relevant rules. The general approach of the other ILDP solutions makes this problem worse by requiring the organizations to understand the data context fully.

[0005] Most ILDP solutions do not possess context awareness and implement policies in a one-sided manner - by looking at the sender or source - without identifying who the recipients are. This further exacerbates the perception that ILDP obstructs business more than it benefits business.

[0006] In addition, there is no existing ILDP solution that is able to detect information that has already been leaked to Internet sites. With the increased popularity of Web 2.0 applications, the speed at which information spreads has increased, which makes timely discovery of public domain leakages more important.

[0007] Another shortcoming of the existing ILDP solutions is that there is no segregation of access to collected information from an administrator. This means all sensitive information that is captured by the ILDP system will be made available to the administrators.

[0008] Therefore, there is a need to provide a new method and system which overcome at least one of the above-mentioned problems.

Summary

[0009] In an embodiment, there is provided a method of tracing a source of leaked information owned by an organization after the information has been leaked on an external network. An identity-relationship database is built, wherein the identity-relationship database contains information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates. The leaked information on a site on the external network is located, and a network identity of the poster of the leaked information on the external network is determined. It is determined whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee. If the one or more links connect the network identity of the poster of the leaked information to the employee, the employee is identified as a possible source of the leaked information.

[0010] Another embodiment of the invention provides a system for tracing a source of leaked information owned by an organization after the information has been leaked on an external network. The system may include an identity-relationship database containing information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates. The system may also include a network device connected to the identity-relationship database and to the external network. The network device may be configured to locate the leaked information on a site on the external network; determine a network identity of the poster of the leaked information on the external network; determine whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee; and if so, identify the employee as a possible source of the leaked information.

Brief Description of the Drawings

[0011] In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the various embodiments. In the following description, various embodiments are described with reference to the following drawings, in which:

[0012] Fig. 1 shows a flowchart of tracing a source of leaked information owned by an organization after the information has been leaked on an external network according to an embodiment of the invention.

[0013] Fig. 2A shows a flowchart of building an identity-relationship database according to an embodiment of the invention.

[0014] Fig. 2B shows a flowchart of a process for obtaining the actual identity of the sender of the digital communication according to an embodiment.

[0015] Fig. 3 shows a flowchart of building an identity-relationship database according to another embodiment of the invention.

[0016] Fig. 4 shows an exemplary identity-relationship graph according to an embodiment of the invention.

[0017] Fig. 5 shows a flowchart of a process for detecting the leaked information, e.g. sensitive source code, on network-accessible sites on an external network according to an embodiment of the invention.

[0018] Fig. 6 shows another embodiment of the invention to trace the source of the leaked information even if the network identity of the poster of the leaked information is not in the identity-relationship database.

[0019] Fig. 7A shows a system for tracing a source of leaked information owned by an organization according to an embodiment of the invention.

[0020] Fig. 7B shows a schematic diagram of some components of the system 700 according to an embodiment.

[0021] Fig. 8 shows a schematic diagram of a system implemented in a digital communication network according to an embodiment.

[0022] Fig. 9A shows an exemplary piece of source code. Fig. 9B shows a table of elements extracted from the source code 900, classified as unique identifying elements or generic elements according to an embodiment.

[0023] Fig. 10 shows a flowchart of a process for determining the set of unique identifying elements that identify the sensitive source code module accessed from the source code repository according to an embodiment of the invention.

Detailed Description

[0024] Various embodiments of the invention provide a method and a system for tracing a source of leaked information owned by an organization after the information has been leaked on an external network.

[0025] Fig. 1 shows a flowchart of tracing a source of leaked information owned by an organization after the information has been leaked on an external network according to an embodiment of the invention.

[0026] At 102, an identity-relationship database is built, wherein the identity-relationship database contains information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates.

[0027] At 104, the leaked information on a site on the external network is located.

[0028] At 106, a network identity of the poster of the leaked information on the external network is determined.

[0029] At 108, it is determined whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee.

[0030] At 110, if the one or more links connect the network identity of the poster of the leaked information to the employee, the employee is identified as a possible source of the leaked information.

[0031] The embodiment of the invention provides traceability of the source of leakages by building a web of relationships between the online identities that leak the confidential information and internal user identities. The identity-relationship database may integrate identities captured within the internal network, e.g. within the organization, and identities captured from the external network, e.g. the Internet sites. The integrated identity-relationship database is critical for identifying sources of leakages to popular Internet sites, such as blogs, social networking sites and forums.
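
The link check at 108 and 110 can be illustrated as a graph search. The following is a minimal sketch, assuming the identity-relationship database is held as an in-memory dictionary of links; the function name `find_link_path` and all identities shown are hypothetical and do not come from the described embodiments.

```python
from collections import deque


def find_link_path(links, poster, employee_identities):
    """Breadth-first search from the poster's identity to an employee identity.

    links: dict mapping a network identity to the set of identities linked to it
           (an assumed in-memory stand-in for the identity-relationship database).
    Returns the shortest chain of identities, or None if no chain exists.
    """
    queue = deque([[poster]])
    seen = {poster}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in employee_identities:
            return path
        for neighbour in sorted(links.get(current, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical data: the poster is linked to an employee through one friend.
links = {
    "blogger_x": {"friend_a"},
    "friend_a": {"blogger_x", "alice@corp.example"},
    "alice@corp.example": {"friend_a"},
}
employee_identities = {"alice@corp.example": "Alice (employee)"}

path = find_link_path(links, "blogger_x", employee_identities)
if path is not None:
    print("possible source:", employee_identities[path[-1]], "via", path)
```

A returned chain only marks the employee as a possible source; the length of the chain gives one rough indication of how indirect the link is.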

[0032] Fig. 2A shows a flowchart of building an identity-relationship database according to an embodiment of the invention.

[0033] At 202, the name of an employee and one or more internal network identities used by the employee are extracted from data in the organization's internal human resources database and network servers.

[0034] At 204, the employee is linked to the one or more internal network identities in the identity-relationship database.

[0035] At 206, a message being sent over a network is intercepted. In an embodiment, the message may be an encoded or encrypted message, and the intercepting of the message includes decoding or decrypting the message.

[0036] At 208, a network identity of the message sender is extracted from the message.

[0037] At 210, a network identity of the intended recipient is extracted from the message.

[0038] At 212, the network identity of the message sender is linked to the network identity of the intended recipient.

[0039] According to this embodiment, internal network identities are collected from internal sources within the organization. Network identities which are of interest to be extracted include the identities used by employees for accessing instant messaging, personal email, forums, social networking sites, blogs and other Web-based services (including, for example, Web 2.0 services). The collected internal network identities are linked to their respective users, i.e. the employees. The collected internal network identities may also be linked to the internal corporate identities by resolving them against the organization's directory and network servers.

[0040] In this embodiment, the counter-part identities from the intercepted traffic, e.g. the intercepted message, may also be captured. These counter-part identities are the network identities of the intended recipient of the message, which may be identities within the internal network or identities in the external network. The network identities of the intended recipient of the message may be considered as the first layer of friends of the internal user.

[0041] In another embodiment, the building of an identity-relationship database further includes recording the frequency of communication between the network identity of the message sender and the intended recipient.
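
Steps 208 to 212, together with the frequency recording, can be sketched for the e-mail case as follows. This is an illustrative sketch only: it assumes the intercepted message is already available as decoded RFC 5322 text, and the function name `record_message` and the addresses are hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def record_message(raw_message, link_counts):
    """Link the sender of one intercepted e-mail to each intended recipient.

    link_counts maps (sender, recipient) to the number of messages seen,
    which serves as the recorded frequency of communication.
    """
    msg = message_from_string(raw_message)
    senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr.lower() for _, addr in getaddresses(recipient_headers)]
    for sender in senders:
        for recipient in recipients:
            link_counts[(sender, recipient)] += 1


# Hypothetical intercepted message, recorded twice to show the counter growing.
raw = (
    "From: Alice <alice@corp.example>\n"
    "To: bob@mail.example, Carol <carol@mail.example>\n"
    "Subject: lunch\n"
    "\n"
    "See you at noon.\n"
)
link_counts = Counter()
record_message(raw, link_counts)
record_message(raw, link_counts)
print(link_counts)
```

Other message types, such as instant messaging traffic, would need their own extraction of sender and recipient identities before the same linking step.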

[0042] The above embodiment of building the identity-relationship database may be implemented in a network gateway device which is configured to intercept a digital communication, e.g. a message, and assess whether the message contains sensitive information. The network gateway device may include a network traffic analyzer integrated with enterprise directory and Dynamic Host Configuration Protocol (DHCP) servers to obtain the real user identities from the captured internet protocol (IP) addresses or machine hostnames. For example, the network identity of a sender may be determined from the captured traffic's source IP address and content. Further, the reporting hierarchy, i.e. the reporting officer, for each user may also be extracted from the enterprise directory. This may facilitate the automatic escalation of detected incidents to the appropriate supervisor.

[0043] Fig. 2B shows a flowchart of a process for obtaining the actual identity of the sender of the digital communication, e.g. the message, according to an embodiment of the invention. At 252, "User Name" is obtained, e.g. from a MICROSOFT® ACTIVE DIRECTORY® technology Windows Event ID 672, using the source IP address of the captured digital communication. If "User Name" is not found at 252, the digital communication is checked to determine if it is of email type at 254, regardless of whether the digital communication is native or web-based. If the digital communication is of email type, the user identity is obtained at 256 by matching the extracted sender's email address against the existing identity-relationship database.

[0044] If the digital communication is not of email type, the digital communication is checked to determine if it is of instant messaging type at 258, regardless of whether the digital communication is native or web-based. If the digital communication is of instant messaging type, the user identity is obtained at 260 by matching the extracted sender login ID against the existing identity-relationship database. If the digital communication is not of instant messaging type, the digital communication is treated as from an "Unknown" user at 262 and relevant correlation rules may be applied accordingly by the network gateway device to prevent leakage of sensitive information.
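
The decision cascade of Fig. 2B (252 to 262) might be sketched as below. The dictionaries standing in for the directory logon records and the identity-relationship database, the field names of the captured communication, and the fallback to "Unknown" when a lookup finds no match are all assumptions made for illustration.

```python
def resolve_sender(comm, logon_events, identity_db):
    """Resolve the user behind a captured communication, following Fig. 2B.

    comm:         dict describing the captured communication (hypothetical fields).
    logon_events: dict mapping source IP address to user name (stand-in for 252).
    identity_db:  dict mapping a network identity to a user (stand-in for the
                  identity-relationship database).
    """
    # 252: try the logon record for the source IP address first.
    user = logon_events.get(comm.get("source_ip"))
    if user:
        return user
    # 254/256: e-mail, native or web-based, matched on the sender address.
    if comm.get("type") == "email":
        return identity_db.get(comm.get("sender_email"), "Unknown")
    # 258/260: instant messaging, matched on the sender login ID.
    if comm.get("type") == "instant_messaging":
        return identity_db.get(comm.get("sender_login"), "Unknown")
    # 262: anything else is treated as coming from an unknown user.
    return "Unknown"


# Hypothetical lookup tables and captured communications.
logon_events = {"10.0.0.5": "alice"}
identity_db = {"bob.private@mail.example": "bob", "carol_im": "carol"}

by_ip = resolve_sender({"source_ip": "10.0.0.5", "type": "web"}, logon_events, identity_db)
by_email = resolve_sender(
    {"source_ip": "10.0.0.9", "type": "email", "sender_email": "bob.private@mail.example"},
    logon_events, identity_db)
by_im = resolve_sender(
    {"source_ip": "10.0.0.9", "type": "instant_messaging", "sender_login": "carol_im"},
    logon_events, identity_db)
unknown = resolve_sender({"source_ip": "10.0.0.9", "type": "ftp"}, logon_events, identity_db)
print(by_ip, by_email, by_im, unknown)
```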

[0045] Fig. 3 shows a flowchart of building an identity-relationship database according to another embodiment of the invention.

[0046] At 302, one or more sites on the external network are searched for a searched network identity in the identity-relationship database to produce search results.

[0047] At 304, a relationship between a connected network identity and the searched network identity in the search results is detected.

[0048] At 306, the connected network identity is linked to the searched network identity in the identity-relationship database.

[0049] In an embodiment, the searching of one or more sites on the external network may be performed by a web crawler, which is connected to the external network to search the one or more sites as will be described in detail in a later part of the description. In accordance with this embodiment, the second layer of friends and the subsequent layer of friends which are connected with the internal network identities and their first layer of friends may be constructed from the online relationships published on the one or more sites on the external network.
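
The layer-by-layer expansion of steps 302 to 306 can be sketched as follows. Real crawling is replaced by a caller-supplied function, here named `fetch_connections`, that returns the identities a site publishes as connected to a searched identity; that function, the `max_layers` limit and the sample identities are assumptions for illustration.

```python
def expand_layers(db, seeds, fetch_connections, max_layers=2):
    """Add further layers of friends to the identity-relationship database.

    db:                dict mapping an identity to the set of linked identities.
    seeds:             identities already in the database to start searching from.
    fetch_connections: callable returning identities connected to a searched
                       identity (a stand-in for the web crawler's search results).
    """
    frontier = set(seeds)
    for _ in range(max_layers):
        next_frontier = set()
        for searched in frontier:
            for connected in fetch_connections(searched):
                newly_seen = connected not in db
                # 306: link the connected identity to the searched identity.
                db.setdefault(searched, set()).add(connected)
                db.setdefault(connected, set()).add(searched)
                if newly_seen:
                    next_frontier.add(connected)
        frontier = next_frontier
    return db


# Hypothetical relationships published on external sites.
published = {
    "alice_im": ["bob_blog"],
    "bob_blog": ["carol_forum"],
    "carol_forum": ["dave"],
}
db = {"alice_im": set()}
expand_layers(db, ["alice_im"], lambda identity: published.get(identity, []), max_layers=2)
print(db)
```

With `max_layers=2` the sketch stops after the second layer, so `dave` is never searched for or linked.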

[0050] In an embodiment, the one or more sites on the external network include social networking sites. In another embodiment, the one or more sites on the external network include blog sites, discussion forum sites, or other sites that permit posting of messages or content. In a further embodiment, the one or more sites on the external network include external e-mail and/or instant messaging sites. Accordingly, the searching of the one or more sites includes the searching of the sites in the above embodiments.

[0051] Fig. 4 shows an exemplary identity-relationship graph according to an embodiment of the invention.

[0052] A target unknown identity 402 is linked to its first layer of friends 404 and second layer of friends 406. The target unknown identity 402 may be an internal identity or an external identity in this example. In this example, the target unknown identity 402 is the network identity of a poster of the leaked information on the external network. An internal network identifier 408 is linked to the target unknown identity 402. By identifying the link from the target unknown identity 402 to the internal network identifier 408, the internal network identifier 408 may be identified as a possible source of the leaked information.

[0053] According to an embodiment of the invention, the building of the identity-relationship database further includes determining a closeness of a connection between a first network identity and a second network identity.

[0054] In an embodiment, determining a closeness of a connection may include determining a type of a detected relationship, wherein each type of detected relationship is associated with a proximity value that is used to determine the closeness. The proximity value represents the distance between the two identities, as shown in Fig. 4. Examples of the type of the relationship that are used in the determination of a closeness of a connection may include but are not limited to:

1) the first network identity and the second network identity are declared friends on a social networking site;

2) the first network identity and the second network identity send personal communications to each other via instant messaging;

3) the first network identity and the second network identity send personal communications to each other via email;

4) hyperlinks exist between a blog of the first network identity and a blog of the second network identity;

5) the first network identity has posted a comment on a blog of the second network identity;

6) the first network identity and the second network identity have communicated via corporate email; and

7) the first and second network identities have both posted messages in the same thread on a blog and/or discussion forum.

[0055] In an embodiment, the proximity values for the above 7 types of detected relationships may be in a descending order to determine the closeness of two identities.

[0056] For example, a pair of identities who have communicated via personal email under type 3) has a higher proximity value, and thus is closer, than a pair of identities who have communicated via corporate email under type 6). In this example, the highest possible proximity is a declared friend as gathered from the social networking sites, whereas the lowest possible proximity is shared postings to a common message thread in online forums.

[0057] It is to be noted that types of detected relationships other than the above 7 types may also be used in the determination of the closeness of a connection.

[0058] In another embodiment, the determining of a closeness of a connection may include determining a frequency of communication between the first network identity and the second network identity. As shown in Fig. 4, the frequency of communications may determine the thickness of the relationship links 410.
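
One way to rank connections by relationship type and frequency is sketched below. The numeric proximity values (7 down to 1) and the choice to use frequency only as a tie-breaker are assumptions; the description states only that the values descend across the seven types and that frequency also contributes.

```python
# Assumed proximity values, descending in the order of the seven listed types.
PROXIMITY = {
    "declared_friend": 7,   # type 1
    "personal_im": 6,       # type 2
    "personal_email": 5,    # type 3
    "blog_hyperlink": 4,    # type 4
    "blog_comment": 3,      # type 5
    "corporate_email": 2,   # type 6
    "same_thread": 1,       # type 7
}


def closeness(relationship_types, frequency):
    """Return a sortable closeness key: strongest relationship type, then frequency."""
    best = max(PROXIMITY[kind] for kind in relationship_types)
    return (best, frequency)


# Hypothetical connections: (identity pair) -> (detected types, message count).
connections = {
    ("alice_im", "bob_blog"): (["declared_friend"], 3),
    ("alice_im", "carol"): (["corporate_email", "same_thread"], 40),
    ("alice_im", "dave"): (["personal_email"], 12),
}
ranked = sorted(connections, key=lambda pair: closeness(*connections[pair]), reverse=True)
for pair in ranked:
    print(pair, closeness(*connections[pair]))
```

Under these assumed values a declared friendship outranks frequent corporate email, which matches the ordering given in the example of types 3) and 6).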

[0059] The degree of closeness between various identities may be ranked based on the type of detected relationships and the frequency of communications. The closeness of a connection between two network identities may be used in the identification of a possible source of the leaked information. In an embodiment, identifying the employee as a possible source of the leaked information includes using the closeness of the connections between the employee and the network identity of the poster of the leaked information to determine a likelihood that the employee is the source of the leaked information.

[0060] In order to trace the source of the leaked information, the leaked information needs to be detected and located. This may be performed by a web crawler which is connected to the external network to search for leaked information.

[0061] Fig. 5 shows a flowchart of a process for detecting the leaked information, e.g. sensitive source code, on network-accessible sites on an external network according to an embodiment of the invention. At 502, a set of unique identifying elements that identify a sensitive source code module accessed from a source code repository may be determined. At 504, a web crawler is connected to the external network to automatically search a list of one or more network-accessible sites for text that matches one or more of the unique identifying elements in the set of unique identifying elements, to provide search results. At 506, the search results may be collected in a memory of the web crawler. At 508, a relevancy for each of the search results may be determined, based at least in part on a number of the unique identifying elements that were matched and on a number of search results. At 510, the results may be sorted according to the relevancy. At 512, the results may be provided to a user, to indicate whether sensitive source code was found on the network-accessible sites. A detailed description of how to identify the unique identifying elements is provided in a later portion of the description.
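
Steps 504 to 510 can be illustrated with a simplified matcher. In this sketch the crawled pages are supplied as a dictionary of already-fetched text, and relevancy is taken to be the fraction of unique identifying elements matched on a page; the number-of-search-results factor mentioned at 508 is not modelled. The function name, the element names and the URLs are hypothetical.

```python
import re


def score_pages(unique_elements, pages):
    """Match unique identifying elements against page text and sort by relevancy.

    pages: dict mapping a URL to the text fetched from it (fetching is out of scope).
    Returns a list of (url, relevancy, matched_elements), most relevant first.
    """
    results = []
    for url, text in pages.items():
        # Compare whole identifiers so that substrings do not count as matches.
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
        matched = sorted(element for element in unique_elements if element in tokens)
        if matched:
            results.append((url, len(matched) / len(unique_elements), matched))
    results.sort(key=lambda result: result[1], reverse=True)
    return results


# Hypothetical unique identifying elements and crawled pages.
unique_elements = ["computeQuarterlyHash", "ACME_LICENSE_SALT", "ledgerReconcileV2"]
pages = {
    "https://forum.example/t/1": "int computeQuarterlyHash(char *buf) { ledgerReconcileV2(buf); }",
    "https://blog.example/post": "what does the ACME_LICENSE_SALT setting do?",
    "https://news.example/": "nothing relevant here",
}
results = score_pages(unique_elements, pages)
for url, relevancy, matched in results:
    print(f"{relevancy:.2f} {url} {matched}")
```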

[0062] When a crawler server detects leakages of protected information according to the above process, the crawler server may capture the profile of the poster of the digital information and the profiles of all parties related to the poster. The profiles may include network identities of the users, and may be matched partially or completely when determining the source of the leaked information in accordance with an embodiment of the invention.

[0063] Fig. 6 shows another embodiment of the invention to trace the source of the leaked information even if the network identity of the poster of the leaked information is not in the identity-relationship database.

[0064] At 602, the network identity of the poster of the leaked information is used as a target network identity.

[0065] At 604, one or more sites on the external network are searched for the target network identity to produce search results.

[0066] At 606, a relationship between the target network identity and a connected network identity in the search results is detected.

[0067] At 608, it is determined whether the connected network identity is in the identity-relationship database.

[0068] If the connected network identity is in the identity-relationship database, the target network identity is linked to the connected network identity in the identity-relationship database at 610.

[0069] If the connected network identity is not in the identity-relationship database, the process is repeated using the connected network identity as the target network identity at 612.

[0070] In accordance with the above embodiment, the target network identity is the target unknown identity 402 in Fig. 4 above. The layers of identities linked to the target network identity are determined until a network identity which is linked to the target network identity and which is in the identity-relationship database is determined. This may establish a link between the poster of the leaked information on the external network and network identities in the identity-relationship database.
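
The repetition of steps 602 to 612 can be sketched as a layered search outward from the poster. The sketch below assumes a caller-supplied `fetch_connections` function in place of real site searches, bounds the search with an assumed `max_layers` limit, and links every hop of the discovered chain into the database; these choices and all names are illustrative assumptions.

```python
from collections import deque


def link_unknown_poster(db, poster, fetch_connections, max_layers=3):
    """Search outward from a poster not in the database until a known identity is hit.

    Returns the chain from the poster to the first known identity, or None.
    """
    parents = {poster: None}
    queue = deque([(poster, 0)])
    while queue:
        target, depth = queue.popleft()      # 602/612: current target identity
        if depth >= max_layers:
            continue
        for connected in fetch_connections(target):   # 604/606
            if connected in parents:
                continue
            parents[connected] = target
            if connected in db:                        # 608
                # 610: walk back to the poster, linking each hop in the database.
                chain = [connected]
                while parents[chain[-1]] is not None:
                    chain.append(parents[chain[-1]])
                for a, b in zip(chain, chain[1:]):
                    db.setdefault(a, set()).add(b)
                    db.setdefault(b, set()).add(a)
                return chain[::-1]
            queue.append((connected, depth + 1))
    return None


# Hypothetical published relationships and an existing database.
published = {"leaker99": ["pal_1"], "pal_1": ["leaker99", "bob_blog"]}
db = {"bob_blog": {"alice_im"}, "alice_im": {"bob_blog"}}
chain = link_unknown_poster(db, "leaker99", lambda i: published.get(i, []))
print(chain)
```

Once the chain is stored, the ordinary link check against the database can connect the poster to internal identities such as the hypothetical `alice_im`.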

[0071] The above embodiment of searching for the connected network identity linked with the network identifier of the poster of the leaked information and determining the link from the target network identifier to a network identifier in the identity-relationship database may be carried out by a web crawler. In an embodiment, the web crawler is configured to detect leakage of protected information on the external network. The detection of leakage of protected information may be in accordance with the embodiment described in Fig. 5 above.

[0072] When the leaked information is detected by the web crawler, the online/network identity of the person leaking the information is captured, which may be the target network identity as described above. The identity-relationship database may be checked to determine whether there is a link between the target network identity and an internal identity so as to trace the possible source of leakage. However, if the target network identity is not included in the identity-relationship database, the web crawler may be configured to build the linked network identities, such as the layers of friends, for the target network identity. The sources of information used to build the layers of friends may include but are not limited to online forums, social networking sites and blogs.

[0073] In an embodiment, when there is a match between the "layers of friends" for the online identity who leaked the protected information and those for internal users, the "layers" are immediately merged and the online identity is immediately linked to the relevant internal users. If there is still no match beyond a threshold number of "layers", the closest but non-linked layers may be shown to the administrator for manual evaluation and judgement.

[0074] In the identity-relationship database as described in the above embodiments, the contained links may be collapsed so that all links between the employee and the network identities of others are direct links.
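
Collapsing the links can be read as replacing each indirect chain with a direct link from the employee to every identity reachable from that employee. The following sketch computes that set under this reading; the function name and the sample data are hypothetical.

```python
def collapse_links(db, employee):
    """Return every identity reachable from the employee, as a set of direct links."""
    reachable = set()
    stack = [employee]
    while stack:
        node = stack.pop()
        for neighbour in db.get(node, ()):
            if neighbour != employee and neighbour not in reachable:
                reachable.add(neighbour)
                stack.append(neighbour)
    return reachable


# Hypothetical database with a three-hop chain from the employee.
db = {
    "alice": {"alice_im"},
    "alice_im": {"alice", "bob_blog"},
    "bob_blog": {"alice_im", "carol"},
    "carol": {"bob_blog"},
}
direct = collapse_links(db, "alice")
print(sorted(direct))
```

Collapsing in this way discards the number of hops, so any closeness information would need to be kept separately if it is still required.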

[0075] In accordance with the above embodiments, if an employee is identified as a possible source of the leaked information, an alert may be sent to a management device.

[0076] Another embodiment of the invention relates to a system for tracing a source of leaked information owned by an organization after the information has been leaked on an external network. The system may include an identity-relationship database containing information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates. The system may also include a network device connected to the identity-relationship database and to the external network. The network device may be configured to locate the leaked information on a site on the external network; determine a network identity of the poster of the leaked information on the external network; determine whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee; and if so, identify the employee as a possible source of the leaked information.

[0077] Fig. 7A shows a system for tracing a source of leaked information owned by an organization according to an embodiment of the invention.

[0078] The system 700 includes an identity-relationship database 702, and a network device 704 connected to the identity-relationship database 702 and to the external network 708 in accordance with the embodiment described above. The identity-relationship database 702 may be built and the network device 704 may be configured to trace the source of leaked information according to the embodiments described above.

[0079] In an embodiment, the identity-relationship database 702 may include data linking the employee to the one or more internal network identities in the identity-relationship database. The data may be extracted from the organization's internal human resources database and network servers. In another embodiment, the identity-relationship database 702 may include data linking a network identity of a message sender to a network identity of an intended recipient, which is extracted from an intercepted message being sent over a network. In a further embodiment, the identity-relationship database 702 includes data on the frequency of communication between the network identity of the message sender and the intended recipient.

[0080] The system 700 may also include a web crawler configured to search one or more sites on the external network for a searched network identity in the identity-relationship database to produce search results, detect a relationship between a connected network identity and the searched network identity in the search results, and link the connected network identity to the searched network identity in the identity-relationship database. The network device 704 may be used as the web crawler, or an additional device may be used as the web crawler. For illustrative purposes, the network device 704 is used as the web crawler to perform the function of the web crawler in this example, and the terms network device and web crawler may be used interchangeably in the description below.

[0081] The web crawler 704 may be implemented as a processor, as dedicated hardware, or as a software module, executing along with other software modules in the web crawler 704. Besides the above embodiment to determine the identity-relationship for the searched network identity, the web crawler may also be configured to locate the leaked information on the external network and trace the source of the leaked information.

[0082] The identity-relationship database 702 may further include data relating to a closeness of a connection between a first network identity and a second network identity, as described in the embodiments above.

[0083] If the network identity of the poster of the leaked information is not in the identity-relationship database 702, the web crawler 704 may be configured to use the network identity of the poster of the leaked information as a target network identity, search one or more sites on the external network for the target network identity to produce search results, detect a relationship between the target network identity and a connected network identity in the search results, and determine whether the connected network identity is in the identity-relationship database. The web crawler 704 is also configured to link the target network identity to the connected network identity in the identity-relationship database, if the connected network identity is in the identity-relationship database. Otherwise, the web crawler 704 is configured to repeat the process with the connected network identity as the target network identity.

[0084] The system 700 may also include a management device 706. The web crawler 704 may be configured to send an alert to the management device 706 if the employee is identified as a possible source of the leaked information.

[0085] The identity-relationship database 702, the web crawler 704 and the management device 706 may be connected to an internal network 710.

[0086] Fig. 7B shows a schematic diagram of an embodiment of some components of the system 700 according to an embodiment.

[0087] Portions or modules of the system 700 may be implemented by a computer system. The computer system includes a CPU 752 (central processing unit), a memory 754, a network interface 756, a clock 758, and input/output devices such as a display 762 and a keyboard input 764. All the components 752, 754, 756, 758, 762, 764 of the computer system are connected to and communicate with each other through a computer bus 760.

[0088] For example, in some embodiments the memory 754 may be used as the identity-relationship database as explained above. The memory 754 may include more than one memory, such as RAM, ROM, EPROM, hard disk, etc., wherein some of the memories are used for storing data and programs and other memories are used as working memories.

[0089] The memory 754, when used as the identity-relationship database, may be configured to store the data linking the network identities as described above. The memory 754 may also be configured to store the instructions for building the identity-relationship database as described in the embodiments above, and the instructions for tracing the source of the leaked information owned by an organization according to the embodiments above. In some embodiments, the CPU 752 may be used as the network device (the web crawler) 704 and/or the management device 706 as described in Fig. 7A above, and may be connected to an internal network (e.g. a local area network (LAN) or a wide area network (WAN) within an organization) and/or an external network (e.g. the Internet) through the network interface 756.

[0090] The instructions, when executed by the CPU 752, may cause the CPU 752 to build an identity-relationship database containing information linking an employee of the organization to one or more network identities and to network identities of others with whom the employee communicates, locate the leaked information on a site on the external network, determine a network identity of the poster of the leaked information on the external network, determine whether one or more links in the identity-relationship database connect the network identity of the poster to the employee, and if so identify the employee as a possible source of the leaked information.

[0091] Fig. 8 shows a schematic diagram of a system 800 implemented in a digital communication network 802 according to an embodiment. The system 800 may include a network gateway device 804, a management device 806 and a web crawler 808. In some embodiments, the network gateway device 804, the management device 806 and the web crawler 808 may be implemented as a computer system similar to the computer system 700 in Fig. 7B above. In different embodiments, the system 800 may comprise different components and the number of components for the system 800 may also vary.

[0092] The network gateway device 804 may be configured to analyze the digital information transmitted over the network and may apply relevant policies to a digital communication for preventing leakage of sensitive information. As the network gateway device 804 is used to protect against leakage of sensitive information, the network gateway device 804 may be considered as a protecting device, which may be named an "iProtect" device. The network gateway device 804 may intercept the digital communication being sent from an internal network to an external network. The network gateway device 804 may include a correlation engine for evaluating a security risk of the digital communication, a source code detection module for detecting source code in the digital communication, and a network traffic analyzer for analyzing the digital communication and determining the network identity of the user according to the embodiment as described in Fig. 2B above. In accordance with an embodiment of the invention, the network gateway device 804 may be used to build the identity-relationship database by determining network identities in an internal network, e.g. within an organization. In different embodiments, the network gateway device 804 may have different parts and the number of parts of the network gateway device 804 may also vary.

[0093] The web crawler 808 of the system 800 may be configured to search Internet sites for leakages of information, and may be configured to build the identity-relationship database by searching and determining network identities in an external network, e.g. the Internet, as described in the embodiment of Fig. 3 above. For example, in some embodiments, the web crawler 808 is the network device 704 as described in Fig. 7A above, which is configured to search and detect leaked information in an external network, determine the network identities from the external network, and/or trace a source of leaked information according to the above embodiments. The web crawler 808 may be named "iGather" in the system 800.

[0094] The network identities determined from the internal network by the network gateway device 804 and the network identities determined from the external network by the web crawler 808 may be linked and built into the identity-relationship database, which may be saved in a database 810 of the system 800.
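The tracing steps recited in paragraph [0090] can be illustrated with a minimal sketch. It assumes the identity-relationship database can be modelled as an undirected graph and that a breadth-first search is an acceptable way to test whether links connect the poster to an employee; the description does not prescribe a particular search algorithm. All identities, names and the `trace_source` helper are hypothetical.

```python
from collections import deque

# Hypothetical identity-relationship data: each key is an employee or a
# network identity; each value is the set of identities it is linked to.
links = {
    "employee:alice": {"alice@corp.example", "alice_im"},
    "alice@corp.example": {"bob@mail.example"},
    "alice_im": set(),
    "bob@mail.example": {"forum:b0b"},
    "forum:b0b": set(),
    "employee:carol": {"carol@corp.example"},
    "carol@corp.example": set(),
}


def build_undirected(links):
    """Treat every link as bidirectional so tracing can start from the poster."""
    graph = {}
    for node, neighbours in links.items():
        graph.setdefault(node, set())
        for other in neighbours:
            graph[node].add(other)
            graph.setdefault(other, set()).add(node)
    return graph


def trace_source(graph, poster_identity, employees):
    """Breadth-first search from the poster; return {employee: chain of links}."""
    if poster_identity not in graph:
        return {}
    paths = {poster_identity: [poster_identity]}
    queue = deque([poster_identity])
    while queue:
        current = queue.popleft()
        for neighbour in sorted(graph[current]):
            if neighbour not in paths:
                paths[neighbour] = paths[current] + [neighbour]
                queue.append(neighbour)
    # Only employees reachable from the poster are possible sources.
    return {e: paths[e] for e in employees if e in paths}


graph = build_undirected(links)
suspects = trace_source(graph, "forum:b0b", ["employee:alice", "employee:carol"])
for employee, path in suspects.items():
    print(employee, "reached via", " -> ".join(path))
```

In this sketch only the hypothetical employee "alice" is linked, indirectly, to the poster identity, so only she would be flagged as a possible source.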

[0095] The management device 806 may be a management and administration tool that can be used to control the network gateway device 804 and the web crawler 808, and to provide management reports. The system may comprise a plurality of the management devices 806 to provide scalability.

[0096] The process of detecting and locating leaked information on a site on the external network by the web crawler 808 above will be described in detail below.

[0097] The web crawler 808 may be configured to search network-accessible sites for leaked source code according to the embodiment described in Fig. 5 above. The system 800 above may include a source code repository 812 that may store one or more source code modules and may be connected to an internal network. The web crawler may be configured to determine a set of unique identifying elements that identify a sensitive source code module accessed from the source code repository. The web crawler may search a list of one or more network-accessible sites for text that matches one or more of the unique identifying elements in the set of unique identifying elements, to provide search results. The web crawler may also collect the search results in a memory of the web crawler or of the system.

[0098] Further, the web crawler may determine a relevancy for each of the search results based at least in part on a number of the unique identifying elements that were matched and on a number of search results, and may sort the results according to the relevancy. The web crawler may send the results to the management device 806, to indicate to a user whether sensitive source code was found on the network-accessible sites.
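As a rough illustration of the matching step in paragraph [0097], the following sketch checks already-retrieved page text for unique identifying elements. It performs no network access; the page contents, URLs and the `find_matches` helper are hypothetical, and results are ordered only by the number of matched elements, which is a simplification of the relevancy described in the text.

```python
def find_matches(pages, unique_elements):
    """Return, per page, the unique identifying elements found in its text."""
    results = {}
    for url, text in pages.items():
        matched = [element for element in unique_elements if element in text]
        if matched:
            results[url] = matched
    return results


# Hypothetical page text already retrieved by a crawler.
pages = {
    "https://forum.example/thread/1": "package insight.common.util; void interestingMethodAction()",
    "https://blog.example/post/2": "an unrelated post about date formatting",
    "https://paste.example/3": "call interestingMethodAction() twice",
}
unique_elements = ["insight.common.util", "interestingMethodAction"]

matches = find_matches(pages, unique_elements)
# Pages matching more elements are listed first.
ranked = sorted(matches, key=lambda url: len(matches[url]), reverse=True)
for url in ranked:
    print(url, matches[url])
```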

[0099] The web crawler may provide active monitoring and detection of leakages to the external network. The web crawler may operate by automatically logging into one or more of the network-accessible sites and performing search-and-filter activities. These network-accessible sites may not be accessible to popular search engines. These network-accessible sites may be designated by a user of the system 800.

[00100] The search-and-filter activities performed by the web crawler may be broken down into a plurality of phases (e.g. two phases). An initial search phase may be performed to list out a summary of results ranked in order of relevance. Users can then review the summary results and instruct the web crawler to perform a more in-depth search of the selected initial results. Wherever possible, multiple search functions offered by the designated Internet sites may be utilized by the web crawler to provide more accurate and comprehensive searches. The above activities can be performed on demand by the administrators or as scheduled.

[00101] Inputs to the online search can be manually entered or automatically derived by the web crawler after accessing protected information repositories and evaluating the protected content. For example, the web crawler can automatically access a source code repository of an organization, extract the source code, obtain the unique identifying elements of the extracted source code and perform searches using the unique identifying elements.

[00102] An exemplary piece of source code 900 named GeneralUtil.java is shown in Fig. 9A. The exemplary source code 900 is used for illustrating the detailed process of obtaining unique identifying elements.

[00103] Initially, elements may be extracted from the source code 900. The elements extracted from the source code 900 may be categorized into a plurality of element types. The element types may include:

One-line comments;

Declared Package names (for programming languages which support this);

Method names;

Class names; and

File names.

Different element types may be used for categorizing the elements extracted from the source code in different embodiments. The number of element types may also be different in other embodiments.
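The extraction and categorization step can be sketched as follows. Fig. 9A is not reproduced here, so the Java snippet below is a hypothetical stand-in built from the element names mentioned in the description; the regular expressions are a simplified assumption and not a full Java parser.

```python
import re

# Hypothetical stand-in for the source code 900 (Fig. 9A is not reproduced).
JAVA_SOURCE = '''\
package insight.common.util;

public class GeneralUtil {
    // Gets today's date
    public static String getCurrentDate() {
        return java.time.LocalDate.now().toString();
    }

    // this is my comment for the interestingMethodAction
    public void interestingMethodAction(int id) {
    }
}
'''

METHOD_PATTERN = (
    r"^\s*(?:public|protected|private)\s+(?:static\s+)?"
    r"[\w<>\[\]]+\s+(\w+)\s*\("
)


def extract_elements(source, file_name):
    """Categorize elements of a Java source text by element type."""
    return {
        "One-line comments": [c.strip() for c in re.findall(r"//(.*)", source)],
        "Declared Package names": re.findall(
            r"^\s*package\s+([\w.]+)\s*;", source, re.MULTILINE
        ),
        "Method names": re.findall(METHOD_PATTERN, source, re.MULTILINE),
        "Class names": re.findall(r"\bclass\s+(\w+)", source),
        "File names": [file_name],
    }


elements = extract_elements(JAVA_SOURCE, "GeneralUtil.java")
for element_type, values in elements.items():
    print(element_type, values)
```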

[00104] Next, each of the elements extracted from the source code 900 may be checked, to determine whether it is a unique identifying element, using uniqueness rules. The uniqueness rules may include: a) Length of the element; and b) Whether the element is included in a blacklist of common/generic words. Different uniqueness rules may be used in different embodiments. The number of uniqueness rules may also be different in other embodiments.

[00105] Either one uniqueness rule or a combination of uniqueness rules may be applied to each element type. For example,

1. The uniqueness rule "Length of the element" may be applied to the element type "One-line Comments".

2. The uniqueness rule "Length of the element" may be applied to the element type "Declared Package Names", starting (in some embodiments) with a hierarchy of 2 levels, e.g. "com.mycompany". An example element extracted from the source code 900 is "insight.common".

3. The uniqueness rule "Length of the element" may be applied to the element type "Method Names". The elements categorized under the element type "Method Names" may also be compared to the blacklist of common/generic words.

4. The uniqueness rule "Length of the element" may be applied to the element type "Classes Names". The elements categorized under the element type "Classes Names" may also be compared to the blacklist of common/generic words.

5. The uniqueness rule "Length of the element" may be applied to the element type "File Name". The elements categorized under the element type "File Name" may also be compared to the blacklist of common/generic words.
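The per-type application of the uniqueness rules listed above might be sketched as follows. The length thresholds and blacklist entries are hypothetical values chosen only so that the example elements from the description fall on the expected side; the description does not disclose actual thresholds or blacklist contents.

```python
# Hypothetical minimum lengths per element type (not from the description).
MIN_LENGTH = {
    "One-line Comments": 15,
    "Declared Package Names": 10,
    "Method Names": 8,
    "Classes Names": 8,
    "File Name": 8,
}

# Hypothetical blacklist of common/generic words, compared case-insensitively.
BLACKLIST = {"getcurrentdate", "generalutil", "main", "tostring"}

# Element types that are also compared to the blacklist (items 3 to 5 above).
BLACKLIST_TYPES = {"Method Names", "Classes Names", "File Name"}


def is_unique(element, element_type):
    """Apply the length rule, then the blacklist rule where it is used."""
    if len(element) < MIN_LENGTH[element_type]:
        return False
    # Package names start with a hierarchy of at least 2 levels.
    if element_type == "Declared Package Names" and "." not in element:
        return False
    if element_type in BLACKLIST_TYPES:
        # For file names, compare the name without its extension.
        stem = element.rsplit(".", 1)[0] if element_type == "File Name" else element
        if stem.lower() in BLACKLIST:
            return False
    return True


samples = [
    ("Gets today's date", "One-line Comments"),
    ("insight.common", "Declared Package Names"),
    ("InterestingMethodAction", "Method Names"),
    ("getID", "Method Names"),
    ("GetCurrentDate", "Method Names"),
    ("GeneralUtil", "Classes Names"),
    ("GeneralUtil.java", "File Name"),
]
for element, element_type in samples:
    label = "unique" if is_unique(element, element_type) else "generic"
    print(f"{element_type}: {element} -> {label}")
```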

[00106] Fig. 9B shows a table of elements extracted from the source code 900, classified as unique identifying elements or generic elements according to an embodiment. Column 902 shows the various element types, column 904 shows the elements determined as generic, and column 906 shows the elements determined as unique identifying elements.

[00107] Row 908 shows elements, e.g. "this is my comment for the interestingMethodAction" and "Gets today's date", categorized under the element type "One-line Comments" and determined as unique identifying elements. Row 910 shows elements, e.g. "insight.common" and "insight.common.util", categorized under the element type "Declared Package Names" and determined as unique identifying elements. Row 912 shows an element, e.g. InterestingMethodAction, categorized under the element type "Method Names" and determined as a unique identifying element. These elements may have a length above a predetermined length threshold if the uniqueness rule "Length of the element" is applied.

[00108] By applying the uniqueness rule "Length of the element", elements such as "getID" and "setID", which have a length below a predetermined length threshold, may not be determined as unique identifying elements. Elements having a length below a predetermined length threshold may be excluded to improve the accuracy of the search and to reduce false positives.

[00109] Row 912 also shows an element, e.g. GetCurrentDate, categorized under the element type "Method Names" and determined as a generic element. Row 914 shows an element, e.g. GeneralUtil, categorized under the element type "Classes Names" and determined as a generic element. Row 916 shows an element, e.g. GeneralUtil, categorized under the element type "File Name" and determined as a generic element. These elements may be found in the blacklist of common/generic words, and will therefore not be determined to be "unique" if the uniqueness rule applying the blacklist is applied.

[00110] When all the unique identifying elements are obtained, the web crawler 808 may proceed to perform searches with a plurality of combinations of the unique identifying elements. Searches may be performed in a descending order of relevance, starting with the highest relevance, i.e. matches to all unique identifying elements. The web crawler 808 may perform searches starting from the more relevant element type "One-line comments" to the less relevant element type "File names". There can be, e.g., thirty-one types of combination searches from the, e.g., five element types that the web crawler 808 analyzes.

[00111] The thirty-one types of combination searches are listed in the following:

Types of Combinations:

1st: All One-line Comments + All Packages + All Methods + All Classes + File name = Highest relevance

2nd: 0 One-line Comments + All Packages + All Methods + All Classes + File name

3rd: 0 One-line Comments + 0 Packages + All Methods + All Classes + File name

...

31st: 0 One-line Comments + 0 Packages + 0 Methods + 0 Classes + File name = Least relevance
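The count of thirty-one follows from taking every non-empty subset of five element types (2^5 - 1 = 31). A minimal sketch of such an enumeration is given below; it reproduces the first and last combination types of the list above, but the order of the intermediate combinations is an assumption, since the full list is not reproduced in the text.

```python
from itertools import combinations

ELEMENT_TYPES = ["One-line Comments", "Packages", "Methods", "Classes", "File name"]


def combination_types(element_types):
    """Enumerate every non-empty subset, largest subsets first."""
    combos = []
    for size in range(len(element_types), 0, -1):
        combos.extend(combinations(element_types, size))
    return combos


combos = combination_types(ELEMENT_TYPES)
print(len(combos))   # number of combination types
print(combos[0])     # all five element types
print(combos[-1])    # file name only
```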

[00112] After a specific combination search is completed, the next unique identifying element in the same element type may be used for the subsequent combination search. To reduce the number of results, the user may configure a limit to the maximum number of results returned from each combination search.

[00113] After the search results are obtained, they may be ranked in a descending order of relevancy. Relevancy may be computed using the following formula:

Relevancy value = CombinationPoints / TotalSearchResults

where

CombinationPoints = (One-Line Comment * Points per comment) + (Declared Package Name * Points per package) + (Method Name * Points per method) + (Class Name * Points per class) + (File Name * Points per filename)

and

TotalSearchResults = the number of results retrieved when searching using that combination.

[00114] CombinationPoints may be divided by TotalSearchResults to give higher weight to combinations that return fewer results, i.e. combinations that are more unique. For example:

Case 1: Calculation for a combination search using one Class Name which returns 100 records

Relevancy value = [(0 * 25) + (0 * 18) + (0 * 13) + (1 * 10) + (0 * 2)] / 100 = 0.1

Case 2: Calculation for a combination search using one File Name which returns 1 record

Relevancy value = [(0 * 25) + (0 * 18) + (0 * 13) + (0 * 10) + (1 * 2)] / 1 = 2
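The relevancy formula and the two cases above can be checked with a short sketch. The points per element type (25, 18, 13, 10 and 2) are taken from the worked cases; the dictionary keys and the `relevancy` function name are hypothetical.

```python
# Points per element type, as used in Case 1 and Case 2 above.
POINTS = {"comment": 25, "package": 18, "method": 13, "class": 10, "file": 2}


def relevancy(matched_counts, total_search_results):
    """Relevancy value = CombinationPoints / TotalSearchResults."""
    if total_search_results <= 0:
        raise ValueError("total_search_results must be positive")
    combination_points = sum(
        POINTS[element_type] * matched_counts.get(element_type, 0)
        for element_type in POINTS
    )
    return combination_points / total_search_results


case1 = relevancy({"class": 1}, 100)  # one Class Name, 100 records
case2 = relevancy({"file": 1}, 1)     # one File Name, 1 record
print(case1, case2)

# Rank the cases in descending order of relevancy.
ranked = sorted({"Case 1": case1, "Case 2": case2}.items(),
                key=lambda item: item[1], reverse=True)
print([name for name, _ in ranked])
```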

[00115] In this example, the result of Case 2 is ranked higher in terms of relevancy than the result of Case 1 although Case 1 uses a more relevant element type.

[00116] Fig. 10 shows a flowchart of a process for determining the set of unique identifying elements that identify the sensitive source code module accessed from the source code repository according to an embodiment of the invention. In 1002, one or more elements may be extracted from the sensitive source code module. In 1004, each extracted element may be checked to determine whether it is a unique identifying element based at least in part on a length of the element. The element may not be a unique identifying element if it has a length below a predetermined length threshold. In 1006, the element may be checked against a blacklist of common or generic words to determine whether it is a unique identifying element.

[00117] In 1008, the elements may be categorized according to element types. The element types may include: one-line comments; declared package names; method names; class names; and file names. In 1010, a number of points may be provided for each element type. In 1012, a total number of points may be assigned to each of the search results based on a product of a number of unique identifying elements of a particular element type that were matched and the number of points for the particular element type to determine a relevancy for each of the search results. In 1014, the total number of points may be divided by the number of search results to determine a relevancy for each of the search results.

[00118] While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

What is claimed is:

1. A method of tracing a source of leaked information owned by an organization, after the information has been leaked on an external network, the method comprising: building an identity-relationship database containing information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates; locating the leaked information on a site on the external network; determining a network identity of the poster of the leaked information on the external network; determining whether one or more links in the identity-relationship database connect the network identity of the poster of the leaked information to the employee; and if so, identifying the employee as a possible source of the leaked information.

2. The method of claim 1, wherein building an identity-relationship database comprises: extracting data from the organization's internal human resources database and network servers, the name of an employee and one or more internal network identities used by the employee, and linking the employee to the one or more internal network identities in the identity-relationship database.

3. The method of claim 1 or 2, wherein building an identity-relationship database comprises: intercepting a message being sent over a network; extracting a network identity of the message sender from the message; extracting a network identity of the intended recipient from the message; and linking the network identity of the message sender to the network identity of the intended recipient.

4. The method of claim 3, wherein building an identity-relationship database further comprises recording the frequency of communication between the network identity of the message sender and the intended recipient.

5. The method of any of claims 1 to 4, wherein building an identity-relationship database further comprises: searching one or more sites on the external network for a searched network identity in the identity-relationship database to produce search results; detecting a relationship between a connected network identity and the searched network identity in the search results; and linking the connected network identity to the searched network identity in the identity-relationship database.

6. The method of claim 5, wherein searching one or more sites on the external network comprises using a Web crawler to search one or more sites on the external network.

7. The method of claim 5, wherein searching one or more sites on the external network comprises searching social networking sites.

8. The method of claim 5, wherein searching one or more sites on the external network comprises searching blog sites, discussion forum sites, or other sites that permit posting of messages or content.

9. The method of claim 5, wherein searching one or more sites on the external network comprises searching external e-mail and/or instant messaging sites.

10. The method of any of claims 1 to 9, wherein building an identity-relationship database comprises determining a closeness of a connection between a first network identity and a second network identity.

11. The method of claim 10, wherein determining a closeness of a connection comprises determining a type of a detected relationship, wherein each type of detected relationship is associated with a proximity value that is used to determine the closeness.

12. The method of claim 11, wherein determining a type of a detected relationship comprises determining a type selected from: the first network identity and the second network identity are declared friends on a social networking site; the first network identity and the second network identity send personal communications to each other via instant messaging; the first network identity and the second network identity send personal communications to each other via email; hyperlinks exist between a blog of the first network identity and a blog of the second network identity; the first network identity has posted a comment on a blog of the second network identity; the first network identity and the second network identity have communicated via corporate email; and the first and second network identities have both posted messages in the same thread on a blog and/or discussion forum.

13. The method of any of claims 10 to 12, wherein determining a closeness of a connection comprises determining a frequency of communication between the first network identity and the second network identity.

14. The method of any of claims 10 to 13, wherein identifying the employee as a possible source of the leaked information comprises using the closeness of the connections between the employee and the network identity of the poster of the leaked information to determine a likelihood that the employee is the source of the leaked information.

15. The method of any of claims 1 to 14, wherein locating the leaked information comprises using a Web crawler to locate the leaked information.

16. The method of any of claims 1 to 15, further comprising: if the network identity of the poster of the leaked information is not in the identity-relationship database: using the network identity of the poster of the leaked information as a target network identity; searching one or more sites on the external network for the target network identity to produce search results; detecting a relationship between the target network identity and a connected network identity in the search results; determining whether the connected network identity is in the identity-relationship database; linking the target network identity to the connected network identity in the identity-relationship database, if the connected network identity is in the identity-relationship database; and repeating the process with the connected network identity as the target network identity, if the connected network identity is not in the identity-relationship database.

17. The method of any of claims 1 to 16, further comprising collapsing links in the identity-relationship database so that all links between the employee and the network identities of others are direct links.

18. The method of any of claims 1 to 17, further comprising sending an alert to a management device if the employee is identified as a possible source of the leaked information.

19. A system for tracing a source of leaked information owned by an organization, after the information has been leaked on an external network, the system comprising: an identity-relationship database containing information linking, either directly or indirectly, an employee of the organization to one or more network identities, and to network identities of others with whom the employee communicates; and a network device, the network device connected to the identity-relationship database and to the external network, the network device configured to: locate the leaked information on a site on the external network; determine a network identity of the poster of the leaked information on the external network; determine whether one or more links in the identity relationship database connect the network identity of the poster of the leaked information to the employee; and if so, identify the employee as a possible source of the leaked information.

20. The system of claim 19, wherein the identity-relationship database comprises: data linking the employee to the one or more internal network identities in the identity-relationship database, said data being extracted from the organization's internal human resources database and network servers.

21. The system of claim 19 or 20, wherein the identity-relationship database comprises data linking a network identity of a message sender to a network identity of an intended recipient, the network identity of the message sender and the network identity of the message recipient being extracted from an intercepted message being sent over a network.

22. The system of claim 21, wherein the identity-relationship database further comprises data on the frequency of communication between the network identity of the message sender and the intended recipient.

23. The system of any of claims 19 to 22, further comprising a Web crawler configured to: search one or more sites on the external network for a searched network identity in the identity-relationship database to produce search results; detect a relationship between a connected network identity and the searched network identity in the search results; and link the connected network identity to the searched network identity in the identity-relationship database.

24. The system of claim 23, wherein the Web crawler is configured to search social networking sites.

25. The system of claim 23, wherein the Web crawler is configured to search blog sites, discussion forum sites, and/or other sites that permit posting of messages or content.

26. The system of claim 23, wherein the Web crawler is configured to search external e-mail and/or instant messaging sites.

27. The system of any of claims 19 to 26, wherein the identity-relationship database further comprises data relating to a closeness of a connection between a first network identity and a second network identity.

28. The system of claim 27, wherein the data relating to a closeness of a connection comprises a type of a detected relationship, wherein each type of detected relationship is associated with a proximity value that is used to determine the closeness.

29. The system of claim 28, wherein the type of the detected relationship comprises at least one of: a first type, wherein the first network identity and the second network identity are declared friends on a social networking site; a second type, wherein the first network identity and the second network identity send personal communications to each other via instant messaging; a third type, wherein the first network identity and the second network identity send personal communications to each other via email; a fourth type, wherein hyperlinks exist between a blog of the first network identity and a blog of the second network identity; a fifth type, wherein the first network identity has posted a comment on a blog of the second network identity; a sixth type, wherein the first network identity and the second network identity have communicated via corporate email; and a seventh type, wherein the first and second network identities have both posted messages in the same thread on a blog and/or discussion forum.

30. The system of any of claims 27 to 29, wherein the data relating to the closeness of a connection comprises a frequency of communication between the first network identity and the second network identity.

31. The system of any of claims 27 to 30, wherein the network device is configured to identify the employee as a possible source of the leaked information by using the closeness of the connections between the employee and the network identity of the poster of the leaked information to determine a likelihood that the employee is the source of the leaked information.

32. The system of any of claims 19 to 31, comprising a Web crawler configured to locate the leaked information on the external network.

33. The system of any of claims 19 to 32, wherein if the network identity of the poster of the leaked information is not in the identity-relationship database, the network device is configured to: use the network identity of the poster of the leaked information as a target network identity; search one or more sites on the external network for the target network identity to produce search results; detect a relationship between the target network identity and a connected network identity in the search results; determine whether the connected network identity is in the identity-relationship database; link the target network identity to the connected network identity in the identity-relationship database, if the connected network identity is in the identity-relationship database; and repeat the process with the connected network identity as the target network identity, if the connected network identity is not in the identity-relationship database.

34. The system of any of claims 19 to 33, further comprising a management device, and wherein the network device is configured to send an alert to the management device if the employee is identified as a possible source of the leaked information.