WO2017050991A1

WO2017050991A1 - Aggregating profile information

Info

Publication number: WO2017050991A1
Application number: PCT/EP2016/072737
Authority: WO
Inventors: Razvan DINU; Tom SAVAGE; Alexandru George CAZACU; George IONITA; Mihai BOGDAN; Traian REBEDEA
Original assignee: 3Desk Ltd
Priority date: 2015-09-25
Filing date: 2016-09-23
Publication date: 2017-03-30
Also published as: GB201517008D0; GB2543740A

Abstract

A method of aggregating profile information, comprising: automatically gathering peoples' profiles from multiple websites; identifying multiple features in each profile; normalizing the values of the features into a common format; and forming a profile graph by representing each of the profiles as a corresponding node in the profile graph, and connecting each of a plurality of pairs of the nodes with one or more edges representing matches between the values of features found in both profiles. The method further comprises: matching together different ones of the profiles into groups based on the edges between the corresponding nodes in the profile graph, each group estimated to be the profiles of a respective same one of the people; for each of the groups, aggregating at least some of the profiles of the group into an aggregate profile of the respective person; and making the aggregate profiles available via the Internet.

Description

Aggregating Profile Information

Background

Over 30% of all online searches (i.e. 3 billion per day) are searches for people, yet there is no efficient way to comprehensively search someone's online footprint. To harness a person's complete profile one would have to manually go from site to site visiting that person's profile on each different site containing information about them. There are currently 21 billion profiles on the web, and within two years the number is expected to reach 50 billion. As the amount of data grows exponentially, the problem is expected to get worse.

People searches are conducted billions of times per day by individuals and organizations in the public and private sector, for various reasons, including sales (e.g. identifying and qualifying influences in the buying process), marketing (targeting communications), recruitment (sourcing candidates), finance (e.g. credit checking) and monitoring political, social and environmental issues. "People data" also powers many products, such as recommendation engines in shopping and search sites.

A number of companies have attempted to build a people search solution, yet no one to date has managed to create a significant, accurate data set. The result is an incomplete user experience, and insufficient data to power the most valuable use cases. Either decisions are made based on limited information, or time and cost are incurred through having to search supplementary data by hand.

Summary

The present disclosure provides techniques for gathering, normalizing, decorating and aggregating a large number of profiles (in embodiments tens of millions) in a fast and efficient manner. It uses a graph-based architecture and graph-based pattern matching in order to accurately match together the profiles of a given person gathered from multiple different sources on the web, then make the aggregated profiles available through any of a number of potential channels such as a web-based search engine, API, or plugin to another application.

According to one aspect of the present disclosure, there is provided a method of aggregating profile information, comprising: from multiple websites, automatically gathering profiles of multiple people profiled on those websites via the Internet, including, for at least some of the people, gathering multiple profiles of the same person from different ones of the websites; identifying multiple features in each of the profiles, including determining a value of each of the features; normalizing the values of the features of each profile into a common format; forming a profile graph by representing each of the profiles as a corresponding node in the profile graph and, based on the normalization into said common format, connecting each of a plurality of pairs of the nodes with one or more edges, each edge representing a match between the values of one of the features found in both profiles represented by the pair of nodes; matching together different ones of said profiles into groups based on the edges between the corresponding nodes in the profile graph, each group estimated to be the profiles of a respective same one of the people; for each of the groups, aggregating at least some of the profiles of the group into an aggregate profile of the respective person; and making the aggregate profiles available via the Internet.

In embodiments, the matching of the profiles may comprise: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a number of other nodes in common within a predetermined number of hops in said profile graph.

In embodiments, for at least some of said plurality of pairs of nodes, the nodes in the pair may be connected by more than one edge, each edge representing a match between the values of a different respective one of a plurality of said features found in both profiles represented by the pair of nodes.

In embodiments, the matching of the profiles may comprise: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a number of edges between the nodes of the pair.

In embodiments, the matching of the profiles may comprise: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a frequency of occurrence within the profile graph of a value of one of the features represented by an edge between the corresponding nodes.

In embodiments, the matching of the profiles may comprise: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a measure of similarity between non-exactly matching values of a feature found in both the corresponding profiles.

In embodiments, said identifying may comprise an entity extraction phase which identifies which of the features occur in different ones of the profiles and represents each of those features as an entity in an entity graph, and which further identifies relationships between the features and represents the relationships in the entity graph; wherein the entity graph may be an input to the step of forming the profile graph.

In embodiments, the features may include at least one of: name, academic institution, skills, occupation, employer, company, interests, and/or place of residence.

In embodiments, the identification of the features is performed at least in part using natural language processing.

In embodiments, the method may further comprise a validation phase in which, for at least some of the groups, one or more of the profiles are eliminated from the group in dependence on a further comparison between the profiles in that group; and said aggregation may aggregate only the profiles remaining after said elimination.

In embodiments, the further comparison may comprise determining whether the group contains profiles from the same website, and if so, the validation phase may eliminate one of the profiles from the same website.

In embodiments, the aggregated profiles may be made available through a searchable user interface.

In embodiments, when a search query is entered through said user interface searching for a value not yet represented in the profile graph, the search query may automatically trigger a gathering, via the Internet, of one or more further profiles from one or more websites based on the search query; and the method may further comprise updating the profile graph to include the one or more further profiles, and based thereon generating a new aggregate profile for a new person and/or an update to one or more of the existing aggregate profiles.

In embodiments, the method may be performed by a first provider, and the making available of the aggregated profiles may comprise: making the aggregate profiles available to the public through a website run by the first provider.

In embodiments, the method may be performed by a first provider, and the making available of the aggregated profiles may comprise: providing the aggregate profiles to a plugin of a web browser or other internet-enabled application provided by a second provider, such that the second provider can make the aggregate profiles available to users of said application.

In embodiments, the method may be performed by a first operator, and the making available of the aggregated profiles may comprise: providing the aggregate profiles to an API of a computer system run by a second provider, so the second provider can make the aggregate profiles available to users of said computer system.

According to another aspect disclosed herein, there is provided a server configured to perform the operations of any method disclosed herein.

According to another aspect of the present disclosure, there is provided a computer program product comprising code embodied on a computer-readable storage medium, and configured so as when run on one or more processors to perform operations of any method disclosed herein.

Brief Description of the Drawings

To assist understanding of the present disclosure and to show how it may be put into effect, reference is made by way of example to the accompanying drawings in which:

Figure 1 is a schematic block diagram of a computer network,

Figure 2 is a flow chart showing a method of aggregating profile information,

Figure 3 is a schematic illustration of a user interface, and

Figure 4 is a schematic representation of a graph-based matching process.

Detailed Description of Preferred Embodiments

Figure 1 gives an overview of a system arranged in accordance with embodiments of the present disclosure. The system comprises: a server of an aggregator service 102, the servers of multiple websites 103, a plurality of user terminals 104, and optionally the server of a third party 105 providing another service other than the aggregator or websites. Each of these components 102, 103, 104, 105 is coupled to the Internet 101 via any of a variety of wired and/or wireless technologies. It is by means of this arrangement that the various interactions described below occur. Note that a server herein refers to a logical entity which may comprise one or more server units at one or more geographical sites. Each of the user terminals 104 may take any suitable form, such as smartphone, tablet, laptop, desktop computer or set-top box.

Each of the websites 103 is a social media site or the like, whereby multiple different users can post profile information about themselves, and/or by which users can post profile information about others, via the Internet 101 using various ones of the user terminals 104. Users can also view the profiles individually from the individual websites through their user terminals 104. A given user often has multiple different profiles on each of multiple websites 103, and each profile may consist of a different selection of information about the user. E.g. a professional networking site may contain different information for the same user than a social networking site. Hence to obtain all the profile information on a given person, someone would have to visit all the sites individually. Nowadays that can be a lot of sites, making this a laborious task. Also, it may not be easy to find all the different sources of profile information.

To address this, the aggregator 102 is arranged to "crawl" multiple different websites 103 via the Internet 101 in order to automatically gather together some or all of the different profiles of each of multiple individuals, and to aggregate the different profiles into an aggregate (combined) profile for each of these people. The aggregator 102 then makes these aggregate profiles available to the user terminals 104 of other users via the Internet 101 (not necessarily the same user terminals 104 through which the profile information was originally submitted to the websites 103, though there may well be some overlap). The aggregate profiles could be made available in a searchable fashion through a special proprietary searching website run by the aggregator 102. Alternatively or additionally, the aggregate profiles could be made available through a third party system, product or service 105, by means of an API or plugin application configured to interface between the aggregator 102 and the third party's system, product or service 105. The third party 105 can thus in turn make the aggregate profiles available in a searchable fashion to its own users.

The process performed by the aggregator 102 is now described in more detail with reference to Figure 2.

The method begins at step 210 with a gathering phase. Here, the aggregator 102 gathers profile information for multiple people from multiple different sources on the web. It does this by a process called "scraping". In embodiments, the aggregator 102 comprises a separate scraper module for each of the different websites 103 it is arranged to recognize, each scraper being configured to interact with and parse the content of a different respective one of the websites 103. The scraper submits an HTTP request to the respective website it is designated to scrape, including an identification of a target person in the request (e.g. name, username, or email address). In response the website 103 returns the relevant content for that person. The scraper then parses the returned content to recognize various features that may be present, and extract values of those features (e.g. if the feature in question is occupation, the value may be "programmer"; and if the feature is place of residence, the value may be "San Jose"; etc.). In embodiments, the scraper is able to do this because it is pre-configured to know the predetermined format of the particular website 103 it is designed for, i.e. which fields or positions each of the different features appears in within the content returned from the website 103 in question. That is, the scraper uses pre-configured rules to know exactly where to look for a given type of information in a given page (e.g. the name is the "div" with the id "ftp-name"). Alternatively or additionally, in embodiments, the scraper may use natural language processing (NLP) to perform this task. For example NLP may be used to recognize whether or not a given page is indeed a profile page of a person, and/or to identify features (such as name, occupation, etc.) on pages where such features do not necessarily appear in fixed, predetermined fields. Thus the NLP can be used to extract features or fields from free-form text.
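By way of illustration only, the pre-configured, rule-based form of scraping might be sketched as follows in Python. The rule table, the element ids and the sample page are hypothetical, and the sketch assumes the target elements contain plain text with no nested tags; it is not the implementation of any particular embodiment.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: feature name -> id of the element that holds it.
SITE_RULES = {"name": "ftp-name", "occupation": "ftp-job"}


class RuleBasedScraper(HTMLParser):
    """Collects the text of elements whose id appears in the site's rules."""

    def __init__(self, rules):
        super().__init__()
        self.id_to_feature = {element_id: feature for feature, element_id in rules.items()}
        self.current = None   # feature currently being read, if any
        self.features = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.id_to_feature:
            self.current = self.id_to_feature[element_id]
            self.features[self.current] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.features[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self.current = None


def scrape(html, rules=SITE_RULES):
    """Return {feature: value} for the features located by the rules."""
    parser = RuleBasedScraper(rules)
    parser.feed(html)
    parser.close()
    return {feature: value.strip() for feature, value in parser.features.items()}


page = (
    '<html><body><div id="ftp-name"> Dave Example </div>'
    '<div id="ftp-job">programmer</div></body></html>'
)
profile = scrape(page)
```

In this sketch each website would have its own rule table, which corresponds to having a separate scraper per website.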

A third option which the scraper may be configured to use (again as an alternative or in addition to either or both of the pre-configured and/or NLP based approaches discussed above), is to use a dynamic approach based on one or more "extractors". An extractor as referred to herein is a hybrid between a pre-configured rule and a full NLP based approach, in that it begins with a predetermined rule about where to look within a webpage for a relevant field, but then uses NLP to determine the meaning of that field. For example, the rule may be to examine the H1 field of the page's HTML (this being the highest level of heading), or to examine the caption beneath the largest image on the page. For instance, it may be taken as a predetermined rule that the H1 field is likely to contain relevant profile information, and therefore the scraper's extractor should look there. However, different pages may use the H1 field for different purposes, and so from page-to-page it may include different types of profile information or sometimes no relevant profile information.

Therefore in addition to the pre-determined rule to examine the H1 field, the extractor applies NLP to that field in order to try to determine its meaning, i.e. what type of information it represents - e.g. does it comprise the name of the site, the name of the user whose profile the page is, etc. Thus extractors represent general rules or heuristics for extracting information out of a page (e.g. take the text from the first H1 element of a page, take the images and the closest text to them, extract links that contain specific attributes, extract links with rich content next to them etc.). Further, a dynamic scraper additionally uses a set of extractors and learns which ones extract good information, i.e. applies them and then validates the information. In the validation phase (to be discussed in more detail shortly) external services can be used to validate this. E.g. for an extractor, other services can be referenced in order to validate whether the extractor extracted a name (e.g. by looking up the name in an index) or if it extracted a location (using Google Maps API). Once the information is validated, it is included in the profile. The learning can also be done applying machine learning techniques on a set of positive and negative examples.
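As a minimal sketch of the extractor idea, the following Python applies a general rule (take the text of the first H1 element) and then validates the result before accepting it as a name. The name index here is a hypothetical in-memory set standing in for an external lookup service, and the simple lookup replaces the NLP step purely for illustration.

```python
import re

# Hypothetical stand-in for an external name index used for validation.
KNOWN_GIVEN_NAMES = {"dave", "steve", "james"}


def first_h1_extractor(html):
    """General rule: take the text of the first H1 element, if any."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Strip any markup nested inside the heading.
    return re.sub(r"<[^>]+>", "", match.group(1)).strip()


def looks_like_person_name(text):
    """Toy validator: at least two tokens, the first found in the name index."""
    tokens = text.split()
    return len(tokens) >= 2 and tokens[0].lower() in KNOWN_GIVEN_NAMES


def extract_name(html):
    """Apply the extractor, then keep its output only if it validates."""
    candidate = first_h1_extractor(html)
    if candidate and looks_like_person_name(candidate):
        return candidate
    return None
```

Here a page whose heading is "Dave Example" would yield a name, whereas a page whose heading is "Welcome to SuperTechCo" would be rejected by the validator.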

By whatever means the scraping is implemented, the scraper repeats the process for multiple different people whose profiles appear on the respective website 103 (in embodiments millions or even billions of people). Each of the different scrapers also performs a similar process for the multiple people's profiles appearing on the respective website 103 it is responsible for scraping. Thus the aggregator 102 is able to build up a large database of profile information, including multiple profiles of any given user if that user has profiles on different sites 103.

As part of parsing the profiles of the different websites 103, the aggregator 102 stores the profile information extracted from the various different sites 103 in a common format of the aggregator, i.e. converts the data into a common schema, so that it all looks the same regardless of which website 103 it was derived from. This may be referred to herein as normalizing the data. That is, each of the websites 103 publishes its data in its own different respective format, with its own fields in certain places (or even no fixed fields at all). The aggregator 102 then effectively performs a mapping exercise, such that a certain field of the website is mapped to a certain field of the common schema (or a certain feature extracted using NLP is mapped to a certain field of the common schema). Once normalized in this manner, the profiles extracted from different websites 103 are then ready to be understood in relation to one another, i.e. to be processed together as part of a common process.

Having gathered and normalized at least an initial set of data, the process then proceeds to the next phase 215, which may be referred to herein as the entity extraction phase. Here the aggregator 102 identifies entities that may repeatedly occur - e.g. an entity could be a name of a given user, a company, or a university, or a job title, etc. The purpose of this phase is so that when the aggregator encounters an entity again, e.g. a given company, it recognizes it as another instance of the same company. In the entity extraction phase, the aggregator also determines a relationship between the different entities, e.g. the relationship between user and company may be "works for", or the relationship between user and university may be "studied at". Thus, in the entity extraction phase 215 every normalized profile that comes from the scrapers is used to create one or more entities and relationships between them.

The union of the nodes and relationships for all profiles forms what may be referred to herein as "the entity graph". The nodes in this graph can be anything (e.g. a profile, a company, a website, a location, a role, a skill). If for example from a profile the entity extraction phase 215 determines that the person identified by a profile P from a certain website works for company C, then a node is created for both P and C and a relationship between them is created with the type "works_for". If a profile P links to a website W then two nodes are created and also a relationship between them with the type "links to", etc. Relationships are directed but can be traversed both ways.
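The normalization and entity extraction steps described above can be sketched as follows. This is an illustrative Python sketch only: the site names, field mappings, profile identifiers and the EntityGraph class are all hypothetical, and only the "works_for" relationship is modelled.

```python
# Hypothetical field mappings from each site's own format to the common schema.
FIELD_MAPS = {
    "site_a": {"full_name": "name", "employer": "company"},
    "site_b": {"displayName": "name", "worksAt": "company"},
}


def normalize(site, raw):
    """Map a site's raw fields onto the common schema, trimming whitespace."""
    mapping = FIELD_MAPS[site]
    return {mapping[key]: value.strip() for key, value in raw.items() if key in mapping}


class EntityGraph:
    def __init__(self):
        self.nodes = set()           # (kind, identifier) pairs
        self.relationships = set()   # (source, type, target) triples

    def add_profile(self, profile_id, normalized):
        profile_node = ("profile", profile_id)
        self.nodes.add(profile_node)
        company = normalized.get("company")
        if company:
            company_node = ("company", company)
            # Adding to a set reuses the node if this company was seen before.
            self.nodes.add(company_node)
            self.relationships.add((profile_node, "works_for", company_node))

    def profiles_related_to(self, node, rel_type):
        """Traverse a directed relationship backwards, from target to sources."""
        return sorted(
            source[1]
            for source, kind, target in self.relationships
            if kind == rel_type and target == node
        )


graph = EntityGraph()
graph.add_profile(
    "site_a/1", normalize("site_a", {"full_name": "Dave Example", "employer": " SuperTechCo "})
)
graph.add_profile(
    "site_b/9", normalize("site_b", {"displayName": "Dave Example", "worksAt": "SuperTechCo"})
)
```

Because both profiles normalize to the same company value, they end up related to a single company node, which is what later allows them to be connected in the profile graph.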

The next phase 220 is referred to herein as the clustering phase. In this phase, the aggregator 102 creates a "profile graph" representing the gathered data on the various different people from the various different websites 103. The clustering phase 220 constructs the profile graph using the entity graph as its input.

A portion of such a profile graph is illustrated by way of example in Figure 4. The profile graph 400 comprises two types of element: nodes 401 and edges 402. Each node 401 represents a given profile from a given website 103 (so the different profiles from the different websites 103 each have their own respective node 401, including that the different profiles of the same person each have their own respective node 401). The edges 402 represent connections between profiles. In embodiments, there could simply be either one or zero edges between any given pair of nodes 401: i.e. they are either connected or not (e.g. based on some overall test of whether they correspond to the same person). However, preferably, in embodiments multiple edges can be allocated between any given pair of nodes 401. E.g. in the example of Figure 4, the nodes 401a and 401b are connected by three distinct edges 402i, 402ii, 402iii.

In this case, each edge 402 represents a match for a given feature. I.e. if both of a pair of nodes 401 represent profiles for which a certain feature is present (note that not all nodes necessarily have the same feature set), and if the values of that feature match, then a respective edge 402 is created between the two nodes 401. For instance, if the feature in question is name, and if the two profiles both include a name and the two values are both "Dave Example" then an edge 402i is created between the respective nodes 401a, 401b, with this edge 402i representing the feature of name. But if the values are instead, say, "Dave Example" and "Steve Forinstance", then no edge is added. And if the two values for another feature such as employer are both "SuperTechCo", then another edge 402ii is added between the same pair of nodes 401a, 401b, and so forth. Other examples of features that could be used to create respective edges include: school, university, skills, hobbies, company, home town, current town of residence, country of residence, citizenship, address, email address, etc. Some such items of information could also be broken down into separate features, e.g. the name could be broken down into given name and family name, or the address could be broken down into two or more of street, town and postcode, etc.

This process is applied across all the possible combinations of nodes 401 in the graph 400, to try to find as many different connections for as many different features as possible. In embodiments, an edge 402 is created only for an exact match between the values of the feature in question. Alternatively however, it is not excluded that edges could be created based on an inexact match. E.g. metrics are known for measuring the similarity of two strings, and/or the aggregator 102 could be configured to recognize certain predetermined variants of a value (such as that Jim is another form of the name James).
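The construction of the profile graph with one edge per matching feature can be illustrated with the following Python sketch. The profile identifiers and the dictionary-based representation are assumptions made for the example, and only exact matches create edges.

```python
from collections import defaultdict
from itertools import combinations


def build_profile_graph(profiles):
    """Return {(id_a, id_b): {feature, ...}} with one edge per matching feature.

    `profiles` maps a profile id to its normalized {feature: value} dict.
    """
    edges = defaultdict(set)
    for id_a, id_b in combinations(sorted(profiles), 2):
        a, b = profiles[id_a], profiles[id_b]
        for feature in a.keys() & b.keys():   # only features present in both
            if a[feature] == b[feature]:      # exact match on normalized values
                edges[(id_a, id_b)].add(feature)
    return dict(edges)


profiles = {
    "site_a/1": {"name": "Dave Example", "employer": "SuperTechCo"},
    "site_b/9": {"name": "Dave Example", "employer": "SuperTechCo", "home_town": "San Jose"},
    "site_c/4": {"name": "Steve Forinstance", "employer": "SuperTechCo"},
}
profile_graph = build_profile_graph(profiles)
```

In this example the first two profiles are joined by two edges (name and employer), while each is joined to the third by a single employer edge. Comparing every pair is quadratic in the number of profiles, so a large-scale system would presumably index values rather than enumerate pairs; that optimization is outside the scope of this sketch.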

Once the clustering phase 220 has created a suitably large profile graph 400 with a suitably large number of edges 402, the aggregator 102 proceeds to the next phase of the process, which is the profile matching phase 230. This phase looks for patterns in the profile graph 400 that indicate whether different profiles appear to belong to the same person (within some acceptable likelihood). As a simple example, one could guess that if two profiles have the same name and the same employer (so two edges 402 for two particular features), there is a 99% chance they are for the same person.

In embodiments, the profile matching phase 230 works based on any one or more of a variety of heuristics that may be evaluated based on the profile graph 400, and in embodiments based on a combination of such heuristics.

One example of such an heuristic is based on a number of other nodes in common within a predetermined number of hops in said graph (a hop is wherever nodes 401 are connected by at least one edge 402). For example, this heuristic may be evaluated on a yes/no basis, such that it is true for a given pair of nodes 401 if the nodes are within a predetermined number of hops, e.g. they are adjacent neighbours (one hop) or within a path of two hops, but false otherwise. And/or, as another example, such an heuristic may measure a number of common neighbours within a predetermined number of hops (i.e. how many such neighbours exist). E.g. it may measure the number of adjacent neighbours in common (one hop), or a number of common neighbours within a path of two hops. In this case the heuristic may be true if two nodes 401 share above a threshold number of neighbours in common within the predetermined number of hops, and false otherwise; and/or the number of neighbours within the hop limit could be used as a measure of the likelihood of a match as a matter of degree.

The above types of heuristic can be used alone. However, such heuristics only determine whether nodes are connected or not, or to what degree they are connected, e.g. whether they are connected within some degree of separation, or whether they share a certain number of neighbours. I.e. these heuristics are only based on whether or not nodes 401 are connected at all (by any edge 402). Preferably however, to improve the matching process, the profile matching 230 may alternatively or additionally be based on one or more other heuristics that take into account the nature of the connections between nodes (i.e. one or more heuristics that make use of the fact that, in embodiments of the present disclosure, the edges are characterized as representing certain specific features).
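The common-neighbour heuristic can be sketched in Python as below. The adjacency-map representation, the function names and the default threshold are assumptions for the purpose of illustration; here the test is "at least `min_shared`" shared neighbours.

```python
def neighbours(adjacency, node, max_hops):
    """Nodes reachable from `node` in at most `max_hops` hops (excluding itself)."""
    seen = {node}
    frontier = {node}
    for _ in range(max_hops):
        frontier = {n for f in frontier for n in adjacency.get(f, ())} - seen
        seen |= frontier
    return seen - {node}


def common_neighbour_count(adjacency, a, b, max_hops=1):
    """Number of other nodes that both `a` and `b` reach within the hop limit."""
    shared = neighbours(adjacency, a, max_hops) & neighbours(adjacency, b, max_hops)
    return len(shared - {a, b})


def shares_enough_neighbours(adjacency, a, b, min_shared=2, max_hops=1):
    """Yes/no form of the heuristic."""
    return common_neighbour_count(adjacency, a, b, max_hops) >= min_shared


# Undirected example graph: a hop exists wherever at least one edge connects two nodes.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "E": {"D"},
}
```

The count can be used directly as a measure of degree, or compared against a threshold as in `shares_enough_neighbours`.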

An example of this is the number of edges between two nodes 401. Such an heuristic may be evaluated on a yes/no basis, such that it is true if the nodes share more than a threshold number of edges, e.g. two or three, and false otherwise. And/or, the number of edges may be taken as a measure of the likelihood that two nodes represent profiles of the same person, as a matter of degree. Another example is the rarity of the matching value, i.e. its frequency of occurrence within the profiles represented by the graph 400 (how many times does it occur statistically, e.g. as a proportion of the number of instances of edges representing the feature in question in the graph 400). A rare value of a given feature can be a strong indication of a link. For example, finding two profiles for the name "Dave Smith" does not give much confidence of a match, but finding two profiles for the name "Ezekial Q Nithercott" has a very low probability and therefore is a much stronger indication that they are likely to be for the same person. This heuristic could be evaluated on a yes/no basis, such that it is true if the frequency (proportion of occurrences of the feature) is below a certain predetermined threshold, but false otherwise. And/or, the frequency could be used as a measure of the likelihood of a match as a matter of degree.
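A minimal Python sketch of the edge-count and rarity heuristics follows. As a simplifying assumption, the frequency is computed here as a proportion of the profiles that carry the feature, rather than as a proportion of edges; the thresholds and sample data are likewise hypothetical.

```python
from collections import Counter


def value_frequency(profiles, feature, value):
    """Proportion of profiles carrying `feature` whose value equals `value`."""
    values = [p[feature] for p in profiles.values() if feature in p]
    return Counter(values)[value] / len(values) if values else 0.0


def is_rare(profiles, feature, value, max_frequency=0.3):
    """Yes/no form: true if the value occurs less often than the threshold."""
    return value_frequency(profiles, feature, value) < max_frequency


def enough_edges(edge_features, min_edges=2):
    """`edge_features` is the set of features matched between one pair of nodes."""
    return len(edge_features) >= min_edges


# Eight profiles with a common name and two with a rare one.
sample = {f"p{i}": {"name": "Dave Smith"} for i in range(8)}
sample["q1"] = {"name": "Ezekial Q Nithercott"}
sample["q2"] = {"name": "Ezekial Q Nithercott"}
```

With this sample data a name match on "Ezekial Q Nithercott" counts as rare and so as stronger evidence, whereas a match on "Dave Smith" does not.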

Yet another possible type of heuristic is a similarity between the values of a given feature. For instance, one or more metrics may be used that measure the similarity between two strings. Again this could be evaluated on a yes/no basis, so that it is true if the similarity is above a threshold and false otherwise; and/or, the similarity may be used as a measure of the likelihood of a match as a matter of degree.
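One possible similarity measure is sketched below using Python's standard `difflib.SequenceMatcher`. The choice of metric, the variant table and the threshold are assumptions for illustration; the disclosure does not prescribe a particular string metric.

```python
from difflib import SequenceMatcher

# Hypothetical table of known name variants, as in the Jim/James example.
VARIANTS = {"jim": "james", "dave": "david"}


def canonical(value):
    """Lower-case the value and replace known variants with a canonical form."""
    tokens = value.lower().split()
    return " ".join(VARIANTS.get(token, token) for token in tokens)


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the canonical forms are identical."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()


def is_similar(a, b, threshold=0.85):
    """Yes/no form of the heuristic."""
    return similarity(a, b) >= threshold
```

Under these assumptions "Jim Example" and "James Example" are treated as identical, a small misspelling still passes the threshold, and unrelated names do not.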

Note: in embodiments where the clustering phase 220 only adds edges 402 for exact matches, the similarity is not considered to be an heuristic which qualifies an edge 402 per se. E.g. the test could be that on condition that two nodes 401 are connected by, say, at least one edge 402, or at least two edges, the matching phase 230 then probes the profiles represented by those nodes further to look for features that are similar. Alternatively where edges 402 in the profile graph 400 are added in the clustering phase 220 to represent close but inexact matches, the measure of similarity may indeed be considered as a property of the edge 402.

Preferably a plurality of any two, more or all of the above metrics, and/or others, are combined in the decision making process in order to decide whether profiles are to be matched.
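One way such metrics might be combined is a weighted score compared against a threshold, sketched below in Python. The weights, caps and threshold are purely hypothetical, since no particular combination rule is specified; other schemes (e.g. a learned classifier) would be equally consistent with the description.

```python
from dataclasses import dataclass


@dataclass
class PairEvidence:
    edge_count: int           # number of feature edges between the pair
    rarest_frequency: float   # frequency of the rarest matching value, in [0, 1]
    common_neighbours: int    # shared neighbours within the hop limit
    best_similarity: float    # best similarity among non-exact features, in [0, 1]


# Hypothetical weights and threshold; none are given in the description.
WEIGHTS = {"edges": 0.4, "rarity": 0.3, "neighbours": 0.2, "similarity": 0.1}
MATCH_THRESHOLD = 0.5


def match_score(evidence):
    """Combine the individual heuristics into a single score in [0, 1]."""
    edges = min(evidence.edge_count, 3) / 3          # saturate at three edges
    rarity = 1.0 - evidence.rarest_frequency         # rarer value, higher score
    neighbours = min(evidence.common_neighbours, 5) / 5
    return (
        WEIGHTS["edges"] * edges
        + WEIGHTS["rarity"] * rarity
        + WEIGHTS["neighbours"] * neighbours
        + WEIGHTS["similarity"] * evidence.best_similarity
    )


def is_match(evidence):
    return match_score(evidence) >= MATCH_THRESHOLD


strong = PairEvidence(edge_count=3, rarest_frequency=0.01, common_neighbours=4, best_similarity=0.9)
weak = PairEvidence(edge_count=1, rarest_frequency=0.8, common_neighbours=0, best_similarity=0.2)
```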

The output of the matching phase 230 is a second graph. Here, again each node represents a given profile from a given website 103. However, in this graph each node is connected by only one or zero edges: matched or not matched.

Optionally, a validation phase 240 is then applied to remove some of the connections (edges) from this second graph. Here, one or more further heuristics are applied to eliminate edges that represent unlikely matches. Preferably, this phase 240 at least includes eliminating edges between nodes 401 that represent profiles from the same website 103, because it is unlikely that the same person has two different profiles on the same site 103. Another example is to break up overly large bunches of nodes 401 that are still connected, on the basis that a given person is unlikely to have more than a certain number of profiles (e.g. while he or she may have a lot of profiles, numbers in the hundreds start to become unlikely). For instance, if it can be identified that two subsets of nodes each contain many common neighbours between them, but only one edge connects the two subsets, then that edge may be eliminated. The reason for the matching phase 230 and separate validation phase 240 (as opposed to, say, just not including edges between profiles from the same site 103 in the first place) is that it has been found to produce better results to find as many potential connections in the graph 400 as possible based on the heuristics used in the matching phase 230, then eliminate some in the validation phase 240; rather than be overly selective at the matching phase 230 and potentially miss some connections that might prove useful. The result of the matching phase 230 and optional validation phase 240 will be a set of discrete sub-clusters, or groups, each representing a different respective person.
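The same-site elimination and the resulting grouping can be sketched as follows in Python. The sketch assumes profile identifiers of the hypothetical form "site/local_id" and uses a union-find structure to compute the connected groups; the breaking up of overly large groups is not shown.

```python
def site_of(profile_id):
    """Assumes ids of the hypothetical form 'site/local_id'."""
    return profile_id.split("/", 1)[0]


def validate(matches):
    """Drop matched pairs whose two profiles come from the same website."""
    return {(a, b) for a, b in matches if site_of(a) != site_of(b)}


def groups(nodes, matches):
    """Connected components of the match graph, found with union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for a, b in matches:
        parent[find(a)] = find(b)

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)
    return sorted(sorted(component) for component in components.values())


nodes = ["site_a/1", "site_a/2", "site_b/9", "site_c/4"]
matches = {
    ("site_a/1", "site_b/9"),
    ("site_a/1", "site_a/2"),   # same website: removed by validation
    ("site_b/9", "site_c/4"),
}
validated = validate(matches)
people = groups(nodes, validated)
```

In this example the same-site edge is removed, leaving one group of three profiles and one singleton group.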

Finally, the aggregator 102 proceeds to the identity building phase 250. Here it aggregates the profiles of each sub-cluster into a respective aggregate profile for the respective person, which it publishes via the Internet 101.
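The aggregation of a group into a single aggregate profile might look like the following Python sketch. The merge policy (keep every distinct value per feature, in first-seen order) and the identifier format "site/local_id" are assumptions for illustration, as no particular merge rule is specified.

```python
def aggregate(group, profiles):
    """Merge the profiles of one group into a single aggregate profile.

    Each feature keeps every distinct value seen, in first-seen order, and
    `sources` records which websites contributed.
    """
    merged = {"sources": [], "features": {}}
    for profile_id in group:
        site = profile_id.split("/", 1)[0]
        if site not in merged["sources"]:
            merged["sources"].append(site)
        for feature, value in profiles[profile_id].items():
            values = merged["features"].setdefault(feature, [])
            if value not in values:
                values.append(value)
    return merged


source_profiles = {
    "site_a/1": {"name": "Dave Example", "employer": "SuperTechCo"},
    "site_b/9": {"name": "Dave Example", "employer": "SuperTechCo", "home_town": "San Jose"},
}
aggregate_profile = aggregate(["site_a/1", "site_b/9"], source_profiles)
```

The `sources` list corresponds to the set of contributing websites, which a front end could display alongside the merged features.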

In one implementation, the aggregator 102 may host its own website which users can access via their user devices 104 in order to search for people from amongst the aggregated profiles. An example is illustrated in Figure 3, showing an example front-end user interface 300 of such a site. The user interface 300 comprises a search bar 301 in which a user can enter a search query, such as the name of a person or the name of a company. E.g. in the example shown, the user searches for the company name "SuperTechCo", to try to find people associated with this company. This brings up a list of results, each corresponding to a different person. For each result, the list may show a profile picture 302 included in the aggregate profile for the respective person, and/or the values of one, some or all of the features 303 in the aggregate profile (e.g. name, company, interests, etc.). In embodiments, the list may also include a set of icons or logos 304, one for each of the websites 103 from which the aggregate profile has been compiled. In further embodiments, the user may select (e.g. click or touch) one of the results in the list to summon up the complete aggregate profile 305.

Alternatively or additionally, the aggregator 102 may publish the aggregate profiles via other means, such as by making the aggregate profiles available to a plugin application or API of a third party 105 (being a different provider of a different product or service than the provider of the aggregation service 102, e.g. a different party, company, organization or legal entity). For instance, the aggregator 102 may provide or endorse a plugin application which plugs in to a web browser or to another internet-enabled application such as an email client or instant messaging (IM) client, enabling users of that application to access the people search functionality through that application via the plugin. As another example, the aggregator 102 may provide or endorse an API (application programming interface) which a third party 105 can integrate into their own computer system in order to access the people search through that system. An example application of this would be to incorporate the API into the internal computer system of a recruitment company in order to allow recruiters to collect information on job applicants or potential applicants. Note that the API may allow the aggregate profile information to be accessed in an automated fashion based on a database of names or other search criteria stored in the third party's system 105, e.g. so a recruiter can automatically update information on a large database of potential applicants they may wish to contact about new job openings.

It will be appreciated that the above embodiments have been described by way of example only. Other variants may become apparent to a person skilled in the art given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments, but only by the accompanying claims.

Claims

1. A method of aggregating profile information, comprising:

from multiple websites, automatically gathering profiles of multiple people profiled on those websites via the Internet, including, for at least some of the people, gathering multiple profiles of the same person from different ones of the websites;

identifying multiple features in each of the profiles, including determining a value of each of the features;

normalizing the values of the features of each profile into a common format;

forming a profile graph by representing each of the profiles as a corresponding node in the profile graph and, based on the normalization into said common format, connecting each of a plurality of pairs of the nodes with one or more edges, each edge representing a match between the values of one of the features found in both profiles represented by the pair of nodes;

matching together different ones of said profiles into groups based on the edges between the corresponding nodes in the profile graph, each group estimated to be the profiles of a respective same one of the people;

for each of the groups, aggregating at least some of the profiles of the group into an aggregate profile of the respective person; and

making the aggregate profiles available via the Internet.

2. The method of claim 1, wherein the matching of the profiles comprises: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a number of other nodes in common within a predetermined number of hops in said profile graph.

3. The method of claim 1 or 2, wherein for at least some of said plurality of pairs of nodes, the nodes in the pair are connected by more than one edge, each edge representing a match between the values of a different respective one of a plurality of said features found in both profiles represented by the pair of nodes.

4. The method of claim 3, wherein the matching of the profiles comprises: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a number of edges between the nodes of the pair.

5. The method of claim 3 or 4, wherein the matching of the profiles comprises: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a frequency of occurrence within the profile graph of a value of one of the features represented by an edge between the corresponding nodes.

6. The method of any preceding claim, wherein the matching of the profiles comprises: for each of the pairs of nodes, matching the corresponding profiles together into the same group in dependence on a measure of similarity between non-exactly matching values of a feature found in both the corresponding profiles.

7. The method of any preceding claim, wherein said identifying comprises an entity extraction phase which identifies which of the features occur in different ones of the profiles and represents each of those features as an entity in an entity graph, and which further identifies relationships between the features and represents the relationships in the entity graph; wherein the entity graph is an input to the step of forming the profile graph.

8. The method of any preceding claim, wherein the features include at least one of: name, academic institution, skills, occupation, employer, company, interests, and/or place of residence.

9. The method of any preceding claim, wherein the identification of the features is performed at least in part using natural language processing.

10. The method of any preceding claim, wherein:

the method further comprises a validation phase in which, for at least some of the groups, one or more of the profiles are eliminated from the group in dependence on a further comparison between the profiles in that group; and

said aggregation aggregates only the profiles remaining after said elimination.

11. The method of claim 10, wherein the further comparison comprises determining whether the group contains profiles from the same website, and if so, the validation phase eliminates one of the profiles from the same website.

12. The method of any preceding claim, wherein the aggregated profiles are made available through a searchable user interface.

13. The method of claim 12, wherein:

when a search query is entered through said user interface searching for a value not yet represented in the profile graph, the search query automatically triggers a gathering, via the Internet, of one or more further profiles from one or more websites based on the search query; and

the method further comprises updating the profile graph to include the one or more further profiles, and based thereon generating a new aggregate profile for a new person and/or an update to one or more of the existing aggregate profiles.

14. The method of any preceding claim, wherein the method is performed by a first provider, and the making available of the aggregated profiles comprises: making the aggregate profiles available to the public through a website run by the first provider.

15. The method of any preceding claim, wherein the method is performed by a first provider, and the making available of the aggregated profiles comprises: providing the aggregate profiles to a plugin of a web browser or other internet-enabled application provided by a second provider, such that the second provider can make the aggregate profiles available to users of said application.

16. The method of any preceding claim, wherein the method is performed by a first operator, and the making available of the aggregated profiles comprises: providing the aggregate profiles to an API of a computer system run by a second provider, so the second provider can make the aggregate profiles available to users of said computer system.

17. A server configured to perform the operations of any preceding claim.

18. A computer program product comprising code embodied on a computer-readable storage medium, and configured so as when run on one or more processors to perform the operations of any of claims 1 to 16.