WO2019008394A1 - Capture and retrieval of digital information - Google Patents

Info

Publication number
WO2019008394A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
task
entity
database
Prior art date
Application number
PCT/GB2018/051935
Other languages
English (en)
Inventor
Marc Sloan
Andrew O'HARNEY
Matteus TANHA
Alberto CETOLI
Stefano BRAGAGLIA
Original Assignee
Cscout Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB1710993.5A (GB201710993D0)
Priority claimed from GBGB1710997.6A (GB201710997D0)
Priority claimed from GBGB1710995.0A (GB201710995D0)
Application filed by Cscout Ltd
Publication of WO2019008394A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources

Definitions

  • the present invention relates to the capture and retrieval of digital information, for example on the internet.
  • the invention provides a digital autonomous system and method for online knowledge work capture, management, and assistance, for example, to a user of a web browser.
  • Google uses a knowledge base it calls the Knowledge Graph to provide structured and detailed information about a searched topic.
  • the "enhanced" information is gathered from a wide variety of sources, and may include a list of links to other potentially related websites. For example, entering the name of a company into the Google web search engine typically results in a summary of information relevant to that company, which, in addition to providing a simple overview (e.g. a brief description) of the company, might include "enhanced" information relating to key personnel, subsidiaries, contact details, current stock price, links to related websites, and so on.
  • the "enhanced" information output in response to a search query is, however, generic in the sense that it is not tailored to a specific task being performed by the user of the web browser, but is instead confined to the search session without consideration for the overall task.
  • the present invention provides an improved system and method for the capture and retrieval of information relevant to a task being performed by a user of the web browser.
  • a method in a data processing system comprising a processor and a memory, for automated retrieval of stored digital information during a user-performed task, comprising: receiving, by the data processing system, digital information accessed by a user during a current task; classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; comparing, by the processor, the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; determining, by the processor, whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and upon positive determination, by the processor, providing the user with digital information extracted from the one or more identified stored previous tasks.
  • a method for automated retrieval of stored digital information during a user-performed task comprising: receiving digital information accessed by a user during a task; determining at least one entity based on said received digital information; receiving further digital information accessed by a user during a task; determining a property of said entity based on said further digital information; collating digital information associated with said entity; and providing said collated digital information to the user.
  • a computer-implemented method of processing information during a user-performed task comprising: extracting information from at least one information source accessed by a user during a task; identifying at least one of an entity and a property associated with an entity from said extracted information; associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database; in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and providing said information relevant to the particular entity to the user.
  • a task comprises an information gathering task for a particular purpose, using at least one information source.
  • the at least one information source is accessed by the user via a network connection, such as via the Internet.
  • information is extracted automatically from the accessed at least one information source
  • said property comprises further information about said determined entity.
  • said further information about said determined entity comprises: a location, contact details, a skill, a role, a sector, an investment, or a document.
  • said property comprises an entity related to said determined entity; optionally said related entity comprises: a company, a person, a social media profile, a product, or a project.
  • the method may comprise weighting the related entities and/or properties according to the relevance and/or confidence associated with said entities and/or properties.
  • the digital information relates to at least one webpage.
  • the information being retrieved relating to the webpage may be HTML content; optionally the information further comprises the webpage URL; optionally, the information further comprises actions taken by the user while viewing the webpage.
  • a sequence of accessed websites may be mapped to a vector, for example thereby creating task/workflow embedding vectors.
  • the method may further comprise comparing a current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the current task and/or user.
  • comparing the current task against previous tasks may comprise identifying a primary entity corresponding to the current task, and searching for said primary entity in the stored previous tasks.
  • comparing the current task against previous tasks may comprise measuring the statistical similarity of the current task and one or more previous tasks, optionally using a trained classifier.
  • the information relating to the webpage may be retrieved by a (for example, plug-in) extension to the web browser and sent to the data processing system.
  • each user may be identified by an anonymous encrypted key.
  • the method may further comprise converting received digital information into a predetermined ontology.
  • the digital information is received as one or more first class objects.
  • receiving digital information comprises identifying entities and/or relations in the information.
  • identifying entities and/or relations in the information comprises comparing the information against a predetermined mapping.
  • identifying entities and/or relations in the information comprises using Named Entity Recognizers.
  • the method may further comprise allocating at least one of a score and a weighting to said entity identified in the information based on a confidence rating that said entity is accurately identified.
  • said further digital information accessed by a user during a task comprises information accessed by a user during the same task as said received digital information.
  • the method further comprises determining an entity representative of said task.
  • determining an entity representative of said task comprises determining an entity highly connected to other entities, weighting entities on their relevance to the task, and/or receiving an indication from the user.
  • said further digital information accessed by a user during a task comprises information accessed by a user during a previous task.
  • the further digital information is received in relation to the user.
  • the further digital information is received in relation to other users.
  • said previous task may be selected from a number of previous tasks in dependence on the relevance of the previous task to the current task.
  • the relevance of said previous task is determined on the basis of a primary entity of said current task being present in the previous task.
  • the relevance of said previous task is determined on the basis of connections between primary entities of said tasks.
  • the relevance of said previous task is determined on the basis of a measure of the similarity of workflows.
  • the relevance of said previous task is determined on the basis of a comparison of the websites visited during each task.
  • the method further comprises classifying said task based on said received digital information and/or said received further digital information.
  • the method further comprises predicting, by the processor, user-desired information based on at least one of: current digital information and previous digital information; querying an external data source for external information relevant to the predicted user-desired information; and upon positive determination of external information relevant to the predicted user-desired information, receiving said external information into the memory.
  • the method further comprises determining that a task is underway.
  • the method further comprises associating the identified at least one of an entity and a property associated with an entity with the current task.
  • the method further comprises associating at least one information source with a particular task.
  • the database is a graph database.
  • providing said information relevant to the particular entity to the user comprises using a user interface.
  • a user interface configured to: retrieve digital information from a webpage being accessed by a user performing a task on a web browser; transmit said retrieved digital information to a data processing system; receive stored digital information from the data processing system; and output said received stored digital information to said user; wherein the digital information output to the user is continually updated during performance of the task based on the webpages accessed by the user.
  • the user interface outputs the digital information within the web browser.
  • the output is in the form of a user interface element that is configured to display different digital information according to the type of information, for example web-links, email addresses, free text.
  • the user interface comprises a web browser extension arranged to communicate digital information with the web browser.
  • the means for providing said information relevant to the particular entity to the user comprises a user interface as described herein.
  • a system for automated retrieval of stored digital information during a user-performed task comprising: means for receiving digital information accessed by a user during a current task; means for classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; means for comparing the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; means for determining whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and means for providing the user, upon positive determination, with digital information extracted from the one or more identified stored previous tasks.
  • a system for automated retrieval of stored digital information during a user-performed task comprising: means for receiving digital information accessed by a user during a task; means for determining at least one entity based on said received digital information; means for receiving further digital information accessed by a user during a task; means for determining a property of said entity based on said further digital information; means for collating digital information associated with said entity; and means for providing said collated digital information to the user.
  • the system comprises a computing device in communication with a data processor, wherein the computing device is configured to capture digital information accessed by a user during a current task and to send the captured information to the data processor.
  • the digital information accessed by the user is on a webpage, preferably wherein said digital information comprises at least one of (HTML) content and the URL.
  • the computing device is configured to allow the user to access the webpage via a web browser.
  • the web browser comprises a (for example, plug-in) extension that is configured to capture the digital information on the web page.
  • a method for classifying user-performed tasks comprising: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks comprising: receiving a sequence of user accessed websites; and using a trained classifier, classifying sub-sequences of user accessed websites as particular user-performed tasks. In such a way, a user's task can be automatically classified which can lead to automatically providing the user with information relevant to the task.
  • classifying said task comprises using a trained classifier.
  • the trained classifier comprises a recurrent neural network.
  • the method further comprises training the classifier using a labelled sequence of website vectors as an input, thereby to build a(n internal) representation of the task.
  • the method further comprises projecting said representation of the task onto said classification vector.
  • the method further comprises training the classifier to classify a sequence as belonging to a predefined community, and to determine how well a sequence belongs to a classification.
  • said sequence of user accessed websites is represented as a vector.
  • the method further comprises splitting the sequence into at least one sub-sequence of accessed websites.
  • said at least one sub-sequence is mapped to a classification vector. For accuracy, the sub-sequences of website vectors may be iteratively broken or joined to reach an optimal classification quality.
  • the method further comprises determining a community of websites, said community comprising one or more webpages relating to a particular category of information.
  • the confidence level associated with said classification vector comprises a measure of prediction accuracy; optionally said measure of prediction accuracy comprises the perplexity of said classification vector.
  • said classification vector comprises a list of probabilities of the sequence belonging to a specific class.
  • the method further comprises determining the start and/or end of said task.
  • a system for classifying user-performed tasks comprising: means for receiving a sequence of user accessed websites corresponding to a user performed task; means for mapping said sequence of user accessed websites to a classification vector; and means for classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • a method for predictive searching of databases comprising: receiving digital information accessed by a user during a task into a memory; determining relevant data related to said task not stored within said memory; retrieving said relevant data from an external data source; and presenting said relevant data to the user; wherein said relevant data is determined in dependence on said data related to said task stored within said local memory.
  • a computer-implemented method of predictive searching of at least one information source comprising: extracting information from at least one information source accessed by a user during a task into a database; using the information in the database, identifying further information that is likely to be of relevance to the task, wherein the further information is not included in the information in the database; extracting the further information from at least one information source into the database; and presenting said further information to the user.
  • relevant data can be presented to a user without them having to proactively search for it.
  • the method may further comprise determining a classification of the task based on said received digital information; wherein said relevant data related to said task not stored within said memory is determined in dependence on said determined task classification.
  • determining a classification of the task based on said received digital information comprises: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • said relevant data related to said task not stored within said memory is determined in dependence on one or more identified entities in the contents of the memory.
  • the method further comprises predicting a future task in dependence on said task and wherein relevant data related to said task not stored within said memory is determined in dependence on said predicted future task.
  • the memory may comprise digital information related to previous tasks and/or tasks performed by other users.
  • the digital information related to previous tasks and/or tasks performed by other users is used to determine relevant data related to said task not stored within said memory.
  • the digital information accessed by a user during a task comprises information relating to an entity. So that the data is relevant to a primary entity, the method may further comprise identifying a primary entity in the digital information accessed by a user during a task in said memory, wherein the relevant data related to said task not stored within said memory relates to the primary entity.
  • retrieving said relevant data from an external data source may comprise querying for data related to the primary entity.
  • retrieving said relevant data from an external data source may comprise scraping a website.
  • retrieving said relevant data from an external data source may comprise querying an external application program interface (API).
  • the method may further comprise mapping said relevant data to the input of said API.
  • presenting said relevant data to the user may comprise compiling the relevant data retrieved from the external data source with digital information accessed by a user during a task on said memory.
  • presenting said relevant data to the user may comprise linking said data to data already in said memory.
  • said data may be linked to data relating to previous tasks.
  • a system for predictive searching of databases comprising: means for receiving digital information accessed by a user during a task into a memory; means for determining relevant data related to said task not stored within said memory, said relevant data being determined in dependence on said data related to said task stored within said local memory; and means for retrieving said relevant data from an external data source; and means for presenting said relevant data to the user.
  • the invention also provides a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, and a computer readable medium having stored thereon a program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
  • the invention also provides a signal embodying a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, a method of transmitting such a signal, and a computer product having an operating system which supports a computer program for carrying out the methods described herein and/or for embodying any of the apparatus features described herein.
  • Semantic web technologies e.g. OntoText & Cambridge Semantics
  • the present invention has no natural language query and retrieves useful information, preferably in the form of one or more first class information entities, rather than URLs. Information is augmented directly into the current webpage being accessed by the user.
  • the present invention may be considered to function as a task completion assistant for professionals conducting online research (e.g. knowledge workers such as recruiters, salespeople, investors, academics, analysts, etc.). It saves the researcher time, provides structure to their research task while promoting best practice, and allows organisations and individuals to make the most of the work being done every day by hundreds of millions of workers. In this way, a user may be provided with related information featured on other websites that assists them in their present (e.g. research) task without having to navigate to those websites to collate the information themselves.
  • Task-oriented products such as Cortana and Google Knowledge Box, for example, do not focus on professional task assistance, nor do they capture and/or represent current work being done across websites/applications.
  • Known technologies provide no ability to customise a search to fulfil a task, nor can the searches be queried against or shared.
  • Sector specific information aggregators such as Entelo, DueDil and FullContact, for example, have no understanding of overall workflow and do not dynamically present information based on information needs.
  • Task capture software such as ATLAS Recall takes periodic screenshots of a user's computer screen, which may include the user's web browsing content, and uses optical character recognition to store and index this content using conventional search technology. This content is not collated into tasks and no meaning is ascribed to what the user was doing with that content. Further, the only mechanism for retrieving the content is to perform a natural language search.
  • the present invention assists users with web-based tasks by keeping track of the work they are currently doing, which information is stored in an easily retrievable format. It uses techniques from natural language processing, knowledge representation and reasoning, deep and reinforcement learning, and dynamic/task based information retrieval to break the sequence of pages a user looks at down to the task level, predict intent, and deliver relevant information around this context based on a personalised knowledge graph.
  • A task preferably connotes a discrete activity performed by a user (or multiple users) for a particular purpose.
  • Tasks may include tasks performed by a user online, such as using social profiles to assess a candidate for recruitment, investigating a company for "know your customer" (KYC) purposes, or reading academic research papers, for example.
  • a task may include at least one of the following actions: using a web browser to perform (e.g. Google) searches, read a webpage, click a link, copy and paste content, complete a form, etc. These actions may occur in a sequence referred to herein as a "workflow".
  • the term "computing device" preferably connotes an electronic device having data input/output capabilities, a processor arranged to run software and a digital display, preferably configured to display said output in graphical form.
  • digital information preferably connotes information that can be managed and retrieved by a computing device, which information is stored (usually electronically) using a series of ones and zeros.
  • the term “mechanism” preferably connotes elements of the present invention that perform various operations, functions, and related aspects.
  • the term 'first class object' or 'first class entity' preferably connotes an object or entity which supports all mathematical / processing operations.
  • the term 'vector' preferably connotes a one dimensional array.
  • a vector is preferably a type of first class object.
  • the term 'ontology' preferably connotes a description of a domain, where the ontology is made up of a collection of concepts / classes / entities and the properties / relations between such concepts / classes / entities.
  • the term 'knowledge base' preferably connotes an ontology augmented with a set of rules that allow patterns in the information provided in the knowledge base to be found.
  • the term 'knowledge graph' refers to a knowledge base having data organized as a graph and/or implemented using a graph database.
  • the terms 'knowledge graph' and 'knowledge base' may be understood to be interchangeable.
  • web browser extensions such as may be used with Google Chrome (RTM), for example
  • the web browser extension described herein serves two main purposes: workflow collection and task card display.
  • a task card provides a user with information about their current workflow and task within the web browser extension, as well as optionally previous workflows and/or information predicted to be relevant to the user's future tasks.
  • Any apparatus feature as described herein may be provided as a method feature, and vice versa.
  • means plus function features may be expressed alternatively in terms of their corresponding structure.
  • Figure 1 shows an exemplary system according to the present invention
  • Figure 2 shows a system overview in more detail
  • Figures 3A and 3B show the system architecture
  • Figure 3C shows a flow diagram showing the steps of a computer-implemented method of processing information during a user-performed task for use with the system
  • Figure 3D shows a flow diagram showing the steps of a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system;
  • Figure 3E shows a flow diagram showing the steps of a computer-implemented method of predictive searching of at least one information source for use with the system
  • Figure 4 illustrates the knowledge / workflow representation aspect in more detail
  • Figure 5 shows an example of domain taxonomy
  • Figure 6 shows the taxonomy of Figure 5 with relationships shown
  • Figure 7 shows a map of classified websites created for classifying a user workflow
  • Figure 8 shows a neural network classifying a user workflow
  • Figure 9 illustrates the vectorising of a document
  • Figure 10 shows an exemplary task card
  • Figure 11 shows a graphical knowledge base that provides information for the task card
  • Figure 12 shows an example of a Question Answering mechanism
  • Figure 13 shows a history of previous tasks
  • Figures 14A and 14B show an example of how relevant information can be captured and presented to a user during a task
  • Figures 15A and 15B show an example of a knowledge graph showing previous stored information relating to the task of Figure 12A, and the presentation of stored information;
  • Figures 16A and 16B show how the stored information illustrated in Figure 13A may be retrieved for automated population of text fields;
  • Figure 17 shows another example of how relevant information can be captured and presented to a user during a task
  • Figure 18 shows an example of the stored information being presented to a user via a mobile computing device
  • Figure 19 shows a schematic representation of a present graph
  • Figure 20 shows a schematic representation of a past graph
  • Figure 21 shows a pipeline for constructing the past entity search
  • Figure 22 shows a schematic representation of a future graph
  • Figure 23 shows a schematic representation of a super graph
  • Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system
  • Figure 25 shows the flows of information through the system
  • Figure 26 shows the schematic operation of a task manager of the system and associated components
  • Figure 27 shows the architecture of an integration description language (IDL).
  • Figure 28 shows a graph of pages visited by a user
  • Figure 29 shows a vector transformation of the graph of Figure 28
  • Figure 30 shows the architecture of a neural net for predicting the user related to the graph of Figure 28.
  • Figure 31 shows schematic hardware components configured to implement the described system.
  • Figure 1 presents an exemplary system 100 according to the present invention in which a user is accessing the internet via a web browser running on a computing device.
  • a web browser extension, running on the web browser, monitors the webpages accessed by the user, retrieves certain information from the accessed webpages and communicates information relating to the accessed webpages between the web browser and a separate data processing system comprising a processor and a memory.
  • Knowledge workers typically use the internet (or "web") to complete professional tasks, which may involve performing multiple searches, manual information consolidation, and translational effort in moving/sharing completed work between formats, applications, and people. This is inefficient and relies on the time, skill, and memory of the worker.
  • the present system takes, as input, current information (e.g. HTML content, URL) from a webpage being accessed by a user and, optionally, user actions (e.g. mouse hovers, clicks, drags, etc.).
  • Facts e.g. entities + relations
  • the present task being performed by the user is then classified in the context of both the current information and previous information received relating to that user (e.g. company research, candidate research, technical question answering, etc.).
  • the "work" and "knowledge" may be represented in graphical format where tasks, entities, and relationships accessed and viewed by the user are represented as "first class objects" that can be queried and logically reasoned.
  • the representation of work and knowledge in graphical format facilitates a generic query mechanism.
  • a body of text e.g. HTML content
  • a piece of work e.g. a task performed by a user
  • the process involves receiving a request from the user and automatically formulating a query against the (knowledge) graph using a translation/query layer.
  • any text is described using a vector.
  • a workflow or task is also described as a vector.
  • the vector represents the 'sentiment' of the text/graph/task (i.e. generalized information related to the text/graph/task that the vector is describing, such as the subject of the text/graph/task, thereby to allow classification to take place)
  • the vector may be "a vector/information about recruitment", or "a vector/information about cinema tickets".
  • machine learning is used to associate these vectors. This may allow a vector representing a description of a task to be found to be similar to a vector of a graph that describes the information in that task.
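  • The association of vectors described above can be illustrated with a similarity measure. The following is a minimal sketch only, assuming cosine similarity as the comparison and using made-up three-dimensional vectors; the function names and values are hypothetical and are not taken from the described system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the name of the candidate vector closest to the query vector."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Hypothetical embeddings (illustrative values only).
task_description_vec = [0.9, 0.1, 0.0]   # a task description "about recruitment"
graph_vec_recruitment = [0.8, 0.2, 0.1]  # a graph describing recruitment information
graph_vec_cinema = [0.0, 0.1, 0.9]       # a graph describing cinema tickets
```

  • In this sketch, a task description vector is matched to whichever stored graph vector points in the most similar direction.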
  • a web browser extension 1 may be used to capture digital information (e.g. the content, URL, etc.) from an information source such as a web service (such as a website) 2 accessed by a user performing a task. Captured digital information is sent to the knowledge extractor 3 and to the workflow extractor 4.
  • the knowledge extractor 3 returns all the entities and relations found within the content of the webpage 2 just captured.
  • the workflow extractor 4 creates or updates user workflows (i.e. a sequence of actions forming a task). Extracted information is stored in graphical form (i.e. a knowledge graph or knowledge base) in a data store (not shown).
  • Extracted knowledge (in particular knowledge that relates to the current workflow) is presented to the user via an output 5, such as a task card which is part of a user-interface.
  • the browser extension may push the current task card towards the user (e.g. by displaying the task card as a 'pop-up' on the user interface), for example to notify the user that updated information is available to view.
  • the user may pose questions in (pseudo) natural language about the task card 5 in a query field 6 of the task card 5.
  • a translational parser converts the question into a graph query against the current subgraph.
  • the answer presented to the user consists of any matching data in the data store and is appended to the current task card 5.
  • the user may review the task card of previous workflows 7, and may provide search criteria to filter the list of previous task cards.
  • the information presented to the user is then updated accordingly.
  • the information retrieval mechanism (using the knowledge extractor 3 and workflow extractor 4) comprises three main aspects:
  • i. Information capture 200: the retrieval of information from a webpage being accessed by a user, for example the (e.g. HTML) content and URL, preferably together with the actions (e.g. mouse clicks, hovers, etc.) taken by the user when browsing the webpage;
  • ii. Work and knowledge representation 300: the task is represented in terms of the knowledge found during performance of the task, which representation may utilise knowledge graphing and/or neural graph embedding of workflows; and
  • iii. Work assistance 400: the work representation can be queried in a generic way, which may allow applications to be built that can assist the user, for example.
  • Figures 3A and 3B show two complementary illustrations of the system 100, where Figure 3A shows a component view and Figure 3B illustrates an implementation.
  • the knowledge/work representation 300 may be implemented as a knowledge base graph, as in Figure 3, where the links between each user are illustrated.
  • the question answering and fact recommendation components of the work assistance 400 may be implemented using an application sidebar on a web page, as in Figure 3B.
  • Work assistance 400 may also or alternatively take the form of an export to a further service, such as Google (RTM) Sheets, a web based customer relationship management (CRM) software, or custom CRM software.
  • the web services 2 used may include web pages, Github (RTM), Lusha (RTM), custom databases, or other third party software.
  • Figure 3C shows a flow diagram showing the steps of a computer-implemented method 10 of processing information during a user-performed task for use with the system 100, which makes use of the aspects mentioned above.
  • information is extracted from at least one information source accessed by a user during a task (i.e. the information capture 200 aspect is used).
  • relevant information in particular, at least one of an entity and a property associated with an entity
  • the identified relevant information is associated with a stored database of entities and properties (i.e. the knowledge graph) thereby to update the database. It will be appreciated that the second and third steps together make use of the work and knowledge representation 300 aspect.
  • a fourth step 18 in response to a user query related to a particular entity, information relevant to the particular entity is extracted from the database.
  • a fifth step 19 said information relevant to the particular entity is provided to the user.
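  • The steps of method 10 can be sketched as follows. This is an illustrative outline only: the class and function names, and the representation of facts as (entity, property, value) triples, are assumptions made for the example rather than details of the described system.

```python
from collections import defaultdict


class KnowledgeStore:
    """A toy stand-in for the stored database of entities and properties."""

    def __init__(self):
        self.properties = defaultdict(dict)  # entity -> {property: value}

    def update(self, facts):
        """Associate identified (entity, property, value) facts with the store."""
        for entity, prop, value in facts:
            self.properties[entity][prop] = value

    def query(self, entity):
        """Return the stored information relevant to a particular entity."""
        return dict(self.properties.get(entity, {}))


def identify_relevant(extracted, known_properties):
    """Keep only extracted facts whose property is one the store tracks."""
    return [(e, p, v) for (e, p, v) in extracted if p in known_properties]


# Information extracted from a source accessed during a task (made-up data).
extracted = [
    ("James Smith", "employer", "ACME"),
    ("James Smith", "page_font", "Arial"),  # not relevant, filtered out
]
store = KnowledgeStore()
store.update(identify_relevant(extracted, {"employer", "location"}))
answer = store.query("James Smith")  # response to a user query about the entity
```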
  • Figure 3D shows a flow diagram showing the steps of a computer-implemented method 20 of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system 100.
  • Classifying sequences of websites (or other information sources) accessed by the user may be useful in distinguishing discrete tasks from each other.
  • a first step 22 a sequence of user accessed websites is received.
  • sub-sequences of user accessed websites are classified as particular user-performed tasks using a trained classifier.
  • Figure 3E shows a flow diagram showing the steps of a computer-implemented method 30 of predictive searching of at least one information source for use with the system 100. Predictively searching for information of relevance may improve the utility of a database (for a particular task) by incorporating relevant information without a user's specific input.
  • a first step 32 information from at least one information source accessed by a user during a task is extracted into a database.
  • second step 34 further information that is likely to be of relevance to the task is identified using the information in the database. This further information is not included in the information in the database.
  • the further information is extracted from at least one information source into the database.
  • the further information is presented to the user.
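  • A minimal sketch of method 30 follows, under the assumption that the database is a set of (head, relation, tail) facts and that each information source is a callable returning such facts for a list of entities. The source shown is a hypothetical in-memory stand-in, not a real service.

```python
def predictive_search(database, sources, task_entities):
    """Pull in facts about the task's entities that the database lacks."""
    further = []
    for source in sources:
        for fact in source(task_entities):
            # "Further information" is, by definition, not yet in the database.
            if fact not in database and fact not in further:
                further.append(fact)
    database.update(further)  # extract the further information into the database
    return further            # to be presented to the user


def fake_profile_source(entities):
    """Hypothetical information source backed by a hard-coded catalogue."""
    catalogue = {
        "James Smith": [
            ("James Smith", "WORKS_AT", "ACME"),
            ("James Smith", "LOCATED_IN", "London"),
        ]
    }
    return [fact for e in entities for fact in catalogue.get(e, [])]


database = {("James Smith", "WORKS_AT", "ACME")}
new_facts = predictive_search(database, [fake_profile_source], ["James Smith"])
```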
  • structured information may be extracted from semi-structured and unstructured information, thus enabling the system to detect relevant information in various different websites in accordance with a predefined ontology (as described in more detail further on in relation to Figures 5 and 6).
  • a mapping mechanism may be provided for extracting certain information from the HTML structure of certain webpages, by creating a template for a given webpage in which all the different parts of the webpage are tagged such that when a user accesses that webpage the relevant information in those tagged parts can easily be identified and extracted by the mapping mechanism.
  • a web browser extension may be created that allows different HTML objects in a given webpage to be tagged to represent different entity types, and, optionally, also to associate certain relations between the tagged entities. Once a webpage is tagged, the web browser extension may be used as the mapping mechanism that collects the entities and relations from any webpage having the same structure.
  • the mapping mechanism can, however, only be utilized as long as there are webpages which have the exact same HTML template; when a webpage is accessed for which a template does not exist (e.g. which webpage has not been mapped), then only free text may be extracted from that webpage.
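  • The template-based mapping mechanism can be illustrated with a small sketch. It assumes, purely for the example, that a template is a dictionary from (tag, class attribute) pairs to entity types; the template contents and class names are hypothetical. Text in untagged parts of the page is ignored here, whereas the described system would pass it on as free text.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects text from HTML elements that a template has tagged."""

    def __init__(self, template):
        super().__init__()
        self.template = template  # {(tag, class): entity_type}
        self._current_type = None
        self.entities = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self._current_type = self.template.get((tag, css_class))

    def handle_data(self, data):
        text = data.strip()
        if self._current_type and text:
            self.entities.append((self._current_type, text))

    def handle_endtag(self, tag):
        self._current_type = None


# Hypothetical template for pages sharing one HTML structure.
template = {("h1", "name"): "PERSON", ("span", "org"): "COMPANY"}
page = '<h1 class="name">Jane Doe</h1><span class="org">ACME</span><p>free text</p>'

extractor = TemplateExtractor(template)
extractor.feed(page)
```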
  • NERs Named Entity Recognizers
  • Such NERs are models which find entities based on textual context and patterns in free text. For example, a relation may be extracted if two or more entities are found in the same sentence along with a few other restrictions such as that the distance between the entities cannot exceed a certain limit.
  • the extracted entities and relations are all allocated a confidence ranking.
  • an additional step is used to verify the entity. This step could be, for instance, checking that an entity of type "Person" is present in a database of person names. If an entity is verified then there is an increase in the confidence of that entity.
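  • The co-occurrence rule and the verification step can be sketched as below. The token-distance limit, the size of the confidence increase, and the name database are illustrative assumptions; the entity recognition itself is taken as given.

```python
def extract_relations(sentence_entities, max_distance=10):
    """Pair up entities found in one sentence if they are close enough.

    sentence_entities: list of (text, entity_type, token_index) tuples.
    """
    relations = []
    for i, (a_text, _a_type, a_pos) in enumerate(sentence_entities):
        for b_text, _b_type, b_pos in sentence_entities[i + 1:]:
            if abs(a_pos - b_pos) <= max_distance:
                relations.append((a_text, b_text))
    return relations


def verify(entity_text, entity_type, confidence, name_databases, boost=0.25):
    """Raise the confidence of an entity found in a database for its type."""
    if entity_text in name_databases.get(entity_type, set()):
        return min(1.0, confidence + boost)
    return confidence


entities = [
    ("Jane Doe", "PERSON", 0),
    ("ACME", "COMPANY", 4),
    ("London", "LOCATION", 30),  # too far from the others to be related
]
relations = extract_relations(entities)
verified = verify("Jane Doe", "PERSON", 0.5, {"PERSON": {"Jane Doe"}})
```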
  • the entities and relations are sent to the database along with their entity types and confidences.
  ii. Work and knowledge representation (300)
  • Figure 4 shows the knowledge/workflow representation 300 aspect of Figure 3 in more detail, in particular showing the general conceptual model used to represent users, workflows/tasks and the information in those workflows as a graph.
  • An exemplary summary of the concepts i.e. the nodes in the graph
  • the relationships between them e.g. the directed edges in the graph
  • Each concept is independent from other concepts (except for MANAGER that is also USER).
  • Each node is accompanied by a curly bracket in which are listed common properties that may be tracked for that concept. Where two nodes are connected by an edge, a relationship exists between those concepts.
  • Edges have names (and sometimes also properties); for instance: workflows have an ident, an initial timestamp and a final timestamp; workflows belong to users (which have an ident, an anon_key, a username and a text), relate to a category (which has a text) and include pages (which have an event_ident, a timestamp, a page_ident, a URL, a domain and a title).
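  • As a rough illustration of this conceptual model, nodes and named edges with properties could be held in simple records. The property names follow the text above; the edge names (BELONGS_TO, RELATES_TO, INCLUDES) and all values are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                       # concept, e.g. USER, WORKFLOW, PAGE
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    name: str                        # relationship name (hypothetical here)
    source: Node
    target: Node
    props: dict = field(default_factory=dict)


user = Node("USER", {"ident": 1, "anon_key": "k1", "username": "jdoe", "text": "J. Doe"})
category = Node("CATEGORY", {"text": "Company Research"})
page = Node("PAGE", {"event_ident": 10, "timestamp": 1000, "page_ident": 7,
                     "URL": "https://example.org/about", "domain": "example.org",
                     "title": "About"})
workflow = Node("WORKFLOW", {"ident": 42, "initial_timestamp": 1000,
                             "final_timestamp": 2000})

edges = [
    Edge("BELONGS_TO", workflow, user),
    Edge("RELATES_TO", workflow, category),
    Edge("INCLUDES", workflow, page),
]


def neighbours(node, edges, name):
    """Nodes reached from `node` by following edges with the given name."""
    return [e.target for e in edges if e.source is node and e.name == name]
```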
  • Figure 5 shows the many specific types of entities and relationships of Figure 4 organised as a domain taxonomy (or ontology), which in particular shows the entity hierarchy. All of the relations are IS_CHILD_OF and represent specializations of entity types, e.g. an ADVISOR is a specialization of a PERSON, which is a specialization of an INDIVIDUAL. Structured extraction finds entities and relations of specified type on webpages in specific domains (e.g. parts of the website), and the mapping mechanism that allows such extraction is organised into a taxonomy of entity types. In such a data structure a child entity is a more specific concept than the parent entity (IS_CHILD_OF relation); it retains all the features of the parent type and possibly adds more.
  • All the entity types inherit the properties "text", "confidence", "surface form" and "source" from the base ENTITY type.
  • a child entity may inherit from different parent entities and, generally, sibling entities are not disjoint (e.g. a PERSON might be an ADVISOR and a DIRECTOR), but can be made explicit by means of the DISJOINT_WITH relation (e.g. an INDIVIDUAL is either a PERSON or a COMPANY).
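  • The IS_CHILD_OF and DISJOINT_WITH relations can be sketched as follows. The small taxonomy below is a hypothetical slice built from the entity types named in the text; placing INDIVIDUAL directly under ENTITY is an assumption of the example.

```python
# child type -> list of parent types (a child may have several parents)
IS_CHILD_OF = {
    "INDIVIDUAL": ["ENTITY"],
    "PERSON": ["INDIVIDUAL"],
    "COMPANY": ["INDIVIDUAL"],
    "ADVISOR": ["PERSON"],
    "DIRECTOR": ["PERSON"],
}

# pairs of types that cannot both apply to the same entity
DISJOINT_WITH = {frozenset({"PERSON", "COMPANY"})}


def ancestors(entity_type):
    """All types that entity_type specializes, nearest first."""
    seen = []
    stack = list(IS_CHILD_OF.get(entity_type, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(IS_CHILD_OF.get(parent, []))
    return seen


def compatible(types):
    """True if no two of the given types (or their ancestors) are disjoint."""
    closure = set()
    for t in types:
        closure.add(t)
        closure.update(ancestors(t))
    return not any(pair <= closure for pair in DISJOINT_WITH)
```

  • In this sketch, an entity typed as both ADVISOR and DIRECTOR is accepted, while one typed as both ADVISOR and COMPANY is rejected because PERSON and COMPANY are declared disjoint.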
  • the entity types shown in Figure 5 are from the recruitment domain, but it will be appreciated that many other domains (having different or adapted entities) may alternatively or additionally be used.
  • Figure 6 shows the taxonomy of Figure 5, showing the conceptual relations that can exist between entities.
  • an ADVISOR has an ADVISES relation to a COMPANY entity.
  • Relations extracted via structured extraction can be applied on the same taxonomy graph to make the relations between entities explicit. If a relationship exists between two entities, all the child entities of the tail entity (the entity from which the relationship goes out) potentially retain the same relationship toward any child entity of the head entity (the entity into which the relationship comes) unless otherwise specified (e.g. an INDIVIDUAL might INVESTS_IN a COMPANY, but a COMPANY might also INVESTS_IN an ACADEMY, because COMPANY is a subclass of INDIVIDUAL and ACADEMY is a subclass of COMPANY).
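  • The inheritance of relations by child entities can be sketched as a check against the taxonomy. The parent map and relation set below are a hypothetical slice using the entity and relation names from the text; the "unless otherwise specified" exceptions are not modelled.

```python
# child type -> parent types
PARENTS = {
    "PERSON": ["INDIVIDUAL"],
    "COMPANY": ["INDIVIDUAL"],
    "ACADEMY": ["COMPANY"],
    "ADVISOR": ["PERSON"],
}

# (tail type, relation name, head type) declared on the taxonomy
RELATIONS = {
    ("INDIVIDUAL", "INVESTS_IN", "COMPANY"),
    ("ADVISOR", "ADVISES", "COMPANY"),
}


def is_a(child, parent):
    """True if child equals parent or specializes it via IS_CHILD_OF."""
    if child == parent:
        return True
    return any(is_a(p, parent) for p in PARENTS.get(child, []))


def relation_allowed(tail, name, head):
    """A relation holds for any child of its tail toward any child of its head."""
    return any(
        name == rel and is_a(tail, declared_tail) and is_a(head, declared_head)
        for declared_tail, rel, declared_head in RELATIONS
    )
```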
  • Unstructured extraction might find more relationships by parsing the FREE TEXT found in the pages in the given domain; if so, these relationships are added to the network of relationships in the previous picture. It should also be noted that unstructured extraction will eventually build a universal taxonomy of concepts and relations that will bridge over domains. Domains can coexist, too, so depending on the needs of a user, more taxonomies might be merged in a higher taxonomy to address these needs.
  • Figure 7 represents a map of websites, in which each accessed website is mapped to a vector (thereby creating document embedding vectors), whereby to create the map of websites for use in determining a user's workflow.
  • a particular task 11 is represented as a path on the map of websites, showing a user's progress between websites.
  • the points indicated on the path (labelled 1, 2, 3, 4) represent a sequence of visited websites.
  • RNN Recurrent Neural Network
  • the RNN is trained to classify sequences into predetermined communities and to determine how well a sequence belongs to a particular classification (for example by evaluating a confidence level).
  • the sequence of all visited websites is thereby divided into sub-sequences that remain within a community boundary.
  • the communities found may be used as labels for the tasks, as indicated in the 'key' for Figure 7, for example.
  • the RNN receives an input of labelled sequence of websites (i.e. vectors representing the websites themselves, as well as the order in which the user accesses the websites) to build an internal representation of the vectors.
  • the final hidden state of the RNN is projected onto a "workflow embeddings layer" (i.e. a characterization of a workflow in terms of websites visited).
  • This last layer is then projected onto the classification vector thereby to classify the websites into e.g. classes A, B, C, D...
  • the classification vector is a list of probabilities stating the confidence that the sequence belongs to a specific class.
  • the RNN thereby learns a representation for the entire sequence (the workflow embedding, which can be compared for similarity with other workflow embeddings by using the dot product), and a classification for a particular sequence (according to the predefined communities).
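  • The forward pass described above (website vectors in, final hidden state projected onto a workflow embedding, then onto a classification vector) can be sketched in plain Python. This is an untrained toy model under several assumptions: a simple tanh recurrent cell, random weights, and made-up dimensions. It shows the data flow only and says nothing about the architecture actually used.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product, used to compare two workflow embeddings."""
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class WorkflowRNN:
    def __init__(self, input_dim, hidden_dim, embed_dim, n_classes, seed=0):
        rng = random.Random(seed)  # random, untrained weights
        self.hidden_dim = hidden_dim
        self.w_in = random_matrix(hidden_dim, input_dim, rng)
        self.w_hidden = random_matrix(hidden_dim, hidden_dim, rng)
        self.w_embed = random_matrix(embed_dim, hidden_dim, rng)
        self.w_class = random_matrix(n_classes, embed_dim, rng)

    def forward(self, sequence):
        """sequence: ordered list of website vectors for one workflow."""
        hidden = [0.0] * self.hidden_dim
        for website_vec in sequence:
            from_input = matvec(self.w_in, website_vec)
            from_state = matvec(self.w_hidden, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        embedding = matvec(self.w_embed, hidden)            # workflow embedding
        class_probs = softmax(matvec(self.w_class, embedding))
        return embedding, class_probs


model = WorkflowRNN(input_dim=2, hidden_dim=4, embed_dim=3, n_classes=4)
embedding, class_probs = model.forward([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]])
```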
  • Websites (documents and text) are transformed from HTML into vector representations, which are passed to the RNN and used to classify the websites related to the vector representations.
  • the incoming sequences are provided continuously (i.e. a continuous input of websites visited is fed into the RNN).
  • the RNN breaks sequences into subsequences and/or joins subsequences as appropriate in order to improve classification quality.
  • An output of the RNN is a classification vector for a subsequence and/or a particular website.
  • the perplexity of the classification vector can be taken as a measure of the classification quality.
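  • A minimal sketch of this measure follows, assuming perplexity is taken as the exponential of the Shannon entropy of the classification vector (a common definition, though the text does not specify which is used). Under that assumption, a value near 1 indicates a confident classification and a value near the number of classes indicates an uncertain one.

```python
import math


def perplexity(class_probs):
    """exp(entropy) of a probability vector; zero entries contribute nothing."""
    entropy = -sum(p * math.log(p) for p in class_probs if p > 0)
    return math.exp(entropy)


confident = perplexity([1.0, 0.0, 0.0, 0.0])
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
```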
  • the detection of a start and/or end of a workflow may trigger further actions in the system. For example, when a workflow is ended and/or a different workflow started, the task card/knowledge base for the recently ended workflow may be completed and processed for later retrieval.
  iii. Work assistance (400)
  • Figure 10 shows one possible output of the system for work assistance, which is in the form of a "task card" user-interface providing information retrieved from the data processing system that relates to a current task being performed on a web browser by the user.
  • the task card is part of the web browser extension, which retrieves information from webpages accessed by the user and transmits that information to the data processing system.
  • the information presented in the example shown represents a Company Research task performed in respect of 'University of Town'
  • the task card provides information on persons of interest related to University of Town, information on their expertise, location information, linked organizations, and skills.
  • the task card in Figure 10 provides the following information: • What the task was about (University of Town in this case)
  • Sources of data from which information items are acquired e.g. logos of websites for which a mapping mechanism exists (not shown in Figure 10).
  • the user is also able to add notes to the task card manually or highlight text on any page and add it to the task card.
  • A screenshot of an instance of the taxonomy populated with the data from a workflow (i.e. Company Research) about University of Town is shown in Figure 11.
  • This interface allows a user to view a representation of the relevant knowledge graph, and therefore may be referred to as a 'graph viewer'.
  • This knowledge graph shows the complexity of numerous entities and relationships identified and captured while performing the workflow. Every information entity extracted from each webpage is shown, together with all of the relations between each entity with the page and each other. This is a simplified visualization of the knowledge base graph structure. Entity types (e.g. person, company, location, etc.) can be filtered in this view. Also, every entity has a 'score' that represents the confidence that it is accurate, its connectedness in the workflow and how prevalent it is across all workflows. Entities can also be filtered by this score. The complexity of workflows is clearly demonstrated by the graph viewer.
  • the knowledge base can be interrogated by means of an open interface in pseudo-natural language that assists the user in building the questions that they want to ask to the system, as shown in Figure 12.
  • the information retrieval mechanism is triggered (1) when a certain keyword is typed into a search query input box (e.g. "what", "which", etc.) (2).
  • the mechanism reads all the entity types (e.g. "artefact", "company" and "individual") from the taxonomy and populates a (preferably, pop-up) menu of options (3) from which the user can select the main concept of the question.
  • the mechanism retrieves all the relations and entity types that can be reached from the current entity and organises them in a menu (4) from where the user can select how to continue the question.
  • the previous target entity ("invests in academy") then becomes the current entity (6). This step can be executed zero or more times.
  • the menu also contains a special item "with text..." (5) after which the user can specify the text associated with the entity instance of interest (e.g. "University of Town") (6).
  • the question built so far is passed as input to a parser that utilises a bespoke translational grammar, built automatically from the taxonomy, to convert it into a database (DB) query. If the parser detects a semantic error, it prompts possible corrections to the user and waits for further input; otherwise, the mechanism formulates answers as output, as described further on.
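  • The translation of a menu-built question into a graph query can be sketched as below. The text does not name a query language, so the Cypher-style output, the function name, and the small table of reachable relations are all assumptions made for illustration. A semantic error is reported here by raising an exception that lists the available options, standing in for the prompt of possible corrections.

```python
# Hypothetical slice of the taxonomy: relations reachable from each entity type.
REACHABLE = {
    "COMPANY": {"INVESTS_IN": {"ACADEMY", "COMPANY"}},
    "INDIVIDUAL": {"INVESTS_IN": {"COMPANY"}},
}


def to_graph_query(main_type, hops, text=None):
    """Convert (main concept, [(relation, entity type), ...], text) to a query.

    Returns a (query string, parameters) pair.
    """
    current = main_type.upper()
    pattern = f"(n0:{current})"
    for i, (relation, target) in enumerate(hops, start=1):
        rel = relation.upper().replace(" ", "_")
        target = target.upper()
        allowed = REACHABLE.get(current, {})
        if target not in allowed.get(rel, set()):
            # Semantic error: the taxonomy has no such relation from here.
            raise ValueError(
                f"{current} has no {rel} relation to {target}; "
                f"available relations: {sorted(allowed)}"
            )
        pattern += f"-[:{rel}]->(n{i}:{target})"
        current = target
    query = f"MATCH {pattern}"
    params = {}
    if text is not None:
        # "with text..." constrains the last entity in the question.
        query += f" WHERE n{len(hops)}.text = $text"
        params["text"] = text
    return query + " RETURN n0", params


query, params = to_graph_query(
    "company", [("invests in", "academy")], text="University of Town"
)
```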
  • the menu is also optimised by looking at the data currently available in the search space; if no specific relation/individual pair is available from the current position, its item is removed from the menu.
  • the concepts in which the entities gathered by a user are organised are also used to index the instances in those classes for retrieval purposes.
  • the search might be bounded to the user's current unit of work, the user's past work, or all the past work done by the user's team.
  • the data that match the search criteria is sorted by team (if applicable), unit of work (if applicable) and by centrality/TF-IDF to be ranked.
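  • The TF-IDF part of this ranking can be sketched as follows, treating each unit of work as a list of entity mentions. The smoothed inverse document frequency used here is one common variant and is an assumption of the example; sorting by team, unit of work, and centrality is omitted.

```python
import math
from collections import Counter


def tf_idf_scores(current, all_workflows):
    """Score entities in the current unit of work against all units of work.

    current: list of entity mentions; all_workflows: list of such lists.
    """
    term_counts = Counter(current)
    n_workflows = len(all_workflows)
    scores = {}
    for entity, count in term_counts.items():
        doc_freq = sum(1 for workflow in all_workflows if entity in workflow)
        idf = math.log((1 + n_workflows) / (1 + doc_freq)) + 1  # smoothed variant
        scores[entity] = (count / len(current)) * idf
    return scores


def rank(current, all_workflows):
    scores = tf_idf_scores(current, all_workflows)
    return sorted(scores, key=scores.get, reverse=True)


current = ["ACME", "ACME", "London"]
history = [current, ["London"], ["London", "Bob"]]
ranking = rank(current, history)
```

  • In this made-up data, an entity mentioned often in the current unit of work but rarely elsewhere outranks one that appears in every unit of work.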
  • the most relevant data (if any) is added to the output (e.g. task-card). If the graph does not yet contain an answer, a placeholder is added to the output, which will be automatically replaced by the answer when it becomes available.
  • Figure 13 shows a dashboard user-interface comprising a collection of task cards (representing different or overlapping knowledge graphs) a user has accumulated while using the system.
  • a timeline shows the websites and searches made during the workflow. Users may organise their task cards into folders / projects / task types.
  • a search interface allows a user to find a specific task card using key words.
  • the dashboard may also be used to access knowledge graphs corresponding to task cards. All created knowledge graphs are stored into a database for retrieval at a later date.
  • Figures 14A and 14B illustrate how information may be captured during a user-performed task, and that information used to enhance the task.
  • the user has accessed a particular webpage that lists certain information about the person who is the subject of that particular workflow (e.g. a research task for potential recruitment).
  • the system e.g. the web browser extension
  • Figure 15A shows a knowledge graph comprising the captured information on the various entities and relations relevant to the subject person in Figures 15A and 15B. It can be seen that there are three main entities, about which the other entities are interconnected.
  • the task card in Figure 15B provides a convenient user-interface for a user to be presented with information, which may be previous information stored from a previous task that is retrieved if it is relevant to a current task being performed by the user.
  • Figures 16A and 16B show how the stored information that is captured during a user-performed task may conveniently be used to autofill text fields. In this example, the system is being employed in a recruitment application.
  • the system has identified information from the webpage accessed by the user, where the information relates to the required fields, and has retrieved the required information and presented it in the task card.
  • an autofill function on the task card has been used to complete the required fields in the form using the information presented in the task card.
  • Figure 17 shows another example of the system capturing information from a webpage accessed by the user, this time the information being a telephone number captured from an email and sent to the server, in addition to being presented in the (updated) task card.
  • Figure 18 shows an example of a mobile computing device on which a user has accessed the task card of Figure 17.
  • the mobile computing device is also a mobile telephone device, which is now presenting the user with the option to reach the subject person by calling the telephone number previously captured from the email (in Figure 17).
  • Other possible applications (i.e. the work assistance 400 aspect) of the system include (but are not limited to) the following:
  • Task/Work management to manipulate and organise the work being captured (e.g. being notified of the previous work performed by a user, or a colleague for example, when starting a similar task or event)
  • Fact recommendation to assist with task completion e.g. auto-filling the information found into a task into a form, email, content management system/database, report, etc.
  • Action recommendation to suggest actions based on the current work (e.g. send an email to a person being researched)
  • the system may be arranged to communicate with a third party service thereby to receive input data and/or provide output data.
  • Another aspect of the invention is the provision of a related-entity search / predictive search engine.
  • a knowledge graph is built by the system, consisting of entities and relations from at least one web page. This knowledge graph represents a summary of the "present" task and information need, and so may be referred to as a 'present graph'.
  • Figure 19 shows a schematic representation of a present graph 170, where the nodes represent entities.
  • a primary entity 171 (or entities) can be determined. This is an entity representative of the whole task and given relative importance amongst the other entities.
  • the primary entity can be determined by one or more of:
  • the user's and/or the user's team's previous workflows may be accessed. For example, with reference to the example shown in Figures 10-18, a user may have done some research on James Smith, and then a week later they do some research on his employer, ACME. While researching ACME, it would be helpful for the user to be reminded of what they previously learned about James Smith and to be shown how it links to the current information about ACME.
  • Figure 20 shows a schematic representation of a 'past graph' 180 made up of knowledge graphs from several workflows having a common entity 171.
  • the determined primary entity is used to find workflow graphs in the user's workflow history.
  • the user's workflow history is queried to determine whether the primary entity is present. If the primary entity is present in any of the past workflows, the graphs for those workflows are retrieved. These graphs are then aggregated with the present graph to form a wider graph, which may be referred to as the 'past graph'.
  • the past graph contains all of the entities that have previously been found to have some association with the primary entity.
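  • The aggregation into a past graph can be sketched as follows, assuming each workflow graph is a set of (head, relation, tail) triples. The relation names and data are illustrative only.

```python
def entities_of(graph):
    """All entities appearing as head or tail of a triple in the graph."""
    return {head for head, _, _ in graph} | {tail for _, _, tail in graph}


def build_past_graph(present_graph, workflow_history, primary_entity):
    """Merge the present graph with every past graph containing the primary entity."""
    past_graph = set(present_graph)
    for graph in workflow_history:
        if primary_entity in entities_of(graph):
            past_graph |= graph
    return past_graph


present = {("ACME", "LOCATED_IN", "London")}
history = [
    {("James Smith", "WORKS_AT", "ACME")},        # contains ACME: retrieved
    {("University of Town", "LOCATED_IN", "Town")},  # unrelated: skipped
]
past_graph = build_past_graph(present, history, primary_entity="ACME")
```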
  • other methods may be used for determining related workflows other than using a primary entity, for example, a measure of the similarity of workflows may be used, or the websites contained in the workflows may be compared.
  • Figure 21 shows a pipeline for constructing the past entity search using the components of the system described with reference to Figure 2.
  • the pipeline consists of the following steps:
      i. Extract entities from the current webpage 2 using the knowledge extractor 3 and place them into the current workflow (or present graph).
      ii. If there is a primary entity 171 (such as a Google search query), extract it and extract any entities from it. In the example shown in Figure 21, the primary entity is 'John Smith'.
      iii. Using the primary entity, search the user's knowledge graph of previous workflows and find a list of all workflows the user has previously created containing the primary entity. Combine this information into a past graph.
      iv.
  • weights provide entities that are relevant to the current task, related to the primary entity, have a relatively high degree of confidence, and are contextually relevant to the user.
  • the entities are ranked using this combined weight and then the highest ranking entities are returned to the user, for example via the browser extension 5.
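  • Ranking by a combined weight can be sketched as below. The text does not state how the individual weights are combined, so the product used here, the weight names, and the values are assumptions made for illustration.

```python
def combined_weight(weights):
    """Combine per-entity weights; a product is assumed for this sketch."""
    return (
        weights["task_relevance"]
        * weights["primary_relatedness"]
        * weights["confidence"]
        * weights["user_context"]
    )


def rank_entities(entity_weights, top_k=2):
    """Return the top_k entity names ordered by combined weight, highest first."""
    ordered = sorted(
        entity_weights,
        key=lambda name: combined_weight(entity_weights[name]),
        reverse=True,
    )
    return ordered[:top_k]


entity_weights = {
    "ACME": {"task_relevance": 0.9, "primary_relatedness": 0.8,
             "confidence": 0.9, "user_context": 0.7},
    "London": {"task_relevance": 0.4, "primary_relatedness": 0.5,
               "confidence": 0.9, "user_context": 0.6},
    "Arial": {"task_relevance": 0.1, "primary_relatedness": 0.1,
              "confidence": 0.3, "user_context": 0.1},
}
top_entities = rank_entities(entity_weights)
```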
  • the browser extension may display a message indicating that a user's colleague has performed research about the primary entity or about another primary entity within the last week.
  • the past graph may be used to provide USER-based or TEAM-based recommendations.
  • Given the current WORKFLOW, the system identifies the CATEGORY and the ENTITIES that are closest to the primary ENTITY of the WORKFLOW. The system finds other WORKFLOWS (from the user, or from TEAM members) that PERTAIN to the same CATEGORY and/or are about any of the close ENTITIES identified above. This information is ranked by the number of connections to the initial WORKFLOW. The information ordered by relevance may be displayed on a side panel to give the user a selection of data with which to complete the current task or to suggest new actions.
  • Figure 22 shows a schematic representation of a 'future graph' 200 made up of entities found from 3rd party data sources such as the web, an API or a database.
  • data sources may be 'scraped' (i.e. data is extracted from human-readable output) to acquire the entities for use in a future graph, as will be described later on.
  • the primary entity is used as a way to find entities for the future graph, or other mechanisms depending on the data source and the information in the current graph.
  • Data is acquired from third party sources using a service layer (which may also be used to provide context-relevant information for work assistance, as previously described) configured to access such sources.
  • Example data sources include GitHub (RTM) and Google (RTM) Docs.
  • the future graph thereby acts as a kind of predictive search engine, requiring no active user input.
  • This 'predictive searching' capability is accomplished by initially understanding what the user's current task is, and hypothesising what the next task will be. For instance, if the user's current search query has already been satisfied, the next query can be estimated. The entities in the current and previous workflows can be used to make an accurate guess, and the current task type can be used to guess the next step. For example, if the user is performing a recruitment task then it is likely that the user will want information from LinkedIn (RTM) next. As the system knows what information the user has found so far and how it connects to the user's history of tasks, it can see which pieces of information are the most important in this task (using the past search functionality described above) and use that as a basis to search over LinkedIn for more information.
  • a pre-emptive search can be performed.
  • At least three types of data source will be used: a. API - Many web services have an API. They usually take some input and return data from that service. The most relevant entities from a user's workflow are extracted, matched to the inputs of the service, and data is retrieved from the service.
  • CRM Customer Relationship Management
  • Database Database
  • Figure 23 shows a schematic representation of a 'super graph' 210.
  • the previously described graphs are combined to form a super graph, which is made up of all of the information related to the current task (and primary entity).
  • the information in the super graph comes from the current tasks, tasks that the user has done in the past, and information they may wish to find in the future. In theory, this graph should contain all of the information that the user will need given the context of their current task.
  • the super graph is constructed so that we can perform entity search.
  • a super graph may contain thousands of entities, many of which are only tangentially related to the task. To resolve this, the entities are weighted and ranked so that we can determine which of the entities are contextually the most related to the current task (and/or primary entity) and use them accordingly.
  • the information in response to user's activities, without a specific user input).
  • the information may also be accessed in response to a specific user request, for example via a question in natural language format (as described earlier).
  • the information is generally presented as part of a task card (as previously described) or other user interface element.
  • Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system.
  • the wrapper comprises an input script to allow the system to query a third party API based on identified relevant entities in the knowledge base.
  • the third party API may then export data to the system via an output script of the wrapper, thereby adding new entities into a knowledge base.
  • the wrapper is thereby a translation layer from the context of a user (their task and history) to a service a third party can provide.
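  • A minimal sketch of such a wrapper follows, assuming the input script, the API call and the output script can each be modelled as a plain function, and the knowledge base as a list. The class name, the fake company-lookup API and its response fields are hypothetical and serve only to show the direction of the data flow.

```python
class DataWrapper:
    """Translation layer between the knowledge base and one third party API."""

    def __init__(self, input_script, call_api, output_script):
        self.input_script = input_script    # relevant entities -> API request
        self.call_api = call_api            # request -> raw API response
        self.output_script = output_script  # raw response -> new entities

    def run(self, relevant_entities, knowledge_base):
        request = self.input_script(relevant_entities)
        response = self.call_api(request)
        new_entities = self.output_script(response)
        knowledge_base.extend(new_entities)
        return new_entities


# Hypothetical stand-in for a third party company-lookup API.
def fake_company_api(request):
    return {"name": request["q"], "officers": ["Jane Doe"]}


wrapper = DataWrapper(
    input_script=lambda entities: {"q": entities[0]},
    call_api=fake_company_api,
    output_script=lambda resp: [("officer", name) for name in resp["officers"]],
)
```

  • Keeping the two scripts separate from the API call means a new third party can be supported by supplying a new pair of scripts, without changing how the knowledge base is updated.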
  • the taxonomy can guide the automatic collection of statistics, including (but not limited to) the following:
  • Figure 25 shows information flows into and out of the system 100.
  • the core of the system is shown as a "service system" 261.
  • the service system receives user actions and a stream of events and documents, and provides an assistive response.
  • the system 100 supplies document text to an information extraction component 262 and receives corresponding extracted information.
  • Such information, together with detected user events, may be supplied to a task manager 263, which identifies tasks (as previously described) and creates "task slots", as will be described.
  • the service system 261 is also configured automatically to receive input from third parties 266 via the internet 265 and a service integration component 264 using its "predictive searching" capabilities, as previously described.
  • the system 100 described herein is accordingly capable of operating as a Service Arbitrage layer between the user and third party services they already use or can use. In this way, the system automatically manages queries to relevant third party services that it estimates will help resolve the user's intent.
  • the system 100 is arranged to examine both the user's document set and actions taken to determine which, if any, tasks can be understood as such. If the combination of documents and actions being undertaken can be understood by the system as a task then it can begin servicing the user's intent. It does so by translating that task information into requests that can be understood by third parties.
  • the system 100 provides an assistive response to the user that can be, but is not limited to, a combination of information the system has organised/inferred, information from third parties, and actions that can be taken on third parties through the system.
  • Task Manager 263: At the core of the service system 261 is the Task Manager 263. It is the component of the system that decides task boundaries using all information and signals available. This includes the semantic information extracted by the information extraction component 262 from documents/websites and both the implicit and explicit actions the user takes when interacting with that information. If this combination of information and interactions can be understood by the system then the task manager 263 creates a task slot. Task slots represent defined tasks that the system can service requests for (e.g. a recruitment task about John Smith). Once it has been determined that a serviceable task is underway, the system is able to act on both internal data and data from third party data providers 266. In order to connect to a third party, a service integration component 264 is used, which connects to the third party service providers 266 via the internet 265.
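  • The decision to create a task slot can be sketched as follows. The rule table linking user actions to task types, the entity tuple format and the function name are assumptions made for illustration; the text does not say how the Task Manager decides that a combination of information and interactions is understood.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSlot:
    task_type: str
    primary_entity: str
    signals: list = field(default_factory=list)


# Hypothetical rule set: which user actions hint at which task type.
ACTION_HINTS = {"viewed_profile": "recruitment", "viewed_cv": "recruitment"}


def create_task_slot(entities, actions):
    """Return a TaskSlot if the entities and actions form an understood task.

    entities: (kind, name) pairs from information extraction.
    actions: names of implicit or explicit user actions.
    """
    hinted = [ACTION_HINTS[a] for a in actions if a in ACTION_HINTS]
    people = [name for kind, name in entities if kind == "person"]
    if not hinted or not people:
        return None  # not understood as a task, so there is nothing to service
    return TaskSlot(task_type=hinted[0], primary_entity=people[0], signals=list(actions))
```

  • Returning None when no task is recognised mirrors the behaviour described above, in which the system only begins servicing the user's intent once a task slot exists.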
  • a Service Data Transformation is created. This implies the ability to transform data between the internal representation of a task and the format required by the third party. It also implies a degree of Service Discovery: the capability to determine which of the services that a given third party exposes can be used for the task at hand (e.g. the GitHub service is queried if the user requires programming information).
  • a web-socket channel is created and stored in a first database 253.
  • a tracking event job is created and stored on a second database 254.
  • Each job is queued for extraction processing on the task queues 255.
  • the job is transferred to the extraction consumer pool 256.
  • Extraction workers use monolithic extraction services (such as NLTK, spaCy, and Gregory) 257.
  • Extractor services return entities to the extraction consumer pool 256.
  • The extractor worker stores triples in a third (graph) database 258.
  • Each job is queued for task extraction.
  • Task Extractors pull each job from the task queues 255 and determine serviceability on the task extraction consumer pool 259.
  • Task Extractors transform task data and query third parties 260.
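  • The job flow in the steps above can be sketched with two in-memory queues. This is a single-process illustration only: the toy extractor (which treats capitalised words as entities), the job dictionary format and the list standing in for the graph database are all assumptions, and the web-socket channel, the first two databases and the third party query are omitted.

```python
from queue import Queue

extraction_queue = Queue()
task_queue = Queue()
graph_db = []  # stands in for the third (graph) database of triples


def extract_entities(text):
    """Toy extractor: treats capitalised words as entities (an assumption)."""
    return [word.strip(".,") for word in text.split() if word[:1].isupper()]


def extraction_worker():
    """Pull one job, store a triple per entity, then queue task extraction."""
    job = extraction_queue.get()
    for entity in extract_entities(job["text"]):
        graph_db.append((job["id"], "mentions", entity))
    task_queue.put(job)


def task_extraction_worker(is_serviceable):
    """Pull one job and report whether its triples form a serviceable task."""
    job = task_queue.get()
    triples = [t for t in graph_db if t[0] == job["id"]]
    return job["id"], is_serviceable(triples)
```

  • A job is placed on extraction_queue, extraction_worker is run once to store its triples and hand it on, and task_extraction_worker is then run with a caller-supplied serviceability test.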
  • IDL integration description language
  • the IDL comprises: an IntegrationClient 271; a ServiceSubscription 272; a ServiceSubscriptionFactory 273; a DataService 274; a ServiceTransform 275; a DataRepository 276; a TokenBearerServiceSubscription 277; and an OauthServiceSubscription 278.
  • IntegrationClient 271: In order to arbitrate requests on a user's behalf, services often require brokers to identify/authenticate themselves during service requests. To do so, the services issue credentials to the system, which the system transmits when making requests. Such credentials, along with general configuration, are stored in this class.
  • ServiceSubscription 272: In order to arbitrate requests on a user's behalf, users are required to authenticate themselves through a given service. OAuth 278 and token 277 authentication are the two primary ways of doing so, and each results in a second set of credentials that the system can use during requests on behalf of the user. The concrete inheriting classes store these secondary credentials.
  • ServiceSubscriptionFactory 273: Mainly used to decouple authenticating logic from service specific logic and to generate an authenticated service for a specific task slot.
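  • The relationship between the subscription classes and the factory can be sketched as below. The class names come from the description above; the method names, the constructor fields, the credentials dictionary and the use of a bearer-style Authorization header for both variants are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class ServiceSubscription(ABC):
    """Holds the per-user (secondary) credentials for one third party service."""

    @abstractmethod
    def auth_headers(self):
        ...


class TokenBearerServiceSubscription(ServiceSubscription):
    def __init__(self, token):
        self.token = token

    def auth_headers(self):
        return {"Authorization": f"Bearer {self.token}"}


class OauthServiceSubscription(ServiceSubscription):
    def __init__(self, access_token, refresh_token):
        self.access_token = access_token
        self.refresh_token = refresh_token  # kept so access can be renewed later

    def auth_headers(self):
        return {"Authorization": f"Bearer {self.access_token}"}


class ServiceSubscriptionFactory:
    """Builds the right subscription so service logic never handles auth details."""

    @staticmethod
    def create(credentials):
        if credentials.get("kind") == "oauth":
            return OauthServiceSubscription(
                credentials["access_token"], credentials["refresh_token"]
            )
        return TokenBearerServiceSubscription(credentials["token"])
```

  • Code that queries a service only needs the headers returned by auth_headers, so it does not depend on which authentication method the user registered with.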
  • DataService 274: This is the implementing interface for all services. It contains all logic for querying a given service. The structure follows the commonplace CRUD (create, read, update, delete) framework, with the repository specific methods being implemented in the DataRepository class 276. Because all services implement at least the read method, there is a common interface for querying all sources a user is registered to at runtime.
  • ServiceTransform 275: This class provides a description of how data for a given service can be transformed to and from the graph. Implementing classes are able to use the ontology to define how the service format can be changed to triple format. This allows for storage on the graph, and thus the deduplication, disambiguation, and reasoning of data coming from third party services.
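  • A transform of this kind can be sketched as a mapping between record fields and predicates. The constructor arguments, the method names, the example field names and the predicate names (worksFor, basedIn) are hypothetical; a real implementation would take its predicates from the ontology.

```python
class ServiceTransform:
    """Maps one service's record format to and from (subject, predicate, object) triples."""

    def __init__(self, id_field, field_to_predicate):
        self.id_field = id_field
        self.field_to_predicate = field_to_predicate

    def to_triples(self, record):
        subject = record[self.id_field]
        return [
            (subject, predicate, record[name])
            for name, predicate in self.field_to_predicate.items()
            if name in record
        ]

    def from_triples(self, triples):
        predicate_to_field = {p: f for f, p in self.field_to_predicate.items()}
        record = {}
        for subject, predicate, value in triples:
            record[self.id_field] = subject
            if predicate in predicate_to_field:
                record[predicate_to_field[predicate]] = value
        return record
```

  • Because the output is a list of plain tuples, triples arriving from different services can be placed in a set to remove exact duplicates before they are stored on the graph.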
  • Service Discovery: The naming of implementing classes of the DataService 274/DataRepository 276 interface allows for services to declare what task slots they can fill. For instance, if a given repository, say Companies House, contains data about companies, then an implementing class would be called CompaniesHousePeopleRepository. In this way a given service can be dynamically associated with a task slot.
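  • Name-based discovery can be sketched as follows. The sketch assumes class names follow a '<Source><Slot>Repository' pattern in which the slot type is the last capitalised word before 'Repository'; this pattern is inferred from the single example name above and the helper functions are hypothetical.

```python
import re


class CompaniesHousePeopleRepository:
    """Example implementing class; its name declares the source and slot type."""

    def read(self, query):
        return []  # placeholder: a real class would query the service


def slot_type_from_class(cls):
    """Return the slot type encoded in a '<Source><Slot>Repository' class name."""
    words = re.findall(r"[A-Z][a-z0-9]*", cls.__name__)
    if len(words) < 2 or words[-1] != "Repository":
        return None
    return words[-2].lower()


def discover(repositories, slot_type):
    """Return the repository classes that declare they can fill slot_type."""
    return [repo for repo in repositories if slot_type_from_class(repo) == slot_type]
```

  • A convention of this kind requires no registry, but it only works for slot types that were anticipated when the classes were named.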
  • definitions created in ServiceTransform 275 classes are strictly checked. This allows for programmatic definitions to be generated or inferred. This significantly reduces development time as suggestions and changes can be done automatically.
  • the system self adapts in time.
  • the service discovery mechanism works for a set of pre-defined tasks, but is inflexible with respect to creating new task slots over time. For this, a more nuanced mechanism of dynamic association may be used.
  • Fingerprinting
  • Figure 28 shows a graph of page visits for a user. Each time a user visits pages, their activity is recorded as a graph on the knowledge base 300. This graph of user activity is useful for characterising user behaviour.
  • Figure 29 shows a vector transformation ('fingerprint') of the graph of Figure 28.
  • the vectors within this transformation preserve the distance between graphs: similar graphs having vectors close to each other.
  • an additional dummy node 291 (with text "TOP") is added to the graph that is connected to all the other nodes.
  • the similarity between graphs is computed by making a neural net predict the user that generated a specific graph.
  • the architecture is shown in Figure 30.
  • the input to this neural net is the Doc2Vec embeddings and the adjacency matrix of the relevant graph and the output is the logits of the user IDs.
  • the graph embeddings can be found in the last layer of the network.
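  • The graph preparation and the idea of comparing fingerprints can be sketched as below. This is not the network of Figure 30: the trained neural net is replaced by a single untrained aggregation step at the TOP node, the small feature vectors stand in for Doc2Vec embeddings, and cosine similarity is an assumed choice of distance. Only the added TOP node and the use of an adjacency matrix come from the description above.

```python
import math


def adjacency_with_top(nodes, edges):
    """Adjacency matrix with an extra 'TOP' node linked to every other node."""
    names = list(nodes) + ["TOP"]
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    adj = [[0.0] * size for _ in range(size)]
    for a, b in edges:
        adj[index[a]][index[b]] = adj[index[b]][index[a]] = 1.0
    top = index["TOP"]
    for i in range(size - 1):
        adj[i][top] = adj[top][i] = 1.0
    return names, adj


def fingerprint(adj, features):
    """One aggregation step: the TOP row sums the features of all page nodes.

    features: one vector per row of adj (page embeddings, then a zero
    vector for TOP). Returns the aggregated vector at the TOP node.
    """
    top_row = adj[-1]
    dim = len(features[0])
    return [sum(top_row[i] * features[i][d] for i in range(len(adj))) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

  • Because the TOP node is connected to every page node, a single vector read at that node summarises the whole graph, which is one plausible reason for adding such a node.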
  • Figure 31 shows an example of a computer device suitable for implementing the system 100 (at least in part).
  • the computer device 1000 comprises a processor in the form of a CPU 1002, a communication interface 1004, a memory 1006, storage 1008, removable storage 1010 and a user interface 1012 coupled to one another by a bus 1014.
  • the user interface 1012 comprises a display 1016 and an input/output device, which in this embodiment is a keyboard 1018 and a mouse 1020. In other embodiments, the input/output device comprises a touchscreen.
  • the CPU 1002 executes instructions, including instructions stored in the memory 1006, the storage 1008 and/or removable storage 1010.
  • the communication interface 1004 is typically an Ethernet network adaptor coupling the bus 1014 to an Ethernet socket.
  • the Ethernet socket is coupled to a network.
  • the memory 1006 stores instructions and other information for use by the CPU 1002.
  • the memory 1006 is the main memory of the computer device 1000. It usually comprises both Random Access Memory (RAM) and Read Only Memory (ROM).
  • the storage 1008 provides mass storage for the computer device 1000. In different implementations, the storage 1008 is an integral storage device in the form of a hard disk device, a flash memory or some other similar solid state memory device, or an array of such devices.
  • the removable storage 1010 provides auxiliary storage for the computer device 1000.
  • the removable storage 1010 is a storage medium for a removable storage device, such as an optical disk, for example a Digital Versatile Disk (DVD), a portable flash drive or some other similar portable solid state memory device, or an array of such devices.
  • the removable storage 1010 is remote from the computer device 1000, and comprises a network storage device or a cloud-based storage device.
  • the system 100 is implemented as a computer program product, which is stored, at different stages, in any one of the memory 1006, storage device 1008, and removable storage 1010.
  • the storage of the computer program product is non-transitory, except when instructions included in the computer program product are being executed by the CPU 1002, in which case the instructions are sometimes stored temporarily in the CPU 1002 or memory 1006.
  • the removable storage 1010 is removable from the computer device 1000, such that the computer program product may be held separately from the computer device 1000 from time to time.
  • the computer program product may also or alternatively be distributed, such that only certain aspects of the computer program product are stored and/or implemented via the computer device.
  • the user may use the communication interface 1004 to access information sources using the internet, which may be incorporated into a database/graph held in storage.
  • the database/graph may be saved remotely, for example via a "cloud server", in which case the computer device is effectively used as a controller for the system.
  • a user telecommunication device such as a "smartphone" may be used.
  • the system may be arranged to dynamically present newly acquired relevant information to the user, as previously mentioned, and in addition to contextually provide current information in response to the user's current task/workflow.
  • the data fields that a user sees in a task card may dynamically change in response to the user's current task/workflow.
  • Although the present invention (and in particular, the related entity search/predictive search engine aspects) has generally been described with reference to a research task, particularly in the field of recruitment, it will be appreciated that the invention may be applied to any field in which a user acquires information via the internet and/or one or more databases.
  • the system may be able to assist a user in baking a cake by capturing information related to various alternative recipes and ingredients, reminding the user about previously researched recipes, and predictively suggesting new recipes.
  • the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
  • any feature in a particular aspect described herein may be applied to another aspect, in any appropriate combination.
  • particular combinations of the various features described and defined in any aspects described herein can be implemented and/or supplied and/or used independently.

Abstract

The present invention relates to a computer-implemented method of processing information during a task performed by a user, the method comprising the following steps: extracting information from at least one information source accessed by a user during a task; identifying an entity and/or a property associated with an entity from said extracted information; associating the identified entity and/or property associated with an entity with a stored database of entities and properties so as to update the database; in response to a user query regarding a particular entity, retrieving information relevant to the particular entity from the database; and providing said information relevant to the particular entity to the user.
PCT/GB2018/051935 2017-07-07 2018-07-06 Digital information capture and retrieval WO2019008394A1 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GB1710995.0 2017-07-07
GBGB1710993.5A GB201710993D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval
GBGB1710997.6A GB201710997D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval
GB1710997.6 2017-07-07
GB1710993.5 2017-07-07
GBGB1710995.0A GB201710995D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval

Publications (1)

Publication Number Publication Date
WO2019008394A1 true WO2019008394A1 (fr) 2019-01-10

Family

ID=62976081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/051935 WO2019008394A1 (fr) 2017-07-07 2018-07-06 Digital information capture and retrieval

Country Status (1)

Country Link
WO (1) WO2019008394A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080109285A1 (en) * 2006-10-26 2008-05-08 Mobile Content Networks, Inc. Techniques for determining relevant advertisements in response to queries
WO2013126808A1 (fr) * 2012-02-22 2013-08-29 Google Inc. Related entities
US20150106157A1 (en) * 2013-10-15 2015-04-16 Adobe Systems Incorporated Text extraction module for contextual analysis engine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Graph database - Wikipedia", 3 July 2017 (2017-07-03), XP055503933, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Graph_database&oldid=788828526> [retrieved on 20180903] *
ANONYMOUS: "Knowledge Graph - Wikipedia", 26 June 2017 (2017-06-26), XP055503980, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Knowledge_Graph&oldid=787595938> [retrieved on 20180903] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
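CN111651989A (zh) * 2020-04-13 2020-09-11 上海明略人工智能(集团)有限公司 Named entity recognition method and apparatus, storage medium and electronic apparatus
CN111651989B (zh) * 2020-04-13 2024-04-02 上海明略人工智能(集团)有限公司 Named entity recognition method and apparatus, storage medium and electronic apparatus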

Similar Documents

Publication Publication Date Title
US20210019341A1 (en) Implementing a software action based on machine interpretation of a language input
US10235681B2 (en) Text extraction module for contextual analysis engine
US9990422B2 (en) Contextual analysis engine
US10430806B2 (en) Input/output interface for contextual analysis engine
JP4920023B2 (ja) Method and system for calculating an inter-object competition index
US8712990B2 (en) Methods and systems for providing a business repository
US20150142423A1 (en) Phrase-based data classification system
US20220309037A1 (en) Dynamic presentation of searchable contextual actions and data
JP2021529385A (ja) System and method for investigating relationships between entities
US20170103439A1 (en) Searching Evidence to Recommend Organizations
US9069862B1 (en) Object-based relationship search using a plurality of sub-queries
US20150127688A1 (en) Facilitating discovery and re-use of information constructs
AU2016228246B2 (en) System and method for concept-based search summaries
US11416907B2 (en) Unbiased search and user feedback analytics
JP2022505837A (ja) Knowledge retrieval system
US11514124B2 (en) Personalizing a search query using social media
US11269894B2 (en) Topic-specific reputation scoring and topic-specific endorsement notifications in a collaboration tool
US10409866B1 (en) Systems and methods for occupation normalization at a job aggregator
US11886477B2 (en) System and method for quote-based search summaries
Geiger Personalized task recommendation in crowdsourcing systems
Paydar et al. A semi-automated approach to adapt activity diagrams for new use cases
US20150363803A1 (en) Business introduction interface
WO2019008394A1 (fr) Digital information capture and retrieval
JP2020067864A (ja) Knowledge search device, knowledge search method, and knowledge search program
Drăgan et al. The semantic desktop at work: interlinking notes

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18743062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18743062

Country of ref document: EP

Kind code of ref document: A1