GB2589608A - Recommender system for document-based search - Google Patents

Recommender system for document-based search

Info

Publication number
GB2589608A
Authority
GB
United Kingdom
Prior art keywords
document
data items
recommender system
named entities
retrieved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1917709.6A
Other versions
GB201917709D0 (en
Inventor
Blume Till
Lorenz Robert
Simic Ilija
Wiese Michael
Grunewald Paul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technische Universitaet Dresden
Ernst and Young GmbH
Original Assignee
Technische Universitaet Dresden
Ernst and Young GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Technische Universitaet Dresden, Ernst and Young GmbH
Priority to GB1917709.6A
Publication of GB201917709D0
Publication of GB2589608A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A recommender system comprising a server arrangement configured to receive an input from a user device via a communication network, where the input is a first document in a first format to initiate a document-based search. The server arrangement extracts textual information from the first document independent of permanent storage of the first document or the textual information in the first document. A boundary of each named entity of a plurality of named entities is determined from the extracted textual information. The plurality of named entities is classified into a set of predefined classes. A set of data items is retrieved in a plurality of different formats from a plurality of different online or offline data sources. Display of the retrieved set of data items is controlled on the user device such that a visual relevancy of the retrieved set of data items with respect to the first document is discernible. The server arrangement may be arranged to determine co-occurrence and relative position of the plurality of named entities within the first document from the extracted textual information.

Description

RECOMMENDER SYSTEM FOR DOCUMENT-BASED SEARCH
TECHNICAL FIELD
The present disclosure relates generally to search platforms and technologies; and more specifically to a recommender system for a document-based search.
BACKGROUND
With advancements in computer and communication technologies, and data sharing platforms, there has been a rapid increase in the amount of published information. For example, an enormous volume of heterogeneous content, such as scientific publications, videos, tutorials, social media posts, etc., is published almost on a daily basis. The amount far exceeds what users may read in their lifetime. With the rapid increase in the amount of heterogeneous content and in the different data sources from which such content may be accessed, users typically struggle to be aware of what available knowledge may be relevant to their interest. Currently, many search engines and platforms are available to search for relevant content. However, in certain scenarios, a user may have a document and may not know exactly what words to select from the document to search for, or how to formulate an effective search query. This is more common in explorative tasks such as learning or investigating a new topic. As a result, a large number of documents comprising both relevant and irrelevant information, which may or may not be of user-interest, are typically retrieved and displayed based on the search query, and these are time-consuming to visualize, read, and grasp. Moreover, conventional search platforms and technologies use dictionary-based approaches and rule-based approaches, which are time consuming and inefficient, and may lead to irrelevant searches, i.e., they are not suitable when the search input is a document. Furthermore, there are certain conventional methods that use supervised machine learning for entity detection. However, such conventional methods require fully annotated documents and a set of features to train models on large domain-specific text corpora, which is practically non-sustainable in view of the enormous volume of heterogeneous content, which is unstructured. Further, deep learning methods typically require a large amount of labelled data for supervised learning and take more time and computing resources to train than the classical machine learning methods.
In certain other scenarios, the conventional search technologies have privacy issues, as they require storage of the original document in the system server, which may compromise the security of certain documents and make them accessible to an unauthorized person. Additionally, as a result of the limitations associated with conventional techniques, the process involved in search query formulation and in retrieval of documents based on the search query is resource intensive. For example, retrieval of a large amount of information makes the search and retrieval process significantly processor cycle and memory intensive. Moreover, irrelevant information occupies an unnecessarily high amount of space in a temporary storage device (e.g. RAM), resulting in unavailability of RAM for performing other tasks, which in turn adversely affects the inherent computational capability of a user device.
Therefore, in the light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with conventional recommender systems, search platforms, and technologies.
SUMMARY
The present disclosure seeks to provide a recommender system for a document-based search. The present disclosure seeks to provide a solution to the existing problem of privacy issues associated with document-based search and the problem of irrelevant results when the starting point to search is a document. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides an efficient, reliable, and quick document-based search platform that avoids permanent storage of the documents used in the search process.
In one aspect, an embodiment of the present disclosure provides a recommender system comprising: a server arrangement configured to:
- receive an input from a user device via a communication network, wherein the input is a first document in a first format to initiate a document-based search;
- extract textual information from the first document independent of permanent storage of the first document or the textual information in the first document;
- determine a boundary of each named entity of a plurality of named entities from the extracted textual information based on a combination of a statistical system and a rule-based system;
- classify the plurality of named entities into a set of predefined classes, wherein each predefined class of the set of predefined classes indicates a type of the named entity;
- retrieve a set of data items in a plurality of different formats from a plurality of different online or offline data sources, based on at least the classified plurality of named entities; and
- control display of the retrieved set of data items on the user device such that a visual relevancy of the retrieved set of data items with respect to the first document is discernible.
Embodiments of the present disclosure substantially eliminate or at least partially address the aforementioned problems in the prior art, and provide an advanced, secure, and convenient recommender system for the document-based search that operates an end-to-end workflow starting from a document presented for search to final display of the search results for visualization.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative embodiments construed in conjunction with the appended claims that follow.
It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a block diagram of a recommender system, in accordance with an embodiment of the present disclosure; and
FIG. 2 illustrates an exemplary user interface rendered on a user device displaying a set of data items, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
In one aspect, an embodiment of the present disclosure provides a recommender system comprising: a server arrangement configured to:
- receive an input from a user device via a communication network, wherein the input is a first document in a first format to initiate a document-based search;
- extract textual information from the first document independent of permanent storage of the first document or the textual information in the first document;
- determine a boundary of each named entity of a plurality of named entities from the extracted textual information based on a combination of a statistical system and a rule-based system;
- classify the plurality of named entities into a set of predefined classes, wherein each predefined class of the set of predefined classes indicates a type of the named entity;
- retrieve a set of data items in a plurality of different formats from a plurality of different online or offline data sources, based on at least the classified plurality of named entities; and
- control display of the retrieved set of data items on the user device such that a visual relevancy of the retrieved set of data items with respect to the first document is discernible.
The present disclosure provides the aforementioned recommender system for document-based search. The recommender system provides a secure and effective search platform to a user. The disclosed recommender system provides a complete end-to-end service, in which, starting from a document (e.g. a PDF or other text-containing file), named entities are determined, followed by classification into a set of predefined classes, such as person, organization and location, and enhanced visualization, where all the operations are done on-the-fly, without storing any information from the original file. In this way, sensitive documents, such as contracts or other restricted information, are processed while satisfying privacy requirements. The recommender system integrates a large amount of offline and online resources in a plurality of formats that may include, but are not limited to, text documents, videos, images, audios and social media data.
The recommender system provides techniques for efficient and accurate retrieval of data items that are of user-interest to a given user. Moreover, the display of the retrieved set of data items is clutter free and controlled such that a visual relevancy of the retrieved set of data items with respect to the document used for search is discernible. This enables a user to conveniently visualize and quickly grasp the relevant information retrieved from different data sources. For example, the recommender system may use various kinds of resources, including videos, documents, and social media posts, to retrieve and display the set of data items. Furthermore, the recommender system is comparatively less computationally intensive and requires less storage space, as only a small chunk of relevant information occupies the storage space of a memory (e.g. a random-access memory). Consequently, random-access memory remains available for performing other tasks of the server arrangement. The operations of the server arrangement make the recommender system run more efficiently, securely, and effectively as a computer, and the clutter-free, controlled display reduces the number of user interactions required to select and visualize a correct document which is relevant to the first document.
The recommender system comprises a server arrangement configured to receive an input from a user device via a communication network, wherein the input is a first document in a first format to initiate a document-based search. Throughout the present disclosure, "the recommender system" refers to a system that is a collection of one or more interconnected programmable and/or non-programmable components configured to provide recommendations of the set of data items which are intended to be of user-interest for a given user. Examples include programmable and/or non-programmable components, such as processors, memories, network interfaces, connectors, and the like. In an example, the recommender system provides a set of data items, such as relevant research documents, based on the first document. Optionally, the recommender system provides a search platform that automatically presents relevant data items (e.g. videos and documents) as recommendations in accordance with the user-interest of corresponding users.
Throughout the present disclosure, the term "server arrangement" refers to an arrangement of one or more servers that includes one or more processors configured to perform various operations for the recommender system. As an example, the server arrangement may further include components such as a memory, a network interface, a system bus, and the like, to store and process information pertaining to the search. Furthermore, it should be appreciated that the server arrangement may be a single hardware server and/or a plurality of hardware servers operating in a parallel or distributed architecture. The first document in the first format is submitted via the user interface and is used as input by the server arrangement to initiate the document-based search. For example, a portable document format (PDF) file or another text-containing file may be the first document to begin the search process. The user device may include, but is not limited to, a smart phone, a client device, a laptop, a tablet computer, and other computing devices. Additionally, the user device includes a casing, a memory, a processor, a network interface card, a microphone, a speaker, a keypad, and a display. It will be appreciated that "user" refers to any entity including a person (i.e., human being), an organization (i.e. a company, university, and the like), or a virtual personal assistant (an autonomous program or a bot) using the user device and/or system described herein. The user device is communicatively coupled to the server arrangement via the communication network. Examples of the communication network may include, but are not limited to, a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet. Additionally, the communication network includes wired or wireless communication that can be carried out via any number of known protocols, including, but not limited to, Internet Protocol (IP), Light Fidelity (Li-Fi), Wireless Application Protocol (WAP), Frame Relay, or Asynchronous Transfer Mode (ATM).
The server arrangement is configured to extract textual information from the first document independent of permanent storage of the first document or the textual information in the first document. The server arrangement automatically extracts the textual information from the first document without permanently storing the first document. In other words, the extraction of the textual information from the first document is executed without permanently uploading the first document to, or storing it in, the server arrangement of the recommender system. The extraction of the textual information from the first document is achieved using a background service. For example, the textual information from an image may be extracted by an Optical Character Recognition (OCR) method that electronically converts an image of a document, or text in an image, into machine-encoded text. As a further example, the textual information may be extracted directly from a portable document format (PDF) file, a word document, or any text-containing file. Additionally, the extraction of textual information is executed in real time or near real time. However, if the first document is uploaded and integrated into the recommender system's database, advanced disambiguation techniques are applied to further improve the disambiguation and security.
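The idea of extracting text without permanent storage can be illustrated with a minimal sketch. It assumes the upload arrives as raw bytes of a plain-text file; the function name `extract_text_in_memory` is hypothetical, and PDF parsing or OCR would need additional libraries that are not shown here.

```python
import io


def extract_text_in_memory(payload: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode an uploaded text document held only in RAM."""
    # BytesIO keeps the upload in memory; nothing is written to disk.
    with io.BytesIO(payload) as buffer:
        raw = buffer.read()
    # Replace undecodable bytes instead of failing on a partly corrupt upload.
    text = raw.decode(encoding, errors="replace")
    # Collapse runs of whitespace so later entity detection sees clean text.
    return " ".join(text.split())
```

Once the function returns, the buffer is closed and only the extracted string remains in memory for the subsequent steps.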
The server arrangement is further configured to determine a boundary of each named entity of a plurality of named entities from the extracted textual information based on a combination of a statistical system and a rule-based system. The plurality of named entities may include, but is not limited to, the name of a person, an organization and a location. The boundary of each named entity of the plurality of named entities is determined by a Named Entity Recognition (NER) method, which is also known as entity identification, entity chunking and entity extraction. The NER method is used to locate each named entity, for example person names, organizations, locations and the like, in the extracted textual information.
For example, if the input given to the recommender system is "Architect Irving Morrow constructed the Golden Gate Bridge", then the server arrangement may determine two named entities with their boundaries as output, such as "Irving Morrow" and "Golden Gate Bridge". The combination of the statistical and the rule-based approaches is used to develop a rule set to disambiguate the extracted plurality of named entities.
The statistical approach includes an exploratory data retrieval method which uses predictive models to reveal patterns and trends in the textual information extracted from the first document. Additionally, the rule-based approach is used to extract the plurality of named entities by setting formal rules. The formal rules may represent a full scientific model or represent local patterns in the extracted plurality of named entities. Thus, the statistical approach and the rule-based approach, when combined, result in the most relevant and appropriate search. In an example, each sentence is segregated into parts of speech, such as noun, pronoun, subject, verb, predicate etc. This further facilitates identification of named entities.
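A minimal sketch of combining a rule with a statistic for boundary detection is given below. It is an illustration only, not the disclosed method: the rule treats a run of capitalised words as a candidate, and the statistic (lowercase frequency in some background text) trims leading words that behave like ordinary nouns. The function name and the background text are assumptions.

```python
import re
from collections import Counter


def entity_boundaries(text, background_text):
    """Return (start, end, surface) spans for candidate named entities."""
    # Statistical part: count words that occur in lowercase in background text.
    lower_counts = Counter(re.findall(r"\b[a-z]+\b", background_text))
    spans = []
    # Rule part: a run of capitalised words is a candidate entity.
    for match in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text):
        words = list(re.finditer(r"[A-Za-z]+", match.group()))
        # Trim leading words that are frequent ordinary (lowercase) words.
        while words and lower_counts[words[0].group().lower()] > 0:
            words.pop(0)
        if words:
            start = match.start() + words[0].start()
            spans.append((start, match.end(), text[start:match.end()]))
    return spans
```

With the sentence from the example above and a background text in which "architect" appears in lowercase, the leading job title is trimmed and the two spans "Irving Morrow" and "Golden Gate Bridge" remain.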
In an exemplary implementation, a concept language model is employed by the server arrangement to determine the boundaries of the plurality of named entities. The concept language model is built using a plurality of entities present in a thesaurus and/or already available information from different data sources. The concept language model comprises likely characteristic words of a particular entity. For example, the likely characteristic words for the concept "car" are "car", "vehicle", "transport".
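As a rough illustration, a concept language model can be thought of as a mapping from each concept to its characteristic words, against which text is scored. The sketch below is a toy under stated assumptions: only the "car" entry comes from the description, while the "bridge" entry, the names `CONCEPT_MODEL` and `score_concepts`, and the simple count-based scoring are hypothetical.

```python
from collections import Counter

# Toy model: the "car" entry follows the description; "bridge" is invented.
CONCEPT_MODEL = {
    "car": {"car", "vehicle", "transport"},
    "bridge": {"bridge", "span", "crossing"},
}


def score_concepts(text, model=CONCEPT_MODEL):
    """Count how many characteristic words of each concept occur in the text."""
    tokens = Counter(text.lower().split())
    return {
        concept: sum(tokens[word] for word in words)
        for concept, words in model.items()
    }
```

A higher score suggests that the surrounding text is about that concept, which could then inform where an entity mention begins and ends.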
The server arrangement is configured to classify the plurality of named entities into a set of predefined classes, wherein each predefined class of the set of predefined classes indicates a type of the named entity.
Classification of the plurality of named entities into the set of predefined classes may be determined by use of a Named Entity Recognition and Classification (NERC) method. The NERC method is practical and domain-independent, and automatically classifies the plurality of named entities from the extracted textual information with high accuracy. For example, if the input given to the recommender system is "Architect Irving Morrow constructed the Golden Gate Bridge", then the recommender system may assign two predefined classes as output, such as "Irving Morrow: Person, Golden Gate Bridge: Location".
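The classification step can be sketched with a small rule-based classifier. This is a simplified stand-in for a NERC method, not the disclosed one; the cue-word sets and the function name `classify_entity` are hypothetical.

```python
# Hypothetical cue words; a real system would use far richer evidence.
PERSON_CUES = {"architect", "dr", "professor", "mr", "ms"}
LOCATION_SUFFIXES = {"bridge", "street", "city", "river"}
ORGANIZATION_WORDS = {"university", "gmbh", "inc", "institute"}


def classify_entity(entity, preceding_word=""):
    """Assign one of the predefined classes to an entity mention."""
    words = [w.lower() for w in entity.split()]
    if not words:
        return "Unknown"
    if any(w in ORGANIZATION_WORDS for w in words):
        return "Organization"
    if words[-1] in LOCATION_SUFFIXES:
        return "Location"
    # A title or profession directly before the mention suggests a person.
    if preceding_word.lower().strip(".") in PERSON_CUES:
        return "Person"
    return "Unknown"
```

For the example sentence, `classify_entity("Irving Morrow", "Architect")` yields "Person" and `classify_entity("Golden Gate Bridge", "the")` yields "Location".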
According to an embodiment, the server arrangement is configured to determine co-occurrence and relative position of the plurality of named entities within the first document, from the extracted textual information. In an example, each named entity of the plurality of named entities is associated with a weight. The value of the weight of the named entity is determined based on co-occurrence and relative position of the plurality of named entities within the first document.
According to an embodiment, the server arrangement is configured to calibrate a relationship among the classified plurality of named entities based on at least the determined co-occurrence and the relative position of the plurality of named entities within the first document. The relationships among the classified named entities (for example a person, an organization or a location) are measured according to the weight of each entity in the plurality of named entities. For example, if the terms "java" and "The University of California" occur multiple times in a document, then the weighted relationship between them is higher where they occur in the same sentence (i.e. close co-occurrence) and lower where they occur in different sentences.
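One way to picture such a weighting is the sketch below, which scores each pair of entities by sentence distance. The formula 1 / (1 + distance), the naive sentence splitter, and the function name are assumptions chosen for illustration; the description does not specify how the weights are computed.

```python
import re
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(text, entities):
    """Weight entity pairs: same sentence counts 1.0, farther apart counts less."""
    # Naive splitter: breaks after ., ! or ? (abbreviations are not handled).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    positions = {
        e: [i for i, s in enumerate(sentences) if e.lower() in s.lower()]
        for e in entities
    }
    weights = defaultdict(float)
    for a, b in combinations(entities, 2):
        for i in positions[a]:
            for j in positions[b]:
                weights[(a, b)] += 1.0 / (1 + abs(i - j))
    return dict(weights)
```

Under this assumed formula, two entities in the same sentence contribute 1.0 per co-occurrence, while entities in adjacent sentences contribute 0.5, matching the higher and lower relationship described above.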
The server arrangement is configured to retrieve a set of data items in a plurality of different formats from a plurality of different online or offline data sources, based on at least the classified plurality of named entities. The set of data items are the documents, in the plurality of different formats, that are used by the recommender system platform for the document-based search. For example, the plurality of different formats may include, but is not limited to, a word file, a PDF file, videos, audios, image formats, and other text-containing files or tagged files. The first document is compared and analysed with the set of data items for the search by the recommender system.
According to an embodiment, the set of data items in the plurality of different formats comprises at least two of: a text-containing document, a video, an audio, an image file, a social media item, a web-page, or another document format. For the video format, a Visual Analysis Service is used which converts the audio in a video into a text file (i.e. a transcript of the video). The text data converted using the Visual Analysis Service may be indexed so that it is easily retrieved by the recommender system.
Moreover, an image file is converted into a text file. Optical character recognition may be used for conversion of the image file into the text file. The image file may include, but is not limited to, a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or subtitle text superimposed on an image. The social media item is extracted from different social network platforms. The social media item may be in a text, a video, an audio format or another format. The web-page format, for example, may include, but is not limited to, a Uniform Resource Locator (URL) or a Hypertext Transfer Protocol (HTTP).
According to an embodiment, in case of videos that are retrieved as part of the set of data items, instead of retrieving the complete video, only a segment of the video (having the relevant concept) that is relevant to the classified plurality of named entities in the first document, based on their weightage, is retrieved. This not only makes the retrieval efficient, but also saves the time of a user, who is likely to view the retrieved segment instead of the entire video. For example, the highest weighted named entities in the first document (which was used to initiate the document-based search) may be "5G antenna systems" and "millimetre wave communication". A video that is relevant and to be retrieved on-the-fly may be of one-hour duration and cover many antenna systems, such as 3G, 4G, and also 5G. However, only one segment of the video, for example from 25 min to 35 min, may specifically describe the 5G antenna system capable of millimetre wave communication; thus, only that segment is retrieved or marked for further display and visualization in the next step, instead of the entire one-hour video. The visual analysis service of the recommender system executes a visual analysis of each relevant video and indexes the transcript as well as visual scenes (e.g. a visual concept), so that an action or a concept in a given shot or scene is tagged automatically and retrieved.
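Segment selection over an indexed transcript can be sketched as follows. The sketch assumes the transcript is already available as (start minute, end minute, text) tuples and that entity weights were computed earlier; the function name `best_segment` and the substring-based scoring are illustrative assumptions.

```python
def best_segment(transcript, entity_weights):
    """Pick the transcript segment whose text best matches the weighted entities.

    transcript: list of (start_min, end_min, text) tuples (assumed format).
    entity_weights: dict mapping entity string to its weight.
    """
    def score(text):
        lowered = text.lower()
        return sum(w for e, w in entity_weights.items() if e.lower() in lowered)

    scored = [(score(text), start, end) for start, end, text in transcript]
    best = max(scored, key=lambda item: item[0], default=(0, None, None))
    # Return None when no segment mentions any weighted entity.
    return (best[1], best[2]) if best[0] > 0 else None
```

For a one-hour video whose middle segment discusses 5G antenna systems and millimetre wave communication, the function returns only that segment's start and end times rather than the whole video.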
According to an embodiment, the plurality of different online data sources corresponds to a plurality of web-based platforms from which data items are retrieved on-the-fly, and the plurality of offline data sources corresponds to prestored data items in the plurality of different formats in the recommender system. Web-based platforms may include, but are not limited to, a social media platform, online published journals (for example The Institute of Electrical and Electronics Engineers, ScienceDirect and the like) or search engines. Furthermore, the data items are retrieved from online data sources in real time or near real time. Prestored data items are the documents which are stored in the server arrangement of the recommender system in different formats. For example, different formats of prestored data items may include, but are not limited to, a text, a video, an image or an audio.
In an embodiment, the server arrangement is further configured to retrieve the set of data items based on parsing of a title, or a combination of the title and an abstract, of a data item of the set of data items, in absence of full-text of the data item. The full text of a given data item may not be readily available for various reasons, such as a need for user login or a need for payment for accessing the full text. However, in such a case, the title and/or the abstract of the given data item is readily available.
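The fallback from full text to title and abstract can be expressed as a small helper. The dictionary keys `full_text`, `title` and `abstract` and the function name `searchable_text` are hypothetical; the description does not define a data item schema.

```python
def searchable_text(item):
    """Return the text to parse for a data item, preferring the full text."""
    full_text = item.get("full_text")
    if full_text:
        return full_text
    # Full text unavailable (e.g. behind a login or paywall): fall back to
    # the title, combined with the abstract when one is present.
    parts = [item.get("title", ""), item.get("abstract", "")]
    return " ".join(part for part in parts if part)
```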
In an embodiment, when the set of data items is retrieved, related named entities and metadata associated with each of the retrieved set of data items are also acquired. Along with the set of data items, such additional information, such as the related named entities and the associated metadata, is used later for generating visualization.
The server arrangement is configured to control display of the retrieved set of data items on the user device such that a visual relevancy of the retrieved set of data items with respect to the first document is discernible. In an example, the retrieved set of data items is displayed as a recommendation on a user interface rendered on a display screen of the user device. In an example, the user interface may be a web-based interface or an application interface, such as a widget. The term "display screen" refers to a structure including an arrangement of interconnected programmable and/or non-programmable components that are configured to receive the set of data items and present the set of data items. Optionally, the display screen may display additional filters, such as concept terms, to enable a user to select certain concepts for filtering of search results. It will be appreciated that a user interface on the display screen enables the user to interact (such as click, read) with the set of data items. Optionally, the user interface comprises a separate section for providing recommendations. The visual relevancy of the retrieved set of data items refers to displaying the set of data items such that the relevancy of the displayed search results to the first document is easily discernible and understood from the displayed view. In an example, a list of conceptual terms (i.e. classified and highly weighted named entities from the document) may also be displayed on the user interface. A user may select a particular concept from the displayed list of conceptual terms, and the search results already displayed are updated and filtered further on-the-fly without the need to re-run or execute the search process from the beginning.
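Filtering already displayed results by a selected concept, without re-running the search, can be sketched as a simple pass over the results held by the interface. The result structure (a dict with an `entities` list) and the function name `filter_displayed` are assumptions for illustration.

```python
def filter_displayed(results, selected_concept):
    """Keep only displayed results tagged with the selected concept term."""
    wanted = selected_concept.lower()
    # Works on the results already held by the interface; no new search runs.
    return [
        result for result in results
        if wanted in {entity.lower() for entity in result["entities"]}
    ]
```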
In an embodiment, along with the retrieved set of data items (i.e. the search results), the related named entities and the metadata associated with the retrieved set of data items are rendered on the display screen as a plurality of nodes. A relationship among the plurality of nodes is rendered as links in a node-link representation. For example, the plurality of nodes may include a node displaying a name or a location of the publication of a retrieved data item of the retrieved set of data items, a node representative of a name of the author(s) of the retrieved data item, a node representative of an affiliation of the author(s) (e.g. a research institute the author(s) is associated with), and a node representative of a date and time of submission or publication of the data item (e.g. time related information associated with the retrieved data item). Optionally, other related named entities present in the retrieved set of documents that are relevant to the extracted and classified named entities within the input document that was used for document-based search are potentially also displayed as one or more nodes. The plurality of nodes is displayed in the form of a visual representation that defines visual relevancy between the input document and each of the retrieved set of documents, including related named entities and associated metadata.
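A node-link representation of this kind can be assembled from item metadata as in the sketch below. Nodes are modelled as (kind, label) tuples and links as pairs of nodes; this modelling, the metadata keys, and the function name `build_node_link` are assumptions, not part of the disclosure.

```python
def build_node_link(items):
    """Build nodes and links from retrieved items and their metadata."""
    nodes, links = set(), set()
    for item in items:
        item_node = ("item", item["title"])
        nodes.add(item_node)
        for kind in ("author", "affiliation", "publication", "date"):
            values = item.get(kind, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                node = (kind, value)
                nodes.add(node)
                # Each metadata node is linked to the item it describes.
                links.add((item_node, node))
    return nodes, links
```

Because nodes are stored in a set, two items that share an author or a publication automatically link to the same node, which is what makes relationships between results visible.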
Moreover, the disclosed recommender system provides an aggregation function and a navigation function. In the aggregation function, the plurality of nodes is potentially aggregated (e.g. grouped or selected) by a user to analyse different attributes and their distribution. Optionally, the server arrangement is further configured to automatically aggregate one or more particular regions of the rendered visual representation (e.g. a visual interactive graph). In the navigation function, the server arrangement allows a user to explore and identify new nodes of interest by using an overview representation of a sub-graph (i.e. a portion of the visual representation) surrounding a particular node. Thus, a user is able to perform a focused and interest-driven navigation through the visual representation based on an interaction model of the rendered visual representation (e.g. a visual interactive graph) that enables visual graph querying and further information discerning depending on node aggregation and relation properties of the node with other nodes. In an example, the display screen enables a user to select one or more nodes of one or more data items to further filter the retrieved data items, or to explore and identify new nodes of interest. In an example, the user may select a node that represents a specific publication name of a data item. In such an example, new nodes are displayed that link to relevant data items published under the same publication name. In another example, the user may select a node that represents a named entity (or a topic or a concept term) and aggregate it with another node that represents an author name. In such an example, relevant data items that have the same author name and contain the selected named entity are displayed on the display screen.
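For illustration only, the navigation function may be sketched as a breadth-first traversal that returns the sub-graph surrounding a selected node, and the aggregation function as an intersection of the data items linked to each selected node. The functions `neighbourhood` and `items_linked_to_all`, the string node labels and the sample links are hypothetical.

```python
from collections import defaultdict, deque


def neighbourhood(links, start, radius=1):
    """Return all nodes within `radius` hops of `start` (breadth-first)."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == radius:
            continue  # do not expand beyond the requested radius
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def items_linked_to_all(links, selected):
    """Aggregate selected attribute nodes: items linked to every one of them."""
    per_node = [{item for item, attr in links if attr == node}
                for node in selected]
    return set.intersection(*per_node) if per_node else set()


# Assumed links of the form (data item, attribute node).
links = {
    ("Paper A", "author:J. Doe"),
    ("Paper A", "entity:ISA 550"),
    ("Video B", "author:J. Doe"),
    ("Page C", "entity:ISA 550"),
}

print(sorted(neighbourhood(links, "author:J. Doe")))
print(items_linked_to_all(links, ["author:J. Doe", "entity:ISA 550"]))
```

In this sketch, selecting the author node surfaces the two items by that author, while aggregating the author node with the named-entity node leaves only the item that is linked to both.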
Beneficially, the recommender system disclosed in the present disclosure is not restricted to just retrieving the set of data items (i.e. relevant documents) based on the first document (input document). The disclosed recommender system provides a platform for a user to upload his/her own input document (i.e. the first document). The server arrangement of the recommender system then enables comparison of the user-provided input document with existing documents. The input document (and potentially the retrieved set of documents in some embodiments) is analysed to extract named entities, followed by finding relationships that lead to extended information within the document provided by the user. This analysis, together with the enhanced and interactive visualization in the form of the plurality of nodes provided in the visual representation, as described above, differentiates the disclosed recommender system (which may also be referred to as the "MOVING Platform") from existing search engines.
According to an embodiment, in the case of videos displayed as a part of the set of data items, a relevant video is played only from a timepoint that is relevant to the first document, instead of playing the video from its starting point. Thus, the recommender system is very precise in its display of relevant documents that form part of the retrieved set of data items. Optionally, visual relevancy may be set according to a visual choice of viewing the set of data items by the user. In an example, a user prefers to view the set of data items in a preferred order such that videos are displayed first, text-containing documents are displayed second, image files are displayed third, audios are displayed fourth, social media items are displayed fifth and web-pages are displayed sixth.
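The disclosure does not specify how the relevant timepoint is determined. As one possible sketch, assuming that a time-stamped transcript of the video is available, the start time may be taken from the transcript segment that mentions the most named entities of the first document. The ordering by type of data item is likewise shown as a simple sort. All names (`relevant_start_time`, `order_by_type`, `TYPE_ORDER`) and the sample data are hypothetical.

```python
def relevant_start_time(segments, entities):
    """Return the start time (in seconds) of the best-matching segment.

    `segments` is a list of (start_seconds, transcript_text) pairs.
    The segment mentioning the most entities wins; ties go to the
    earliest segment. Returns None if no segment mentions any entity.
    """
    wanted = [entity.lower() for entity in entities]
    best_start, best_hits = None, 0
    for start, text in segments:
        lowered = text.lower()
        hits = sum(1 for entity in wanted if entity in lowered)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return best_start


# Assumed preference order, taken from the example in the text above.
TYPE_ORDER = ["video", "text", "image", "audio", "social media", "web-page"]


def order_by_type(items):
    """Sort (type, title) pairs by the preferred type order (stable sort)."""
    rank = {kind: index for index, kind in enumerate(TYPE_ORDER)}
    return sorted(items, key=lambda item: rank.get(item[0], len(TYPE_ORDER)))


segments = [
    (0.0, "Welcome to the lecture."),
    (95.0, "ISA 550 covers related parties."),
    (240.0, "Related parties and fraud risk under ISA 550."),
]
print(relevant_start_time(segments, ["ISA 550", "related parties", "fraud risk"]))
print(order_by_type([("web-page", "Blog"), ("text", "Paper"), ("video", "Lecture")]))
```

Plain substring matching is used here only to keep the sketch short; an actual embodiment may rely on the extracted and classified named entities instead.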
Optionally, the user interface comprises a first section to enable the user to search for information by entering a search query, a second section to perform a document-based search, and a third section where recommendations are presented to the user. In an example, the user interface comprises a plurality of user interface elements, such as buttons, option selectors, text boxes, and the like, to enable user interaction. Moreover, the set of data items is retrieved from the online and offline data sources of the recommender system. Advanced visualization techniques are used to display the retrieved set of data items on the user device, which enables the user to recognise the relationship between the plurality of key named entities from a client document and their presence and relevance in the set of data items. For example, a concept graph may be used to visualize the plurality of extracted and classified named entities. The concept graph makes it easy to identify which entities are extracted from the client document and which entities are identified in the retrieved set of data items.
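As a simple, non-limiting sketch, the distinction that the concept graph makes visible can be expressed with set operations over the entities extracted from the client document and the entities identified in the retrieved data items. The function `classify_provenance` and the sample entity names are hypothetical.

```python
def classify_provenance(document_entities, retrieved_entities):
    """Split entities by where they were found.

    Returns a dictionary with the entities found only in the client
    document, only in the retrieved data items, and in both.
    """
    doc = set(document_entities)
    ret = set(retrieved_entities)
    return {
        "document_only": doc - ret,
        "retrieved_only": ret - doc,
        "both": doc & ret,
    }


# Assumed sample entities; all names are invented.
groups = classify_provenance(
    ["Acme Ltd", "J. Doe", "Berlin"],
    ["Acme Ltd", "Beta GmbH"],
)
for name, members in groups.items():
    print(name, sorted(members))
```

Each group could then be rendered with its own colour or marker in the concept graph so that the origin of every node is apparent at a glance.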
In a case of discrepancies in the display of the search results, it is recommended to change the settings of the recommender system and to run the recommender system again.
According to an embodiment, the recommender system is configured to handle receipt of a document, execution of the search process, and display of results in different languages. The recommender system is pre-trained, for example, with German and English language models. In another embodiment, the recommender system is configured to display search results in a combination of two different languages, at the same time or at different points in time, based on prespecified user preferences.
The recommender system has many real-life applications, such as an effective educational search and learning platform having an explore functionality, where a user is able to explore a new topic starting from a document even if the user does not know exactly what words to select from the document to search for, or how to formulate an effective search query. An example of a real-life implementation of the disclosed recommender system for the document-based search is with respect to the International Standard on Auditing (ISA) 550. In an example, the document-based explore functionality of the recommender system may be used to understand and evaluate related parties disclosed by an entity (e.g. a client entity). Typically, there are many related party transactions in the normal course of a business. In such circumstances, the related party transactions may carry no higher risk of material misstatement of the financial statements than similar transactions with unrelated parties. However, in some circumstances, the nature of related party relationships and transactions may give rise to higher risks of material misstatement of the financial statements than transactions with unrelated parties. For example, related party transactions may not be conducted under normal market terms and conditions. In addition, related parties may operate through an extensive and complex range of relationships and structures, with a corresponding increase in the complexity of related party transactions (e.g. fulfilling the requirement of ISA 550 paragraph 1).
In such scenarios, the recommender system having the document-based explore functionality supports an auditor in obtaining an understanding of related party relationships and transactions sufficient to: (a) recognize fraud risk factors, if any, arising from related party relationships and transactions that are relevant to the identification and assessment of the risks of material misstatement due to fraud; and (b) conclude, based on the audit evidence obtained, whether the financial statements, insofar as they are affected by those relationships and transactions, achieve fair presentation or are not misleading (e.g. fulfilling requirements of ISA 550 para. 9).
In an example, using the document-based explore functionality of the recommender system, the auditors are able to upload a document, for example, the annual report, the list of shareholdings, or the notes to the financial statements. After uploading the document, the persons, organizations, and locations ("entities") present in the document are automatically extracted by the recommender system. Simultaneously, the relations between the classified named entities (i.e. persons, organizations, locations) are measured and specified. After the extraction is completed, the entities are visualized via the concept graph.
The relationships are weighted on-the-fly by the recommender system, considering the co-occurrence and relative position of the entities within the document. Further, to analyse the completeness of related parties disclosed by the client, an auditor may use the concept graph functionality to identify which entities have been extracted from the client's document, which entities were identified in the databases communicatively coupled to the recommender system, and which entities are evident in both the document and the recommender system (i.e. search platform). By exploring the remaining nodes step-by-step, a user (e.g. an auditor in this case) can easily identify related parties that are included in the client's file but not identified in the database (e.g. a custom database) of the recommender system, and vice versa. When assessing the reasons behind any discrepancies identified, the filter setting on the user interface rendered by the recommender system may be changed to show documents and keywords again. Thus, the recommender system is not only effective and efficient, but also ensures privacy of documents even when online processing is used, because the recommender system processes the document without permanently storing it on the platform.
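The disclosure does not state a particular weighting formula. Purely as an assumed example, the weighting of relationships by co-occurrence and relative position may be sketched as follows: every pair of distinct entities mentioned in the same sentence contributes a weight that decreases with the token distance between the two mentions. The function `relation_weights`, the inverse-distance formula and the sample mentions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def relation_weights(mentions):
    """Weight entity pairs by co-occurrence and relative position.

    `mentions` is an iterable of (entity, sentence_index, token_position).
    Assumed formula: each co-occurrence in the same sentence adds
    1 / (1 + token distance) to the weight of the pair.
    """
    by_sentence = defaultdict(list)
    for entity, sentence, position in mentions:
        by_sentence[sentence].append((entity, position))

    weights = defaultdict(float)
    for sentence_mentions in by_sentence.values():
        for (e1, p1), (e2, p2) in combinations(sentence_mentions, 2):
            if e1 == e2:
                continue  # repeated mention of the same entity
            pair = tuple(sorted((e1, e2)))
            weights[pair] += 1.0 / (1 + abs(p1 - p2))
    return dict(weights)


# Assumed sample mentions; entity names and positions are invented.
mentions = [
    ("Alice", 0, 0), ("Acme Ltd", 0, 3), ("Berlin", 0, 9),
    ("Alice", 1, 2), ("Acme Ltd", 1, 3),
]
weights = relation_weights(mentions)
for pair, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, round(weight, 3))
```

Under this assumption, entities that appear together often and close to one another receive the heaviest links in the concept graph, while distant or one-off co-occurrences receive lighter links.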
DETAILED DESCRIPTION OF THE DRAWINGS
Referring to FIG. 1, illustrated is a block diagram of a recommender system 100, in accordance with an embodiment of the present disclosure. As shown, the recommender system 100 comprises a server arrangement 102. The recommender system 100 is connected to a user device 104 via a communication network 106.
It will be appreciated that for sake of simplicity and clarity, the server arrangement 102 is shown to include a single server. However, the server arrangement 102 can also include a plurality of servers. It will be appreciated that FIG. 1 is merely an example, which should not unduly limit the scope of the claims herein. It is to be understood that the specific designation for the network environment is provided as an example and is not to be construed as limiting the network environment to specific numbers, types, or arrangements of user devices, servers and communication networks. A person skilled in the art will recognize many variations, alternatives, and modifications of embodiments of the present disclosure.
Referring to FIG. 2, illustrated is an exemplary user interface 200 rendered on a user device displaying a set of data items, in accordance with an embodiment of the present disclosure. As shown, the user interface 200 comprises a first user interface section 202 displaying the set of data items. The first user interface section 202 comprises a first data item 204 displaying a video. Moreover, the first user interface section 202 comprises a second data item 206 displaying a document. Furthermore, the first user interface section 202 comprises a third data item 208 displaying a web-page. As shown, the user interface 200 comprises a second user interface section 210 which enables the user to provide an input such as a first document. As shown, the user interface 200 comprises a third user interface section 212 which enables the user to perform a plurality of searches. As shown, the user interface 200 comprises a fourth user interface section 214 that allows the user to switch to different tabs such as search, community, learning, my account and sign out.
It will be appreciated that FIG. 2 is merely an example, which should not unduly limit the scope of the claims herein. A person skilled in the art will recognize many variations, alternatives, and modifications of embodiments of the present disclosure.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a nonexclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural.

Claims (5)

  1. A recommender system comprising: a server arrangement configured to: receive an input from a user device via a communication network, wherein the input is a first document in a first format to initiate a document-based search; extract textual information from the first document independent of permanent storage of the first document or the textual information in the first document; determine a boundary of each named entity of a plurality of named entities from the extracted textual information based on a combination of a statistical system and a rule-based system; classify the plurality of named entities into a set of predefined classes, wherein each predefined class of the set of predefined classes indicates a type of the named entity; retrieve a set of data items in a plurality of different formats from a plurality of different online or offline data sources, based on at least the classified plurality of named entities; and control display of the retrieved set of data items on the user device such that a visual relevancy of the retrieved set of data items with respect to the first document is discernible.
  2. A recommender system according to claim 1, wherein the server arrangement is configured to determine co-occurrence and relative position of the plurality of named entities within the first document, from the extracted textual information.
  3. A recommender system according to claim 1 or 2, wherein the server arrangement is configured to calibrate a relationship among the classified plurality of named entities based on at least the determined co-occurrence and the relative position of the plurality of named entities within the first document.
  4. A recommender system according to any of the preceding claims, wherein the set of data items in the plurality of different formats comprises at least two of: a text-containing document, a video, an audio, an image file, a social media item, a web-page, or another document format.
  5. A recommender system according to any of the preceding claims, wherein the plurality of different online data sources corresponds to a plurality of web-based platforms from which data items are retrieved on-the-fly, and wherein the plurality of offline data sources corresponds to prestored data items in the plurality of different formats in the recommender system.
GB1917709.6A 2019-12-04 2019-12-04 Recommender system for document-based search Withdrawn GB2589608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1917709.6A GB2589608A (en) 2019-12-04 2019-12-04 Recommender system for document-based search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1917709.6A GB2589608A (en) 2019-12-04 2019-12-04 Recommender system for document-based search

Publications (2)

Publication Number Publication Date
GB201917709D0 GB201917709D0 (en) 2020-01-15
GB2589608A true GB2589608A (en) 2021-06-09

Family

ID=69147210

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1917709.6A Withdrawn GB2589608A (en) 2019-12-04 2019-12-04 Recommender system for document-based search

Country Status (1)

Country Link
GB (1) GB2589608A (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
GB201917709D0 (en) 2020-01-15

Similar Documents

Publication Publication Date Title
US11019107B1 (en) Systems and methods for identifying violation conditions from electronic communications
US10095690B2 (en) Automated ontology building
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US20150095320A1 (en) Apparatus, systems and methods for scoring the reliability of online information
Im et al. Linked tag: image annotation using semantic relationships between image tags
Tizard et al. Can a conversation paint a picture? mining requirements in software forums
JP7116435B2 (en) Establishing an entity model
Vysotska et al. Method of similar textual content selection based on thematic information retrieval
US11803600B2 (en) Systems and methods for intelligent content filtering and persistence
US9940354B2 (en) Providing answers to questions having both rankable and probabilistic components
Andrews et al. Creating corroborated crisis reports from social media data through formal concept analysis
Das et al. A CV parser model using entity extraction process and big data tools
Fernandes et al. Automated disaster news collection classification and geoparsing
Bakar The development of an integrated corpus for Malay language
Chakraborty et al. Text mining and analysis
KR101752257B1 (en) A system of linked open data cloud information service and a providing method thereof, and a recoding medium storing program for executing the same
Roslan et al. Biodiversity Knowledge Retrieval Application Using Natural Language Processing Technique
Thakkar Twitter sentiment analysis using hybrid naive Bayes
GB2589608A (en) Recommender system for document-based search
Weischedel et al. What can be accomplished with the state of the art in information extraction? A personal view
Nanni et al. Toward comprehensive event collections
Lv et al. Detecting user occupations on microblogging platforms: an experimental study
Lin et al. Realtime event summarization from tweets with inconsistency detection
Li Event-related collections understanding and services
Dashdorj et al. High‐level event identification in social media

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)