WO2019142094A1 - System and method for semantic text search

Info

Publication number: WO2019142094A1
Application number: PCT/IB2019/050301
Authority: WO
Inventors: Aniruddha PANT; Kedar Swadi; Sangram KAPRE; Pramod PATIL; Ashutosh UKEY; Tejas PETHKAR
Original assignee: Algoanalytics Pvt. Ltd.
Priority date: 2018-01-18
Filing date: 2019-01-15
Publication date: 2019-07-25

Abstract

A system for semantic text search is disclosed. The system includes a personalized data storage subsystem configured to store a plurality of documents in one or more predefined forms. The system includes a search data modelling subsystem configured to analyse one or more text corpus of the plurality of documents and train one or more predictive models based on one or more analysed text corpus of the plurality of documents. The system includes a semantic search subsystem configured to receive a set of search strings from a user through an interface subsystem. The semantic search subsystem is also configured to provide a list of relevant documents from the plurality of documents based on the set of search strings provided by the user and rank the list of relevant documents in an order of relevance and generate a relevant set of keywords which captures the semantic intent of the user.

Description

SYSTEM AND METHOD FOR SEMANTIC TEXT SEARCH

This International Application claims priority from a complete patent application filed in India having Patent Application No. 201821002152, filed on January 18, 2018 and titled “SYSTEM AND METHOD FOR A SEMANTIC TEXT SEARCH”.

FIELD OF INVENTION

Embodiments of the present disclosure relate to semantic search, and more particularly to a system and a method for semantic text search.

BACKGROUND

Semantics is a discipline relating to logic or meaning. Further, semantic search is a type of search process that improves search accuracy by understanding the intent and the context of searches previously performed by a user, which are stored in a database. Semantic search is done either on a web browser or within a closed system. Semantic search depends on various factors such as the search context, the search intent, variations of words, a plurality of synonyms of the searched words, a plurality of generalised queries, a plurality of specialised queries regarding the search, concept matching relating to the search or the like. One type of semantic search is a semantic text search, which is based on natural language searches or queries and provides a plurality of search results. Various search systems are available which are used for semantic text search.

In one such search system, textual semantic search is performed based on a large corpus of documents which is stored in a database of a search engine. The corpus of documents is stored in the database based on a plurality of searches performed by a user on the search engine. However, in such a system the search becomes difficult for the user, because it is hard to come up with a correct and complete set of keywords that semantically describes the set of documents which the user is searching for. Further, the keyword-based search only considers the presence or absence of exactly the same set of keywords provided by the user.

In some other search systems, the text search is performed based on a set of keywords or a set of natural language searches. In such an approach, the system prepares its queries based on the keywords provided by the user. Such a system achieves a very fast search even with a large amount of data based on the set of keywords. However, such a system does not consider semantically or contextually related words which might be used in the corpus of documents. Also, the user is unable to provide feedback or otherwise guide the search process of the system to obtain a highly relevant result.

Hence, there is a need for an improved system and method for semantic text search to address the aforementioned issues.

BRIEF DESCRIPTION

In accordance with an embodiment of the present disclosure, a system for semantic text search is provided. The system includes a personalized data storage subsystem configured to store a plurality of documents in one or more predefined forms. The system also includes a search data modelling subsystem operatively coupled to the personalized data storage subsystem. The search data modelling subsystem is configured to analyse one or more text corpus of the plurality of documents. The search data modelling subsystem is also configured to train one or more predictive models based on one or more analysed text corpus of the plurality of documents. The system further includes a semantic search subsystem operatively coupled to the search data modelling subsystem. The semantic search subsystem is configured to receive a set of search strings from a user through an interface subsystem, wherein the set of search strings comprises at least one of a set of keywords, a set of criteria and a set of documents references. The semantic search subsystem is also configured to provide a list of relevant documents from the plurality of documents based on the set of search strings provided by the user. The semantic search subsystem is further configured to rank the list of relevant documents in an order of relevance and generate a relevant set of keywords which captures the semantic intent of the user.

In accordance with an embodiment of the present disclosure, a method for semantic text search is provided. The method includes storing, by a personalized data storage subsystem, a plurality of documents in one or more predefined forms. The method also includes analysing, by a search data modelling subsystem, one or more text corpus of the plurality of documents. The method further includes training, by the search data modelling subsystem, one or more predictive models based on one or more analysed text corpus of the plurality of documents. The method further includes receiving, by a semantic search subsystem, a set of search strings from a user through an interface subsystem, wherein the set of search strings comprises at least one of a set of keywords, a set of criteria and a set of documents references. The method further includes providing, by the semantic search subsystem, a list of relevant documents from the plurality of documents based on the set of search strings provided by the user. The method further includes ranking, by the semantic search subsystem, the list of relevant documents in an order of relevance and generating a relevant set of keywords which captures the semantic intent of the user.

To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:

FIG. 1 illustrates a block diagram of a system for semantic text search in accordance with an embodiment of the present disclosure;

FIG. 2 illustrates a block diagram of an exemplary system for semantic text search of FIG. 1 in accordance with an embodiment of the present disclosure; and

FIG. 3 illustrates a flow chart representing the steps involved in a method for semantic text search of FIG. 1 in accordance with an embodiment of the present disclosure.

Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.

DETAILED DESCRIPTION

For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as would normally occur to those skilled in the art are to be construed as being within the scope of the present disclosure.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises... a" does not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.

In the following specification and the claims, reference will be made to a number of terms, which shall be defined to have the following meanings. The singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.

Embodiments of the present disclosure relate to a system for semantic text search. The system includes a personalized data storage subsystem configured to store a plurality of documents in one or more predefined forms. The system also includes a search data modelling subsystem operatively coupled to the personalized data storage subsystem. The search data modelling subsystem is configured to analyse one or more text corpus of the plurality of documents. The search data modelling subsystem is also configured to train one or more predictive models based on one or more analysed text corpus of the plurality of documents. The system further includes a semantic search subsystem operatively coupled to the search data modelling subsystem. The semantic search subsystem is configured to receive a set of search strings from a user through an interface subsystem, wherein the set of search strings comprises at least one of a set of keywords, a set of criteria and a set of documents references. The semantic search subsystem is also configured to provide a list of relevant documents from the plurality of documents based on the set of search strings provided by the user. The semantic search subsystem is further configured to rank the list of relevant documents in an order of relevance and generate a relevant set of keywords which captures the semantic intent of the user.

FIG. 1 is a block diagram representation of a system (10) for semantic text search in accordance with an embodiment of the present disclosure. As used herein, the term “semantic search” is defined as a data searching technique in which a search query aims not only to find keywords, but also to determine the intent and contextual meaning of the words a person is using for the search. In general, semantic search finds text based not only on the words, but also on the way the words act upon and modify each other. One embodiment includes applying additional knowledge bases by enriching the text with additional information such as synonyms.

The system (10) includes a personalized data storage subsystem (20) configured to store a plurality of documents in one or more predefined forms. As used herein, the term “document” may be defined as information recorded in a manner which requires a computer or an electronic device to display, interpret, and process it. This includes documents generated by software and stored on volatile and/or non-volatile storage. The document may include, but is not limited to, articles, electronic mail, web pages, tweets, unstructured text records, or a combination thereof. The electronic document may include one or more electronic parsable texts. In a specific embodiment, the one or more predefined forms may include at least one of a set of raw files, a structural data set and an unstructured data set. As used herein, “raw file” is defined as a primary file which is collected from a source and is unprocessed. Similarly, “structural data set” is defined as information with a high degree of organization, such that inclusion in a relational database is seamless and the information is readily searchable by simple, straightforward search engine models or other search operations. In contrast, “unstructured data set” is defined as information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

In one embodiment, the personalized data storage subsystem (20) may include at least one of a relational database management system, a key value pair system, an in-memory storage device and a disk-based storage device. In some embodiments, the personalized data storage subsystem (20) may also be configured to store meta data of the plurality of documents. In a specific embodiment, the meta data may include, but is not limited to, a serial number, a date of document creation, a date of document modification, a date of document last access, a class of the document, an author of the document, a title of the document, a list of related documents and a nature of relation among the plurality of documents. In one embodiment, the personalized data storage subsystem (20) may be configured to allow a user to add, remove or edit the plurality of documents. In some embodiments, the personalized data storage subsystem (20) is also configured to allow the user to add the plurality of documents and the corresponding metadata to one or more text corpus, and to provide access to a subset of the plurality of documents and the corresponding metadata from the one or more text corpus, as selected by a set of criteria set by the user based on the data within the plurality of documents themselves and the corresponding metadata.
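The storage behaviour described above (add, remove and edit documents, keep metadata alongside them, and return a subset selected by user criteria) can be illustrated with a minimal in-memory sketch. The class names `Document` and `PersonalizedStore`, the field names and the sample records are hypothetical and are not taken from the disclosure, which also permits relational, key value pair and disk-based storage.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List


@dataclass
class Document:
    """A stored document plus some of the metadata kinds the description lists."""
    serial_number: int
    title: str
    author: str
    created: date
    doc_class: str
    text: str
    related: List[int] = field(default_factory=list)


class PersonalizedStore:
    """Hypothetical in-memory stand-in for the personalized data storage subsystem (20)."""

    def __init__(self) -> None:
        self._docs: Dict[int, Document] = {}

    def add(self, doc: Document) -> None:
        self._docs[doc.serial_number] = doc

    def remove(self, serial_number: int) -> None:
        self._docs.pop(serial_number, None)

    def edit(self, serial_number: int, **changes) -> None:
        doc = self._docs[serial_number]
        for name, value in changes.items():
            setattr(doc, name, value)

    def subset(self, criteria: Callable[[Document], bool]) -> List[Document]:
        """Return the documents that satisfy a criterion over text or metadata."""
        return [d for d in self._docs.values() if criteria(d)]


# Illustrative records only; the values are made up for this sketch.
store = PersonalizedStore()
store.add(Document(1, "Orange sales", "analyst-1", date(2018, 1, 5),
                   "transaction", "orange sold by company x"))
store.add(Document(2, "Award news", "analyst-2", date(2018, 3, 9),
                   "news", "grapes of company x won an award"))
store.edit(2, title="Grapes award")

# Criterion mixing metadata (class) and document content (text).
news_about_grapes = store.subset(
    lambda d: d.doc_class == "news" and "grapes" in d.text
)
print([d.title for d in news_about_grapes])
```

Passing the criterion as a callable keeps the sketch agnostic about whether the user filters on document content, on metadata, or on both.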

Furthermore, the system (10) includes a search data modelling subsystem (30) which is operatively coupled to the personalized data storage subsystem (20). The search data modelling subsystem (30) is configured to analyse one or more text corpus of the plurality of documents. As used herein, the term “text corpus” is defined as a large and structured set of texts. Examples of a text corpus include the entire Internet, an electronic library, or a document repository. The search data modelling subsystem (30) is also configured to train one or more predictive models based on one or more analysed text corpus of the plurality of documents. In one embodiment, the one or more predictive models may include, but are not limited to, an attention networks model, a recurrent networks model, a convolutional neural networks model or a combination thereof. The one or more predictive models learn continuously from the one or more text corpus stored in the personalized data storage subsystem (20) or from a plurality of updated documents added at later stages.

Moreover, the system (10) further includes a semantic search subsystem (40) which is operatively coupled to the search data modelling subsystem (30). The semantic search subsystem (40) is configured to receive a set of search strings from the user through an interface subsystem (50), wherein the set of search strings includes at least one of a set of keywords, a set of criteria and a set of documents references. In one embodiment, the user may be a machine or a human. The set of criteria may be based on one or more parameters such as, but not limited to, the contents of the document or the corresponding meta data. The set of document references may include, but is not limited to, a document number or a document name. The semantic search subsystem (40) may determine one or more words and one or more phrases which are semantically related to the set of keywords, the set of criteria or the set of reference documents specified in the set of search strings.

The semantic search subsystem (40) is also configured to provide a list of relevant documents from the plurality of documents based on the set of search strings provided by the user and the one or more trained predictive models. The semantic search subsystem (40) is further configured to rank the list of relevant documents in an order of relevance and generate a relevant set of keywords which captures the semantic intent of the user. In one embodiment, the semantic search subsystem (40) may also be configured to provide the list of documents in a manner similar to the set of document references by considering the set of keywords and the set of criteria.
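As a rough illustration of ranking documents by relevance while also returning an expanded keyword set, the following sketch scores documents by cosine similarity over word counts. It is a simplification under stated assumptions: the hand-written `RELATED` table stands in for whatever the trained predictive models would supply, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical table of related terms. In the described system this knowledge
# would come from the trained predictive models, not a hand-written dictionary.
RELATED: Dict[str, List[str]] = {"fruit": ["orange", "grapes"]}


def expand(keywords: List[str]) -> List[str]:
    """Add semantically related terms to the user's keywords."""
    expanded = list(keywords)
    for word in keywords:
        for related in RELATED.get(word, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_documents(
    keywords: List[str], docs: Dict[str, str]
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Return (documents sorted by descending score, expanded keyword set)."""
    terms = expand(keywords)
    query = Counter(terms)
    scored = [(name, cosine(query, Counter(text.lower().split())))
              for name, text in docs.items()]
    ranked = sorted((s for s in scored if s[1] > 0),
                    key=lambda s: s[1], reverse=True)
    return ranked, terms


docs = {
    "d1": "orange sold by company x",
    "d2": "quarterly accounting summary",
    "d3": "grapes and orange prices",
}
ranked, keywords_out = rank_documents(["fruit"], docs)
print([name for name, _ in ranked], keywords_out)
```

Here no document contains the literal keyword "fruit"; the documents are retrieved only because the expanded keyword set includes related terms, which is the behaviour that distinguishes this from exact keyword matching.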

In some embodiments, the system (10) may include an interface subsystem (50) which is operatively coupled to the semantic search subsystem (40). The interface subsystem (50) is configured to visualize or edit the set of search strings provided for the search. In such an embodiment, the interface subsystem (50) may be configured to visualize and edit the set of keywords which are provided for the search, edit a set of scope-limiting criteria which are provided for the search, invoke the semantic search subsystem, visualize the list of documents which are the results of the search and visualize the set of keywords resulting from invoking the search. In another embodiment, the interface subsystem (50) may be configured to select or manually override the ranking of results from the semantic search subsystem, visualize a selected document and important components of the document and make the selected document available to other systems. In yet another embodiment, the interface subsystem (50) may be further configured to save a history of the set of search strings to be accessible and editable for a search at a later time instance. In one embodiment, the interface subsystem (50) may be configured to allow the user to specify the criteria for search, to inspect the results of the search, and to provide feedback to the semantic search subsystem so as to iteratively refine results created earlier. In some embodiments, the above-mentioned subsystems may be connected in a manner that allows the user to progressively arrive at the most desired set of documents as required by the search criteria. Such a scenario is possible by first specifying the criteria via the interface subsystem, and then invoking the semantic search subsystem either via an on-screen user interface (UI) element, or via an application programming interface (API). As used herein, the term “application programming interface” is defined as a set of clearly defined methods of communication among various subsystems. Furthermore, once the results of the semantic search subsystem (40) are available, the set of criteria, the set of keywords or the set of document references may be changed via the on-screen UI element or via an API call. Such a new edited set of inputs may be provided to the semantic search subsystem (40) for another list of semantic search results.
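The iterative loop described above (submit a search string, inspect the results, edit the inputs and search again, with the history saved for later) might be sketched as follows. The names `SearchString`, `SearchSession` and `toy_search`, and the all-keywords matching rule, are assumptions made for illustration; the disclosure does not specify an API shape.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SearchString:
    """One search input: keywords, scope-limiting criteria, document references."""
    keywords: List[str] = field(default_factory=list)
    criteria: List[str] = field(default_factory=list)
    document_refs: List[str] = field(default_factory=list)


class SearchSession:
    """Hypothetical interface-side helper that saves history and re-invokes search."""

    def __init__(self, search_fn: Callable[[SearchString], List[str]]) -> None:
        self._search_fn = search_fn
        self.history: List[SearchString] = []

    def run(self, query: SearchString) -> List[str]:
        self.history.append(query)  # saved so it can be revisited and edited later
        return self._search_fn(query)

    def refine(self, add_keywords: List[str]) -> List[str]:
        """Edit the most recent search string and search again."""
        last = self.history[-1]
        edited = SearchString(last.keywords + add_keywords,
                              list(last.criteria),
                              list(last.document_refs))
        return self.run(edited)


# Toy stand-in for the semantic search subsystem: require every keyword.
CORPUS: Dict[str, str] = {
    "d1": "orange sold by company x",
    "d2": "grapes of company x won an award",
}


def toy_search(query: SearchString) -> List[str]:
    return [name for name, text in CORPUS.items()
            if all(k in text.split() for k in query.keywords)]


session = SearchSession(toy_search)
first = session.run(SearchString(keywords=["company"]))
second = session.refine(["grapes"])
print(first, second, len(session.history))
```

Injecting the search function keeps the session independent of how the search is invoked, which mirrors the description's choice between an on-screen UI element and an API call.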

In one embodiment, although shown as a single system (10), the functionality of system (10) may be implemented as a distributed system. Further, the functionality disclosed herein may be implemented on separate servers or devices which may be coupled together over a network.

FIG. 2 is a block diagram representation of an exemplary system (10) for semantic text search of FIG. 1 in accordance with an embodiment of the present disclosure. The context for the system (10) comes from the analysis of a body of text, called a base text or a text corpus, that is representative of the interests of a particular community, as a guide to interpreting the words in their vocabulary, that is, the aggregate of words in the text corpus. The system (10) includes a personalized data storage subsystem (20) which allows a user (60) to locate information. For example, the personalized data storage subsystem (20) is a company database which stores business information of an enterprise. The business information includes, but is not limited to, business transaction records, products and parties relating to the business transactions, finance and accounting information relating to the business transactions, and news updates published publicly. Every detail of the business information is stored in the company database, which may be accessible over the Internet, an intra-enterprise network, a local area network, a wide area network, or any suitable network. The system (10) also includes a search data modelling subsystem (30) which is coupled to the personalized data storage subsystem (20) (in the example, the company database). The search data modelling subsystem (30) analyses the business information stored in the company database. Further, the search data modelling subsystem (30) uses a corpus of texts with a single point of view to train a predictive model (such as a neural network) to extract the context-specific meaning of the terms appearing in the corpus.

The system (10) further includes a semantic search subsystem (40) and an interface subsystem (50) coupled to each other. The interface subsystem (50) enables the user (60) to input a search string which may be a set of keywords, a set of criteria or a set of document references. The semantic search subsystem (40) receives the search string and returns a list of documents which are relevant to the corresponding search string. Furthermore, the semantic search subsystem (40) arranges the list of relevant documents in an order of relevancy and also generates and highlights a plurality of similar keywords in the list of documents. For example, when the user (60) enters a search string for searching a particular fruit which is a product of a customer ‘x’, the system processes a series of documents known to be about the customer ‘x’ to learn about them. Other texts that are produced either by or for a specific community could similarly be used to train the neural network to learn other contexts. The search string is processed through the same neural network to produce the corresponding semantic profile. Each word in the search string is entered into a text vector and such text vector is then fed to the neural network. The pattern of activation of the hidden units represents a semantic profile of the search terms. Then, the search data modelling subsystem (30) compares the semantic profile of the search terms against the semantic profile of each item of stored information.
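The semantic-profile idea in the preceding paragraph (a text vector fed to a network, the hidden-unit activations taken as the profile, and profiles then compared) can be shown with a toy single hidden layer. This is a sketch under strong assumptions: the vocabulary `VOCAB`, the fixed `WEIGHTS`, the tanh activation and the cosine comparison are all chosen for illustration, whereas a real system would learn its weights from the text corpus.

```python
import math
from typing import Dict, List

VOCAB = ["orange", "grapes", "fruit", "invoice", "tax"]

# Hypothetical hidden-layer weights, one row per hidden unit. They are fixed
# here for illustration; a trained model would learn them from the corpus.
WEIGHTS = [
    [1.0, 1.0, 1.0, 0.0, 0.0],  # unit that responds to fruit-related words
    [0.0, 0.0, 0.0, 1.0, 1.0],  # unit that responds to accounting words
]


def text_vector(text: str) -> List[float]:
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def semantic_profile(text: str) -> List[float]:
    """Hidden-unit activations for the text, used as its semantic profile."""
    vec = text_vector(text)
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in WEIGHTS]


def similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two semantic profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


stored: Dict[str, str] = {"d1": "orange and grapes", "d2": "invoice with tax"}
query_profile = semantic_profile("fruit")
scores = {name: similarity(query_profile, semantic_profile(text))
          for name, text in stored.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The query "fruit" shares no word with document d1, yet both activate the same hidden unit, so their profiles align; comparing profiles rather than raw words is what lets the match succeed.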

The semantic search subsystem (40) processes a text of known relevance, such as the base text or text corpus, to extract the vocabulary and semantic patterns. Such text might consist of a set of articles published in a news or special interest magazine for some period of time. The semantic search subsystem (40) not only searches for the plurality of relevant articles in the news which include the words fruit of company ‘x’ but also searches for the plurality of fruits which the company ‘x’ sells. So, while executing a search string “a particular fruit of company ‘x’”, wherever “orange sold by company ‘x’”, “grapes of company ‘x’ won an award from a wine company” and “grapes at $260” are mentioned in the articles, in the product details or in the transaction details, the semantic search subsystem (40) returns and visualizes, on the interface subsystem (50), the list of documents where the above statements are mentioned. Once the results of the semantic search subsystem (40) are available, the set of criteria, the set of keywords or the set of document references may be changed via the interface subsystem (50) or via an application programming interface (API) call. Such a new edited set of inputs may be provided to the semantic search subsystem (40) for another list of semantic search results.

FIG. 3 is a flow chart representing the steps involved in a method (100) for semantic text search of FIG. 1 in accordance with an embodiment of the present disclosure. The method (100) includes storing, by a personalized data storage subsystem (20), a plurality of documents in one or more predefined forms in step 110. In one embodiment, storing the plurality of documents in the one or more predefined forms may include storing the plurality of documents in the form of at least one of a set of raw files, a structural data set and an unstructured data set. In some embodiments, the method (100) may include storing meta data of the plurality of documents. In such an embodiment, storing the meta data of the plurality of documents may include storing a serial number, a date of document creation, a date of document modification, a date of document last access, a class of the document, an author of the document, a title of the document, a list of related documents and a nature of relation among the plurality of documents.

In one embodiment, the method (100) may include allowing a user to add, remove or edit the plurality of documents. In such an embodiment, the method (100) may include allowing, by the personalized data storage subsystem, the user to add the plurality of documents and the corresponding metadata to one or more text corpus, and providing access to a subset of the plurality of documents and the corresponding metadata from the one or more text corpus, as selected by a set of criteria set by the user based on the data within the plurality of documents themselves and the corresponding metadata.

Furthermore, the method (100) also includes analysing, by a search data modelling subsystem, one or more text corpus of the plurality of documents in step 120. The method (100) further includes training, by the search data modelling subsystem, one or more predictive models based on one or more analysed text corpus of the plurality of documents in step 130. In one embodiment, training the one or more predictive models based on one or more analysed text corpus of the plurality of documents may include training an attention networks model, a recurrent networks model and a convolutional neural networks model based on one or more analysed text corpus of the plurality of documents.

Moreover, the method (100) further includes receiving, by a semantic search subsystem, a set of search strings from a user through an interface subsystem, wherein receiving the set of search strings includes receiving at least one of a set of keywords, a set of criteria and a set of documents references in step 140. In one embodiment, the method (100) may include determining, by the semantic search subsystem, one or more words and one or more phrases which are semantically related to the set of keywords, the set of criteria or the set of reference documents specified in the set of search strings. The method (100) further includes providing, by the semantic search subsystem, a list of relevant documents from the plurality of documents based on the set of search strings provided by the user and one or more trained predictive models in step 150.

In addition, the method (100) further includes ranking, by the semantic search subsystem, the list of relevant documents in an order of relevance and generating a relevant set of keywords which captures the semantic intent of the user in step 160. In some embodiments, the method (100) may include providing, by the semantic search subsystem, the list of documents in a manner similar to the set of document references by considering the set of keywords and the set of criteria. In a specific embodiment, the method (100) may include visualizing or editing, by an interface subsystem, the set of search strings provided for the search. In one embodiment, the method (100) may include saving a history of the set of search strings to be accessible and editable for a search at a later time instance.
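The order of steps 110 to 160 can be traced end to end with a deliberately small sketch. As an assumption made purely for illustration, the "predictive model" here is a word co-occurrence table rather than the attention, recurrent or convolutional networks named in the description, and all function names and sample documents are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def store(raw: Dict[str, str]) -> Dict[str, str]:
    """Step 110: keep the documents (here, a plain in-memory copy)."""
    return dict(raw)


def analyse(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Step 120: analyse the text corpus (here, lowercase tokenisation)."""
    return {name: text.lower().split() for name, text in docs.items()}


def train(tokens: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Step 130: count how often two words share a document.

    This co-occurrence table is a stand-in for a trained predictive model.
    """
    model: Dict[str, Counter] = defaultdict(Counter)
    for words in tokens.values():
        unique = set(words)
        for word in unique:
            for other in unique - {word}:
                model[word][other] += 1
    return dict(model)


def semantic_search(
    keywords: List[str],
    tokens: Dict[str, List[str]],
    model: Dict[str, Counter],
    top_related: int = 1,
) -> Tuple[List[str], List[str]]:
    """Steps 140 to 160: receive keywords, find documents, rank, suggest keywords."""
    related: Counter = Counter()
    for keyword in keywords:
        related.update(model.get(keyword, Counter()))
    for keyword in keywords:
        related.pop(keyword, None)  # do not suggest what the user already typed
    suggestions = [word for word, _ in related.most_common(top_related)]
    terms = set(keywords) | set(suggestions)
    scored = [(name, sum(1 for w in set(words) if w in terms))
              for name, words in tokens.items()]
    ranked = [name for name, score
              in sorted(scored, key=lambda s: s[1], reverse=True) if score > 0]
    return ranked, suggestions


documents = store({
    "d1": "grapes wine award",
    "d2": "grapes wine price",
    "d3": "tax invoice",
})
corpus_tokens = analyse(documents)
cooccurrence = train(corpus_tokens)
results, suggested = semantic_search(["grapes"], corpus_tokens, cooccurrence)
print(results, suggested)
```

The suggested keyword is learned from the stored documents rather than supplied by hand, which loosely mirrors training on the analysed corpus before any search string is received.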

In some embodiments, the method (100) may include allowing, by the interface subsystem, the user to specify the criteria for search, to inspect the results of the search, and to provide feedback to the semantic search subsystem so as to iteratively refine results created earlier.

Various embodiments of the system and method (100) for semantic text search described above enable an efficient response to a request to perform the search in the storage subsystem based on semantic relationships of the search term and an entity in view of one or more transactions associated with the entity, and enable presenting a search result of the search.

Furthermore, the system generally increases the efficiency of the search in the storage subsystem when the system provides the list of documents in a manner similar to the set of document references by considering the set of keywords and the set of criteria.

In addition, the aim of the disclosure therefore is to find a balance between performance and quality of the search that is acceptable within the boundaries of a specific business case. It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the disclosure and are not intended to be restrictive thereof.

While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.

The figures and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, order of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts need to be necessarily performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples.

Claims

WE CLAIM:

1. A system (10) for semantic text search comprising: a personalized data storage subsystem (20) configured to store a plurality of documents in one or more predefined forms; a search data modelling subsystem (30) operatively coupled to the personalized data storage subsystem (20), and configured to: analyse one or more text corpus of the plurality of documents; train one or more predictive models based on one or more analysed text corpus of the plurality of documents; a semantic search subsystem (40) operatively coupled to the search data modelling subsystem (30), and configured to: receive a set of search strings from a user through an interface subsystem (50), wherein the set of search strings comprises at least one of a set of keywords, a set of criteria and a set of documents references; provide a list of relevant documents from the plurality of documents based on the set of search strings provided by the user and the one or more trained predictive models; and rank the list of relevant documents in an order of relevance and generate a relevant set of keywords which captures the semantic intent of the user.

2. The system (10) as claimed in claim 1, wherein the one or more predefined forms comprises at least one of a set of raw files, a structural data set and an unstructured data set.

3. The system (10) as claimed in claim 1, wherein the personalized data storage subsystem (20) comprises at least one of a relational database management system, a key value pair system, in-memory storage device and a disk-based storage device.

4. The system (10) as claimed in claim 1, wherein the personalized data storage subsystem (20) is further configured to store meta data of the plurality of documents.

5. The system (10) as claimed in claim 1, wherein the one or more predictive models comprises at least one of an attention networks model, a recurrent networks model and a convolutional neural networks model.

6. The system (10) as claimed in claim 1, wherein the semantic search subsystem (40) is further configured to provide the list of documents in a manner similar to the set of document references by considering the set of keywords and the set of criteria.

7. The system (10) as claimed in claim 1, wherein the user comprises a human or a machine.

8. The system (10) as claimed in claim 1, wherein the interface subsystem (50) is also configured to visualize or edit the set of search strings provided for the search.

9. The system (10) as claimed in claim 1, wherein the interface subsystem (50) is further configured to save a history of the set of search strings to be accessible and editable for a search at a later time instance.

10. A method (100) comprising: storing, by a personalized data storage subsystem, a plurality of documents in one or more predefined forms (110); analysing, by a search data modelling subsystem, one or more text corpus of the plurality of documents (120); training, by the search data modelling subsystem, one or more predictive models based on one or more analysed text corpus of the plurality of documents (130); receiving, by a semantic search subsystem, a set of search strings from a user through an interface subsystem, wherein the set of search strings comprises at least one of a set of keywords, a set of criteria and a set of documents references (140); providing, by the semantic search subsystem, a list of relevant documents from the plurality of documents based on the set of search strings provided by the user and the one or more trained predictive models (150); and ranking, by the semantic search subsystem, the list of relevant documents in an order of relevance and generating a relevant set of keywords which captures the semantic intent of the user (160).

11. The method (100) as claimed in claim 10, further comprising visualizing or editing, by the interface subsystem, the set of search strings provided for the search.