GB2586002A - Improved method and system for text based searching


Info

Publication number
GB2586002A
Authority
GB
United Kingdom
Prior art keywords
words
search
word
user
database
Prior art date
Legal status
Withdrawn
Application number
GB1901832.4A
Other versions
GB201901832D0 (en)
Inventor
Levett David
Current Assignee
All Street Res Ltd
Original Assignee
All Street Res Ltd
Application filed by All Street Res Ltd
Priority to GB1901832.4A
Publication of GB201901832D0
Priority to PCT/GB2020/050296 (WO2020161505A1)
Publication of GB2586002A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model


Abstract

A method and system for text-based searching are provided. The method includes obtaining a user input and searching a word model to produce a list of candidate words related to the user input; the user then selects one or more of the candidate words to form a set. The set of words is translated into a set of mathematical representations; a document database is searched for one or more documents having a word or group of words whose individual mathematical representations correspond to members of the set; and a portion of one or more documents containing the word or group of words that correspond to members of the set is then retrieved from the document database. The method searches the database according to a semantic concept, such that the user who performs the method is presented with semantically similar results. The set of mathematical representations may be a set of vectors in a natural language processing vector model. Forming the word set may involve calculating weights for each candidate word as a measure of semantic similarity between them and the input.

Description

IMPROVED METHOD AND SYSTEM FOR TEXT BASED SEARCHING
TECHNICAL FIELD
The application relates to a method and system for improved text based searching, and for providing a text portion relevance score in relation to search results.
BACKGROUND ART
In today's knowledge economy, in which millions of documents are readily available over networked computers such as the Internet, finding the right document in a timely and efficient manner can be difficult. In order to make retrieval of on-line documents possible, sophisticated search tools have been developed, such as web browsers or database query engines that run on the user's computer, and which generally accept a string of free text, input by the user, for searching. Such tools try to match the input text against a database of available documents, and the documents deemed to be the most relevant are returned to the user in a list of results or 'hits'. In the context of the Internet, results lists will typically include a list of the URLs (Uniform Resource Locators) at which the documents can be found, as well as a brief description of the document itself.
Popular searching techniques are, however, often fairly broad, typically failing to provide sufficient access into the content of the document. In other words, although a search algorithm might return a list of relevant documents, a searcher may still have to individually open and read through each of the documents to determine if it is in fact relevant. If a document is relevant, it can take time to discover the particular part of the document that actually answers the searcher's query. If a document is not relevant to the user, but has nevertheless been identified as relevant by the search algorithm, it can take considerable time to read through the document and be satisfied that the document is not in fact relevant and can be discarded.
Artificial Intelligence can be applied to the problem of searching documents and providing results. However, there are a number of problems with such techniques. First, approaches based on artificial intelligence typically involve machine learning algorithms that must first be trained on a corpus of material. Results are displayed to a user and the user is asked to feed back their perceived relevance. In this way the machine learning algorithm learns which results were well received and which were not, and adjusts itself accordingly.
This is laborious, and can mean that results outside of the scope of the training corpus are not dealt with as effectively. Second, as far as the user is concerned the operation of a machine learning algorithm is opaque. The algorithm is trained to give a particular output in response to a particular input, but the user does not know why the algorithm has learned that response, and cannot easily adjust the results if they are not satisfactory.
The issues noted above are especially problematic for analysts and researchers who need to efficiently uncover relevant information and compile it into a report, and who may additionally need to track precisely where in a document relevant information was found, and provide this for references and auditing.
For example, when investors are looking where to invest, they will read several financial analysis reports on different sectors or economic entities. The financial reports may include investment analysis reports, equity research reports and the like. An equity research report is a document that is usually prepared by an analyst. The research report may focus on a specific security or industry sector, or on a geographic region or country, for example.
Research reports such as this contain information from a variety of sources such as public records, journals and the internet. The information contained in a research report may include relevant metrics regarding the topic or economic entity at the focus of the report, the names of key shareholders, statistical analysis regarding the state of a particular market, and the profitability, future strategy and challenges facing the economic entity or sector in question. Access to the reports may be sold by the analyst to investors.
As noted above, obtaining all this information from a variety of sources can be time consuming and expensive for the research analyst and investor alike. For these reasons, analysts and investors tend to produce and subscribe to reports that focus on large economic entities or sectors, because information regarding these entities and sectors is more readily available and as such investment opportunities can be identified more easily.
On the other hand, small economic entities such as small and medium sized enterprises (SMEs) are rarely the focus of research reports. This means that SMEs get less exposure to investors and as a result find it difficult to attract investment. This also means that investors are not informed of potential investment opportunities with smaller economic entities. In the US and Europe alone there are 50 million such SMEs, yet only 10,000 are traded on the capital markets.
Furthermore, due to the time spent producing the research report, the financial analyst can charge relatively large amounts of money for access to the research report. This means that smaller or individual investors are unlikely to be able to afford access to the research reports.
It would be beneficial if the time spent by the analyst in this example working on producing a research report was minimised, such that the cost of accessing the report is also reduced. This would allow more reports to be produced, such that smaller economic entities can also be the focus of reports, and would further allow smaller or individual investors to access the reports due to the lower cost of accessing the report.
In order to effectively reduce the time spent by the analyst in this example on producing the report, and therefore the cost of the report, language recognition techniques may be used to identify relevant pieces of information that can be included in the report. However, language recognition techniques can be difficult to implement due to the many-to-many relationship between words and meanings in language. One word may have many meanings, and similarly many words can define the same concept, or mean the same thing.
This gives rise to lexical ambiguity, which can make extracting relevant information using language recognition techniques difficult without using vast amounts of training data as discussed above.
Previous known methods of performing a text-based search include methods such as classical text searching and syntax analysis. Classical text searching matches content in a search string with content being searched in a database such as the internet. This method uses some general knowledge and/or a list of common misspellings to identify search results that have a sequence of characters that are very similar to the search string. This method is simple and non-expansive, in that the search is not broadened beyond the characters of the search string. Examples of this style of text-based search include simple web-browsing search applications.
A method which focuses on syntax analysis attempts to evaluate the grammatical structure of content being searched to identify features of the content such as nouns, verbs and adjectives. This method may involve referring to a syntax tree, which provides information regarding the relationship between words in a sentence according to the grammar and structure of the sentence. This method may produce a reasonable grammatical structure from which more relevant results can be found than simple text searching, but the process is computationally expensive and fails to identify the meaning of words, so content is not searched based on meaning.
Recent advances in machine learning used to model and perform pattern matching using artificial "neural networks" have been employed as an alternative to algorithmic or rules-based analysis of natural language text. These approaches typically consume a substantial corpus of manually labelled "training" material, often from many thousands of source documents. Numerical parameters or "weights" in the network are adjusted to identify correlations between the input data and the training labels.
The process of identifying sufficient relevant source material and manually labelling it is arduous, and insufficient material results in biased models. As an example, when such a system was trained on manually labelled paragraphs relating to a "Labour Relations" category, the algorithm detected that a significant proportion of the content came from documents sourced from vehicle manufacturers, which produced a reliable but undesired bias in the model.
We have appreciated that it would be beneficial to allow analysts to interactively define categories to be identified in terms of their semantic content, without the need for a labelled training corpus. This approach means that it is possible to efficiently and accurately define the semantic context for a desired category which can then be searched for in natural language content such as company reports. Thus we provide a method for searching and extracting portions of documents based on semantic analysis of the documents, which is both fast and accurate, without the need for vast amounts of training data.
SUMMARY OF THE INVENTION
The invention is defined in the independent claims to which reference should now be made. Advantageous features are defined in the dependent claims.
The present invention described herein relates to a computer implemented method and system for performing semantic analysis on words and groups of words retrieved from structured or unstructured documents, to ultimately provide accurate and semantically meaningful results from a text based search. The present invention therefore focuses on searching according to the meaning of words or group of words, rather than the sequence of characters or the grammatical structure of sentences.
Furthermore, when compared with other attempts to use semantic analysis in language recognition techniques, the present invention is more robust and efficient, as it does not require vast amounts of training data as is common in previous approaches.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of illustration only, and with reference to the drawings, in which:
Figure 1 is a flow diagram of the method according to the present invention;
Figure 2 is a detailed flow diagram of step 102 of the method according to Figure 1;
Figure 3 is an example of a semantic structure of nodes according to the present invention;
Figure 4 is an example of a user selection display according to the present invention;
Figure 5 is a detailed flow diagram of step 103 of the method according to Figure 1;
Figure 6 is an example of a portion of text extracted by the method according to the present invention;
Figure 7 is an illustration of the system according to the present invention; and
Figure 8 is a detailed flow diagram of the method according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
According to a first embodiment of the invention, a computer implemented method for searching for text content and producing a relevance score is discussed here with reference to Figure 1. The method according to Figure 1 comprises steps 101 to 106.
The method comprises a multi-stage search algorithm that enables large numbers of documents to be searched and relevant portions of documents to be quickly presented to a user. In an initial stage of the search, a user is asked to select one or more search terms from a word relationship model, provided to the user based on the subject of the search. As well as providing increasing levels of granularity for the search, the use of the word model allows the user to eliminate from the search unwanted interpretations or meanings of a search term, thereby allowing the search to proceed more efficiently. In this step of the method, for example, the user may be presented with a user interface which allows them to explicitly include or exclude meanings for words that have a number of different interpretations. In a second stage of the search, the selected search terms are converted into a set of mathematical representations for mapping against words found in documents in a searchable database. As meanings of words that are not of interest have been ruled out by the user in the first stage by the word model, the conversion of the search terms to the mathematical representation, and the searching of the database on the basis of the mathematical representation can proceed in an efficient manner.
The method is preferably implemented in software on a computing device. As well as the functional steps of the method described in more detail below, it will be appreciated that the software, when run on the computer, will cause a graphical user interface to be displayed on a screen associated with the computing device, for receiving input from a user, and for presenting the results of the search to the user. The computing device may be a personal computer, a networked workstation, a portable computing device, such as a laptop, a smartphone or a tablet computer, a games console, a Virtual Reality headset, or the like. The steps of the method may be carried out while the computing device is on or off-line, and where a database is referred to, this may be implemented locally or over a network to which the computing device has access. The computing device may comprise a memory. The memory may be implemented using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk.
The computing device may further comprise a processor, and various peripheral interface modules, the peripheral interface modules being, for example, a keyboard, a click wheel, buttons, and the like. The computing device may also include a power supply component configured to implement power management on the device, a wired or wireless network interface configured to connect the device to a network, and an input/output interface. The device may operate based on an operating system such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like stored in the memory.
The methods of the present invention described herein may be provided on a non-transitory computer readable storage medium including instructions, such as a memory including instructions, executable by the processor of the computer device. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device, and so on.
Particular embodiments of the invention may include a user interface that assists the user in writing a document. In this context, a document model may be provided to the user for completion, and within the document model, there may be a list of suggested sections and space within each section for the user to enter corresponding text. Searches may then be performed within the context of each section that needs to be completed. Further, if a document model is used in which there are suggested sections for completion, the results of any search may be displayed within the user interface in a position corresponding to the relevant section, allowing portions of the results to be added to the relevant section of the user document being completed. This will be described in more detail later.
The method will now be described in more detail according to Figure 1.
In step 101 of the method, a connection is made to a database. The database includes one or more documents, and may be accessed via a memory, or the internet, for example. The database may initially be empty, in which case step 101 includes uploading, saving or otherwise inputting documents to the database. Alternatively, the database may be predefined, with one or more documents pre-saved or uploaded to the database. The aim of this step is simply to ensure that the method according to the present invention is able to communicate with the database. The database need not be connected to at the start of the method according to Figure 1, and alternatively step 101 may be performed at any time before step 104. The database may be updated before, during, or after the method has been performed.
The database may be a private database provided specifically for use by analysts and researchers and not generally available to the public. The documents are understood to contain potentially relevant information for use by analysts and researchers when generating reports. The documents are therefore human readable in the sense that once they are discovered by the user in a search, they can be displayed on screen and read. The documents may be stored in any suitable text based format, such as portable document format (.PDF), the document format (.doc or .docx) from Word™, HTML and so on.
In step 102 of the method, a user input is received. The user input is used to determine what the user would like to search the database for. The user input therefore relates to the broad concept of the search of the database that the user intends to perform. The user may create the concept themselves, by inputting text. In this case, the user may then also create one or more category labels that are commonly associated with the concept of the search. Similarly, the user may optionally input words or combinations of words that appear in text related to the category labels. For instance, the concept for the search may be 'Supply Chain' and the user may then also input a related category label such as 'logistics'. Finally, the user may then input words or combinations of words related to the category label, such as 'shipping, delivery, transportation' for example. Once the user has inputted the concept, category label, and words related to the category label, the combination of these different aspects may be saved for future searches.
Alternatively, any one or more of the concept, category label, or words related to the category label may be preset. In this case, in step 102, the user input is a selection of a concept, and at least one category label for use in defining a topic for searching. The user may also optionally select related words. In this case the category label is a predefined word or phrase from a predefined list of category labels related to the concept for the search, and corresponds generally to a top level search term.
Once the user has either selected or inputted the concept and one or more category labels, a word model is searched according to the input of the user in step 102. The word model is a semantic model that stores a plurality of words, such as the WordNet model (https://wordnet.princeton.edu) developed at Princeton University. Each word in the word model is represented by a node such that different nodes correspond to different words. Connections exist between nodes. When a node is connected to another node, the words to which the connected nodes correspond are semantically similar, and share a similar meaning, or are otherwise related. For example, a node corresponding to a category label 'capital' may be connected to nodes corresponding to the words 'capital letters', 'state capital', 'venture capital' and 'capital punishment'. A node can be connected to many nodes. Furthermore, nodes can be connected in a chain or tree like fashion, such that a first node is connected to a third node through a second node. This connection is an example of an indirect connection between the first and third node.
Each category label in the word model is connected to at least one sub-category label in the word model. In general, the category labels are broad terms and each are linked to several sub-category labels which relate to their meaning. For instance, the category label 'telecommunications' may have sub-category labels such as 'carrier', 'operator', 'mobile' and 'manufacturer'. Similarly, the category label 'EMEA' may have sub-category labels such as 'Western Europe', 'Germany' and 'EU'. As noted above, the category labels, may be tailored to assist with the likely sections in a report to be generated, or as with web browsers may simply be top level generic search terms for wider searches defined by the user.
In step 102, the sub-category labels that are connected to the category label are retrieved from the word model. In more detail, for each selected or inputted category label, the word model is searched for a node that corresponds to a match to the category label. Once the category label is matched, the word model is explored to find connected nodes corresponding to semantically similar words to the category label. The number of steps taken between the matched node and a connected node, or in other words, the number of nodes between the matched node and an indirectly connected node, is recorded to define a ranking of semantic similarity between the category label and words from the word model. This ranking represents how easily a connected node can be found given the matched node. The more steps that are taken between the matched node and another particular node, the lower the ranking of similarity between the matched node and the particular node. Therefore, when a node is connected directly to the matched node corresponding to the category label, the connected node is ranked higher than a node that is indirectly connected to the matched node.
The nodes that are explored in the word model are also ranked according to how frequently they are found by the search of the word model. A node in the word model may connect to the matched node multiple times through different connections, some of which may be direct and some of which may be indirect. Nodes which are connected by a plurality of paths to the matched node are ranked higher than nodes only connected by one path.
The sub-category labels discovered by the search of the word model are returned to the user as suggestions if they are ranked higher than a threshold ranking or score in the two rankings discussed above. It is to be understood that the two ranking systems discussed above are preferably combined and compared to a single ranking threshold, but may exist separately and be compared to separate ranking thresholds.
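By way of illustration only, the sketch below applies the two rankings to a toy word graph: a breadth-first walk records both the step distance from the matched node and the number of times each node is discovered along different paths, and the two measures are combined into a single score for comparison against a threshold. The adjacency mapping, the combined scoring formula and the name rank_related_words are assumptions made for illustration, not the exact algorithm of the invention.

```python
from collections import defaultdict, deque

def rank_related_words(graph, matched_node, max_steps=3):
    """Rank nodes connected to the matched node.

    Nodes reachable in fewer steps rank higher, and nodes discovered
    along several different paths rank higher still, mirroring the two
    rankings described above.

    graph: dict mapping each word node to the set of nodes it connects to.
    """
    distance = {matched_node: 0}
    path_count = defaultdict(int)
    queue = deque([matched_node])
    while queue:
        node = queue.popleft()
        if distance[node] >= max_steps:
            continue
        for neighbour in graph.get(node, ()):
            path_count[neighbour] += 1        # times this node is discovered
            if neighbour not in distance:     # first (shortest) discovery
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    # Combine the two rankings into one score: closer nodes and more
    # frequently discovered nodes score higher.
    scores = {n: path_count[n] / distance[n] for n in distance if n != matched_node}
    return sorted(scores, key=scores.get, reverse=True)

# Example: suggest sub-category labels for 'capital'.
word_graph = {
    "capital": {"venture capital", "state capital", "capital letter"},
    "venture capital": {"capital", "endowment", "assets"},
    "state capital": {"capital"},
    "capital letter": {"capital"},
    "endowment": {"venture capital", "assets"},
    "assets": {"venture capital", "endowment"},
}
print(rank_related_words(word_graph, "capital"))
```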
When the sub-category labels are returned to the user, they are presented to the user for selection or deselection. Sub-category labels selected by the user at this stage will appear in the subsequent stages of the search, while deselected words will not. This allows the user to tailor their search quite specifically and greatly reduces the computation needed for later stages of the search. For example, in the context of a financial search, if the user selects "capital", they may then select "venture capital" for inclusion, but deselect "state capital", "capital punishment" and so on.
In this embodiment, the user indicates which sub-category labels are relevant to the desired semantic meaning of the category label in question. Selected sub-category labels are deemed relevant, and deselected sub-category labels are not relevant to the meaning of the category label intended or needed for the user's search.
The selected sub-category labels may then be identified and matched in the word model in the same way the original category label is, as discussed above. In this case, the matched nodes in the word model therefore correspond to the originally selected at least one category label and the selected sub-category labels. The matched nodes are then used as a starting point to discover other related words by navigating across relationships to other words, represented in the word model by connections to other nodes. The number of relationship steps required to discover a related word, and the number of times a word is discovered from different starting words, is used to prioritize related words in terms of importance, as discussed above with reference to the initial category label. In other words, the process of producing a ranking of nodes connected to the matched nodes repeats once the user has selected sub-category nodes. The ranking is then used as before for comparison to a ranking threshold to determine if more words from the word model should be included in the search.
This additional step is implemented to expand the number of selected words from the user selection of category and sub-category labels to also include words from the word model that are similar in meaning. The selected words and related words form the list of words to be translated into a mathematical representation of the original concept, category and sub-category labels.
The process of step 102 may repeat for different concepts and category labels, with the user selecting or deselecting sub-category labels provided by the word model, until the user is satisfied with the focus of the search.
In step 103 of the method, the selected list of words from the word model are translated to a mathematical representation of the selected list of words. The mathematical representation may include a matrix, set of vectors, a polynomial or other mathematical function capable of describing the semantic relationship between the words in the selected list of words and the original user selected or inputted concept.
In one embodiment, the mathematical representation is a set of numerical vectors in a natural language processing (NLP) vector model.
The numerical vectors are N dimensional, where N is a positive integer, and exist in an N dimensional vector space. The method of translating the selected words into numerical vectors is performed according to techniques or models such as Word2Vec (Mikolov, Tomas; et al., "Efficient Estimation of Word Representations in Vector Space", 2013). The Word2Vec model is dependent on several features including word co-occurrence and analogous words.
These features help identify the meaning of a particular word within the vector space. Translating the selected words into vectors makes it possible to mathematically manipulate the selected words and compare them to other words in the search. Generally, vectors of a similar direction in the N dimensional vector space are considered to be similar. The dot product or cosine distance of two vectors can thus be used as a measure of their similarity. The translation of the list of selected words into the set of vectors may be performed using a dictionary look-up, whereby the dictionary includes words and their corresponding vectors. The vectors themselves can be determined by traditional linear algebra approaches such as co-occurrence matrices and Singular Value Decomposition.
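As an illustration of this similarity measure, the following sketch computes the cosine similarity of two word vectors directly with NumPy; the three-dimensional toy vectors stand in for the N dimensional vectors of the model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: close to 1.0 for
    vectors pointing in a similar direction, near 0.0 for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for the N-dimensional model.
shipping   = np.array([0.9, 0.1, 0.2])
delivery   = np.array([0.8, 0.2, 0.1])
punishment = np.array([0.1, 0.9, 0.3])

print(cosine_similarity(shipping, delivery))    # high: similar direction
print(cosine_similarity(shipping, punishment))  # low: dissimilar
```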
Once the list of words is translated to vectors in the NLP vector model, a clustering algorithm is used to discover a set of vectors that efficiently represents the words associated with the semantic concept of the model. The cluster produced by the clustering algorithm includes some vectors according to the selected words from the word model, identified in step 102. Vectors that do not correspond to the selected words originate from the NLP vector model, and are included in the cluster or disregarded according to their similarity to the vectors that correspond to the list of selected words. The original set of vectors that correspond to the list of selected words may include over 4000 vectors. The clustering algorithm produces a smaller subset of vectors that efficiently represents the semantic concept of the selected list of words. The subset of vectors produced by the clustering algorithm may include approximately 100 vectors.
When the database is eventually searched in step 104, the search is therefore performed with regard to the common meanings of the words the user initially selects or inputs in the concept, categories and sub-categories, and not only the physical words themselves. In step 104 of the method, the database that is connected or uploaded to in step 101 is searched and compared with the mathematical representation of the list of selected words. According to the embodiment wherein the mathematical representation is a vector set, this means comparing the database with the vectors in the vector cluster of the NLP vector model. For efficient analysis of the content of the database, the content is parsed such that common words are removed. Common words include words such as 'and', 'it' and 'the', for example.
The remaining content of the database is then searched according to a 'first search' for occurrences of words that correspond to the vectors within the vector cluster. In the first search, when a word or group of words are matched to a vector within the vector cluster, the location of the matched word or group of words within the database is recorded. This may include recording the file, paragraph and/or sentence in which a match is observed within the database. This process repeats for every match within the database.
In step 105 of the method, the portions of the database content identified as containing matches in step 104 of the method are ranked or otherwise scored according to their similarity with the vector cluster. The portions may be files, paragraphs or individual sentences of content from the database. The files, paragraphs and/or sentences in which a match was recorded according to step 104 of the method are extracted for a 'second search'. The second search is more detailed than the first search and does not involve parsing the content of the database to remove common words as is done in step 104 of the method. Instead, the second search analyses each word of the database portions in which a match was recorded in step 104 of the method.
The second search measures the extent to which the match between the word or group of words in the content of the database and the vector cluster persists throughout each portion of the database. Sentences with multiple matches and/or words with a high semantic similarity to the vector cluster will be ranked higher by the second search than sentences with one random match that is disparate from the rest of the teaching or subject of the sentence. In other words, the second search takes semantic consistency into account, so that sentences that consistently focus on the semantics of the vector cluster are ranked higher than sentences that are vague or inconsistent.
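One plausible way to realise such a consistency measure, sketched below under the assumption that the words of each sentence have already been translated into vectors, is to take each word's best cosine similarity to any cluster vector and average over the whole sentence; the averaging scheme is an illustrative assumption, as the description does not fix a formula.

```python
import numpy as np

def sentence_score(sentence_vectors, cluster_vectors):
    """Rank a sentence by semantic consistency with the vector cluster.

    For each word vector in the sentence, take its best cosine
    similarity to any cluster vector, then average over the sentence,
    so a sentence that matches the cluster throughout outranks one
    containing a single stray match.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    best_per_word = [max(cos(w, c) for c in cluster_vectors)
                     for w in sentence_vectors]
    return sum(best_per_word) / len(best_per_word)
```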
In step 106 of the method, the ranking or scoring of portions of content from the database are used to inform what portions of the database content are to be included in a search result document.
The search result document may include whole sentences or paragraphs from the database content according to the ranking or scoring of sentences and paragraphs in step 105.
Once the database content for inclusion in the search result document is decided, the result document is generated. The generated result document includes direct copies of the portions of the database that were determined to be included in the search result document, or may alternatively include links to such documents.
The user can then access the search result document, in order to obtain information on the initial area on which they intended to focus according to the selections made in method step 102.
As noted above, the results from the search may be presented to the user by means of a user interface, with the paragraphs or document sections having the highest score for similarity ranked first, and those having lower scores for similarity ranked further down. In the context of an embodiment for generating a report, the discovered paragraphs or document sections may be presented within the user interface in the document model, and may be editable. In this way, the user can select paragraphs or document sections for inclusion in their report and may even edit such sections. An example of this is discussed later in connection with Figure 6.
Each step of the method according to Figure 1 will now be discussed in more detail, making reference to alternative embodiments where necessary.
Figure 2 illustrates a more detailed view of step 102 of Figure 1.
In step 102a, the user inputs or selects a concept for the search of the database that the user intends to perform. The user may create the concept themselves, by inputting text, or may select the concept from a database.
Optionally, the user may select a domain according to the concept for the search. Alternatively, the domain may be selected automatically based on the concept selected or inputted by the user. The domain may comprise a domain-specific word model and/or a domain-specific mathematical representation model, such as a vector model. Each domain may therefore have a different word model and mathematical model.
In step 102b, the user is presented with at least one category label. The user selects at least one category label according to what topic or area they would like to focus the search on. As discussed previously with reference to Figure 1, examples of category labels selectable by the user may be 'telecommunications', 'Europe' or 'business strategy'.
When the method is performed on a computer, the user may be presented with several soft keys, each soft key representing a separate category label. Alternatively, the user may be presented with a drop-down box which includes the at least one category label. Furthermore, a search function may be included such that the user can search for a category label.
The number of category labels may be predetermined. In one embodiment there are 2000 category labels. Alternatively, the user may input a category label according to the concept of the search.
Optionally, the category labels are predetermined based on a selected domain as discussed above. The category labels may thus be a sub-set of a broad topic title or area of interest.
In step 102c, the word model is searched for words relating to the selected category label from step 102b.
The word model is represented by a semantic structure of nodes and connections as discussed with reference to Figure 1. Each category label from step 102b is represented by a node in the word model. Each node that represents a category label is connected to at least one other node in the word model. The at least one other node contains a word that is related to the category label to which it is connected. Furthermore, the at least one other node may be connected to at least one other node which contains a different related word.
There are different types of relationship between words and category labels that result in a connection between their respective nodes in the word model. These relationships may include, for example, hypernyms, hyponyms, holonyms, meronyms, dictionary definitions and the like.
Each of the relationships is represented by a connection between two related nodes. The connections between nodes may be assigned a weighting. The weighting describes the proximity between the connected nodes in terms of how closely semantically related the words of the connected nodes are.
When the word model is searched for related words in step 102c according to the user selected category label in step 102b, the selected category label node is identified and matched in the word model. The related words are represented by word nodes that are connected, either directly or indirectly (through multiple connections) to the selected category label. These related words are also identified.
An example of the semantic structure of nodes 300 is shown in Figure 3. Figure 3 shows what the structure may look like for the category label node 301 'Capital'. Words related to the selected category node 301 are represented by nodes 302. For the example of 'Capital', the related words in the semantic structure of nodes 300 may include 'capital (wealth)' and 'capital letter'. These related words may further relate, and thus be connected, to words such as 'endowment', 'means' and 'assets', for instance. Each of the nodes in the semantic structure 300 is connected to other nodes by a connection 303. As discussed above, the connection 303 may be assigned a weighting.
In one embodiment of the present invention, the weighting of each connection to each word node connected to the category label node may be retrieved. The weighting may be a measure between 0 and 1, but it is to be understood that the weighting may be over any range of numbers.
The weighting of each connection may be determined based on the type of connections between the category label and the related word. Some connections, such as dictionary definitions, may be given more weighting than meronyms, for example.
It is to be understood that the weighting of each connection between nodes may be determined by any mathematical function dependent on any or all of the relationships between the category label and related words.
As discussed with reference to Figure 1, the weighting of each connection, or the ranking of semantic similarity in the word model, takes into account the number of steps taken between the matched node and a connected node, and how frequently the connected node is found by the search of the word model.
In step 102d of the method according to Figure 2, a sub-set of the related words, corresponding to sub-categories, are provided back to the user from the word model, such that the user can select none, some, or all of the related words as relevant and/or irrelevant to the topic or area that the user would like to focus the search on.
In this embodiment of the invention, in order to provide a sub-set of related words back to the user, from the related words identified as being connected to the selected category in step 102c, a decision is made depending on the weighting of the connection from the selected category label to each of the identified related words.
For direct connections between the category label and a particular related word, the weighting associated with the particular related word is simply the weighting of the connection retrieved in step 102c. For indirect connections between the category label and a related word, the weighting is compounded based on the number of connections between the category label and the related word. For instance, if the category label A connects through words B and C to connect to D, in the sequence A-B-C-D, then the overall weighting of A-D is weaker than a direct connection between A and E, in the sequence A-E. In order to achieve this, the weightings of indirect connections may be multiplied together, provided the weightings are between 0 and 1. In the above example, the connection between A-E may be weighted with the value 0.5, and the connection between A-D may have a compound weighting of 0.125, if the weighting is 0.5 per connection.
Once the compound weightings of indirect connections to related words are determined, the related words are ranked according to their weightings. A sub-set of the related words may then be selected to be provided back to the user according to this weighting ranking. The sub-set of related words selected may be the related words with the Y highest ranked weightings, where Y is a positive integer. Alternatively, the sub-set of related words selected may be the related words with a weighting above a predetermined weighting threshold value.
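The compounding and threshold selection described above can be illustrated as follows; the weights, threshold value and function names are illustrative assumptions.

```python
def compound_weight(path_weights):
    """Compound weighting of an indirect connection: the product of the
    individual connection weightings (each in [0, 1]) along the path."""
    weight = 1.0
    for w in path_weights:
        weight *= w
    return weight

# The A-B-C-D example above, with a weighting of 0.5 per connection:
print(compound_weight([0.5, 0.5, 0.5]))   # 0.125
# The direct A-E connection:
print(compound_weight([0.5]))             # 0.5

def select_subcategories(weightings, threshold=0.2):
    """Return related words whose (compound) weighting clears the
    predetermined threshold, highest weighted first."""
    return sorted((word for word, w in weightings.items() if w >= threshold),
                  key=weightings.get, reverse=True)

print(select_subcategories({"venture capital": 0.5, "state capital": 0.125,
                            "capital punishment": 0.05}))
```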
Once the sub-set of related words are selected to be provided back to the user, they are provided to the user as sub-categories corresponding to the category label selected by the user in step 102b.
In an alternative embodiment of the invention, the sub-category labels of each category label are predetermined, according to the predetermined connections and relationships within the word model. In this embodiment, the optional steps of identifying related words and ranking weightings in order to determine what related words are provided back to the user are not performed in order to provide sub category labels.
When the user is provided with the sub-category labels, the sub-category labels can be ranked or scored by the user, to indicate whether each sub-category is relevant or irrelevant to the topic or area they would like to focus the search on. The user may select whether each sub-category label is relevant or irrelevant to their focus area. If the user selects whether each sub-category is relevant or irrelevant, then this information is also used to inform a re-building of the word model to form the list of words to be translated into a mathematical representation in step 103.
Once the user selects whether a particular sub-category label is relevant or irrelevant to the topic or area they would like to focus the search on, the word model is rebuilt in that it automatically updates to reflect the user's choices.
Rebuilding the word model involves editing some of the connections between nodes in the word model. If a particular sub-category is deemed irrelevant by the user, a connection to the sub-category node from the category node may be deleted, for example.
However, it is preferable that when the word model is rebuilt the weightings of each of the connections are recalculated. If the user selects that a particular sub-category label is relevant, the node corresponding to the sub-category label becomes 'selected' itself, like the node corresponding to the originally selected category label. The weightings of direct and indirect connections from the sub-category label node may be recalculated according to the selection of the sub-category label node. For example, the selected sub-category node may have previously had a connection weighting of 0.5 with the originally selected category node.
After the user selects the sub-category label, the connection weighting between the sub-category label node and the category label node may be recalculated as 1.
Therefore, when a user selects that a sub-category is relevant, the word model is rebuilt such that the connection to the sub-category node from the original category node is favourably weighted. This in turn means that when compound weightings are calculated for indirect connections to nodes related to the sub-category node, the compound weightings are also recalculated more favourably.
On the other hand, if a user selects a sub-category label as irrelevant, the node corresponding to the sub-category label is not 'selected' like the node corresponding to the originally selected category label. The weightings of direct and indirect connections from the sub-category label node, and between the sub-category label node and the category label node, may however be recalculated unfavourably. For example, if the sub-category label node previously had a connection weighting of 0.5 with the originally selected category node, the weighting may be recalculated to 0.1 or 0 after the user selects the sub-category label in question as irrelevant. This in turn means that when compound weightings are calculated for indirect connections to nodes related to the sub-category label node, the compound weightings are also recalculated less favourably.
The user may select a plurality of sub-category labels as relevant or irrelevant to the topic or area they would like to focus the search on. When the method is performed on a computer, the option to select a sub-category label as relevant or irrelevant may be provided to the user as an interactive soft key or series of soft keys. For example, a tick or plus key may represent the 'relevant' selection option for a sub-category label, and similarly a cross or minus key may represent the 'irrelevant' selection option for a category label.
An example of a user's selection of sub-categories is shown in Figure 4. Figure 4 shows a user interface for selecting or deselecting sub-categories, according to the category label 'Coverage'. The user is presented with related words from the semantic structure 300 as sub-category labels beneath the category label 401. The user may then select if the displayed sub-category labels are relevant or irrelevant. In the example shown in Figure 4, the relevant selection 402 and the irrelevant selection 403 are represented by ticks and crosses respectively. Once the user has selected the sub-category labels as relevant or irrelevant, the word model is rebuilt as discussed above.
The user may also select a 'secondary interpretation' option for a particular sub-category label. The 'secondary interpretation' option signifies that the primary meaning of the sub-category label in question is not relevant to the focus of the search per se, but a secondary interpretation of the sub-category label in question is relevant.
The effect of the 'secondary interpretation' option is intermediate between the effects of selecting the relevant and irrelevant options discussed above. The weightings of connections to a sub-category label node that has been selected as a 'secondary interpretation' may be recalculated such that they are reduced, but not as reduced as connections to nodes that correspond to sub-category labels that are selected as irrelevant.
Furthermore, weightings of connections to a sub-category node selected as a 'secondary interpretation' may be increased.
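The three selection options can be summarised as a single reweighting rule, sketched below; the specific numbers are illustrative assumptions consistent with the examples above, as the description does not fix exact values.

```python
def recalculate_weight(current, selection):
    """Recalculate a connection weighting after a user selection.

    The exact numbers are illustrative assumptions: the description
    only says relevant selections are reweighted favourably (e.g. to 1),
    irrelevant ones unfavourably (e.g. to 0.1 or 0), and a 'secondary
    interpretation' falls somewhere in between.
    """
    if selection == "relevant":
        return 1.0
    if selection == "irrelevant":
        return 0.1                     # or 0.0: equivalent to deleting the connection
    if selection == "secondary":
        return max(0.1, current * 0.5) # reduced, but less than irrelevant
    return current
```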
Optionally, once a user selects a sub-category label, the process of recalculating the weightings and thus rebuilding the word model is automatic and occurs immediately after the selection from the user. This allows new sub-category labels to be provided to the user once they have made a selection on a previous sub-category label. These can then also be selected as relevant or irrelevant.
In an alternative embodiment, the user is not given the option to select sub-category labels and instead, sub-category labels are automatically selected to be included in the list of words to be translated into a vector set with the selected category-label according to the word model.
In step 102e of the method according to Figure 2, the list of words is formed according to the user's selection of category labels and sub-category labels.
Once the word model is rebuilt according to the user's selections as discussed above with reference to step 102d, the category label node and the sub-category label nodes, which correspond to the labels selected as relevant by the user, are selected in the word model.
These nodes form the basis of the words in the list of words which is to be translated into vectors in step 103 of the method according to the present invention.
The word model is used to expand the number of words in the list of words from the selected category labels and sub-category labels. Related words that correspond to nodes that are directly or indirectly connected to the nodes of selected category labels and selected sub-category labels may be included in the list of words. Whether a related word is included in the list of words or not is dependent on the weighting for a direct connection, or compound weighting for an indirect connection, of the connection between the node corresponding to the related word and the nodes of selected category or sub-category labels. If the weighting is above a predetermined weighting threshold value, then the related word is included in the list of words.
The predetermined weighting threshold value may be different for different types of words. Proper nouns, for instance, are usually specific and niche, and similar words can have very different meanings. Furthermore, when the user wants to focus on a particular economic entity, and selects a proper noun as a category or sub-category label, it would not be useful if the information returned by the method of the present invention did not relate specifically to that economic entity. Thus, the weighting threshold which determines what is included in the words list may be higher for connections to nodes corresponding to proper nouns than it is for verbs, adjectives, or regular nouns, for example. The method may identify proper nouns in the word model from capitalisation of the first letter of words in the word model.
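A minimal sketch of such a type-dependent threshold, assuming proper nouns are detected from a capitalised first letter as described; the threshold values themselves are illustrative assumptions.

```python
# Illustrative per-word-type thresholds: stricter for proper nouns,
# which the method may detect from a capitalised first letter.
THRESHOLDS = {"proper_noun": 0.6, "default": 0.3}

def include_in_word_list(word, weighting):
    """Decide whether a related word joins the list of words, applying
    a higher weighting threshold to proper nouns."""
    word_type = "proper_noun" if word[:1].isupper() else "default"
    return weighting >= THRESHOLDS[word_type]

print(include_in_word_list("shipping", 0.4))   # True
print(include_in_word_list("Acme", 0.4))       # False: proper noun needs 0.6
```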
Once steps 102b to 102e are performed, the words in the list of words are translated into a mathematical representation, such as vectors, according to step 103. As noted above it is to be understood that any suitable mathematical representation of the list of words may be used, including a matrix, polynomial or other mathematical function.
Figure 5 shows a more detailed view of step 103 of the method according to Figure 1. Step 103 will now be discussed in more detail with reference to Figure 5.
In step 103a, the list of words produced in step 102e is converted into a vector set.
Each word from the list of words is translated into an N dimensional vector, according to known techniques. The vector set corresponding to the list of words forms part of a NLP vector model. The NLP vector model comprises the vector set corresponding to the list of words as well as a plurality of other vectors that correspond to words not on the list of words, all of which exist in an N dimensional vector space.
The NLP vector model is a predetermined model which includes many predetermined vectors. The vector model used may be the Word2Vec model as discussed previously with reference to Figure 1. The N dimensions of each vector provide information according to the context of the word to which it relates. Optionally, N is 300, such that each vector has 300 dimensions.
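By way of example only, a pretrained 300-dimensional Word2Vec model can be loaded and queried with the gensim library as sketched below; the invention is not limited to this particular library or model, and the model name is simply one of the pretrained models shipped with gensim's downloader.

```python
# Requires the gensim package. "word2vec-google-news-300" is a pretrained
# 300-dimensional model available through gensim's downloader, standing
# in here for whichever vector model is actually deployed.
import gensim.downloader

model = gensim.downloader.load("word2vec-google-news-300")
vector = model["shipping"]                       # 300-dimensional numpy array
print(vector.shape)                              # (300,)
print(model.most_similar("shipping", topn=5))    # nearest words by cosine similarity
```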
Whereas the word model is used in step 102 to provide suggestions to the user regarding sub-category labels, based on the initial category label selected by the user, the vector model provides a mathematical way of identifying related words that relate to the same semantic concept. The vector model is therefore a more rigorous tool for defining the concept or area of focus for the search.
Similarity between vectors in the vector model, and therefore conceptual similarity between words, can be measured using the dot product or cosine distance between two vectors, as discussed with reference to Figure 1.
In step 103b of Figure 5, vectors may be stemmed according to word groups. Word groups are made up of different forms of the same word, such as past, present, future, plural and singular conjugations. These versions of words may be combined to form a single vector representative of a word group. The vectors may be combined by adding the vectors representative of each word within the word group. The vector model is thus stemmed such that fine grammatical distinctions between word forms are not preserved. This means that the size of the vector model and of the vector set corresponding to the list of words can be reduced, thus making the steps following the vector translation in step 103a more efficient.
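A minimal sketch of this stemming step, assuming the vectors for each word form are already available; the normalisation is an added implementation detail to keep cosine comparisons well behaved.

```python
import numpy as np

def word_group_vector(form_vectors):
    """Combine the vectors for different forms of the same word
    (e.g. 'ship', 'ships', 'shipped', 'shipping') into a single vector
    by adding them, then normalise the result to unit length."""
    combined = np.sum(form_vectors, axis=0)
    return combined / np.linalg.norm(combined)
```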
The vector model may include hundreds of thousands of predetermined vectors. If a word in the list of words to be translated to a vector in step 103a is not in the vector model, because, for instance, it is a compound word, then collections of similar words may be averaged to produce a new vector, or the word may be approximated by a similar vector already in the vector model.
In step 103c of Figure 5, clustering is performed in the vector model to reduce the number of vectors associated with the focus of the search.
After translation from the words list, the corresponding vector set may be very large. Typically, the vector set corresponding to the words list may include 2000 vectors after translation.
Some of these vectors may not be very closely related in terms of meaning and context. The unrelated vectors are not useful in generating the search result document and may have the detrimental effect of allowing non-useful information to be included in the document.
On the other hand, some vectors not included in the vector set may be more closely related to the concept of the focus of the document to be generated, and would be of benefit if included in the vector set.
In order to identify and remove the unrelated vectors, and to incorporate more related ones into the vector set, a clustering algorithm can be used on the vector model.
By using a clustering algorithm, neighbouring or similar vectors are iteratively compared to 'cluster centres'. The measurement used in this comparison is the dot product or cosine distance, as discussed previously. The cluster centres represent average vector values for each cluster, and are recalculated based on new additions to the cluster. The cluster centres are chosen according to the type of clustering algorithm used. It is to be understood that k-means clustering, hierarchical clustering or similar algorithms may be used in this method, as a skilled person would understand.
The number of clusters, and therefore cluster centres, may be predefined. Performing clustering may reduce the number of vectors in a cluster that defines the semantic concept of the search to approximately 100 to 200 vectors.
Once the clustering algorithm has been run to completion on the vector model, the cluster associated with the focus of the search result document to be generated is retrieved. The cluster may have fewer or more vectors than the original vector set after translation from the words list. Preferably, the cluster has 200 vectors or fewer.
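A sketch of the clustering and cluster-retrieval steps, using k-means as one of the algorithms the description permits. The cluster count and the use of scikit-learn are assumptions; vectors are unit-normalised so that Euclidean k-means approximates clustering by cosine similarity.

```python
import numpy as np
from sklearn.cluster import KMeans

def retrieve_focus_cluster(model_vectors: np.ndarray,
                           search_vectors: np.ndarray,
                           n_clusters: int = 50) -> np.ndarray:
    """Cluster the vector model and return the cluster whose centre is most
    similar (by dot product) to the mean of the search vector set."""
    mv = model_vectors / np.linalg.norm(model_vectors, axis=1, keepdims=True)
    sv = search_vectors / np.linalg.norm(search_vectors, axis=1, keepdims=True)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(mv)
    focus_centre = sv.mean(axis=0)                        # average search vector
    best = int(np.argmax(km.cluster_centers_ @ focus_centre))  # nearest centre
    return mv[km.labels_ == best]                         # vectors in that cluster
```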
The retrieved cluster is then compared to the content of the database according to step 104 of the method according to Figure 1.
Step 104 of the method according to Figure 1 will now be described in detail. In step 104, the database that is connected or uploaded to in step 101 is searched and compared with the vectors in the retrieved vector cluster.
Common words are first removed from the database content, such that the time taken to search the database content for words corresponding to the vectors in the retrieved cluster can be reduced. The common words are predetermined.
The remaining content of the database is then searched according to a 'first search' for occurrences of words that correspond to the vectors within the retrieved vector cluster.
In one embodiment, this is done by translating the vectors in the retrieved vector cluster back into a second list of words, and performing a simple search for the second list of words within the database content. This embodiment has the advantage of being quick in comparison to performing vector comparison analysis.
In an alternative embodiment, words from the database content are translated into vector format, and the vectors from the retrieved vector cluster are compared to the vectors translated from the database content. This alternative embodiment has the advantage of being able to compare the database content with the retrieved vector cluster mathematically, which is more rigorous, and provides a better measure of semantic similarity.
Optionally, the content of the database may also be searched for specific 'star' words selected by the user. These 'star' words are selected by the user during step 102 of the method, when the user selects category labels and sub-category labels. A 'star' word skips step 103, such that it is searched for in the database content regardless of its mathematical representation in the vector model. When a user 'stars' a word, he or she overrides the method such that the 'star' word is forcibly searched for in the database. It is to be understood that the term 'star' word is arbitrary and is used only to distinguish it from the relevant and/or irrelevant selected words discussed with reference to step 102 of the method.
In the first search of the database content according to step 104, when a word or group of words is matched to a vector within the vector cluster, the location of the matched word or group of words within the database is recorded. This may include recording the file, paragraph and/or sentence in which a match is observed within the database. This process repeats for every match within the database.
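A sketch of this first search over the database content; the stop-word list, the tokenisation, and the nested file/paragraph/sentence data layout are illustrative assumptions.

```python
COMMON_WORDS = {"and", "it", "the", "a", "an", "of", "to", "in", "is"}

def first_search(documents: dict, search_words: set) -> list:
    """documents maps a file name to a list of paragraphs, each paragraph
    being a list of sentences. Records (file, paragraph index, sentence
    index) for every match, after common words are removed (step 104)."""
    targets = {w.lower() for w in search_words}
    matches = []
    for fname, paragraphs in documents.items():
        for p_idx, paragraph in enumerate(paragraphs):
            for s_idx, sentence in enumerate(paragraph):
                tokens = {t.strip(".,;:").lower() for t in sentence.split()}
                if (tokens - COMMON_WORDS) & targets:
                    matches.append((fname, p_idx, s_idx))
    return matches
```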
Step 105 of the method according to Figure 1 will now be described in detail. In step 105, the matches between the retrieved vector cluster and the database content, recorded in step 104, are ranked or otherwise scored.
In order to rank or score the matches recorded in step 104, a more thorough analysis of the database content is performed in step 105.
Firstly, the common words removed in step 104 are restored to their original locations in the database content.
Optionally, common words are only replaced in the portions of the database content in which a match was recorded in step 104. Alternatively, the whole database is subject to the replacement of common words.
The portions of the database content may be files, paragraphs or individual sentences from the database. In step 105, these portions are subject to a 'second search'. The second search is more detailed than the first search of step 104, and analyses each word in the portion of the database content in which a match was recorded in step 104.
Preferably, the paragraph in which a match is observed is recorded in step 104, and this paragraph is then split into its constituent sentences in step 105 of the method for analysis. The second search measures the extent to which the match between the word or group of words in the database content and the vector cluster persists throughout each sentence of the paragraph in question. The second search therefore looks at how many matches occur in each sentence and/or paragraph of each portion of the database content.
Sentences with multiple matches, and/or words with a high semantic similarity to the retrieved vector cluster, will be ranked higher by the second search than sentences with a single isolated match that is unrelated to the rest of the sentence's subject. In other words, the second search takes semantic consistency into account, so that sentences that consistently focus on the semantics of the vector cluster are ranked higher than sentences that are vague or inconsistent.
The ranking or scoring may be a numeric rating. A portion of the rating may equate to the number of matches in a sentence and a portion of the rating may equate to semantic consistency within the sentence.
In one embodiment, the text content is broken up into paragraphs and then into sentences. Each sentence is evaluated as a whole to determine a score for its semantic relevance according to each of the chosen search concepts. The sentence scores are then aggregated to determine a combined score for each paragraph.
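The splitting and aggregation in this embodiment might be sketched as follows; the naive splitting on blank lines and full stops is an assumption, and the per-sentence scoring function is left as a parameter standing in for the semantic-relevance measure described above.

```python
def paragraph_scores(text: str, score_sentence) -> list:
    """Break text into paragraphs, then sentences; score each sentence for a
    search concept and aggregate sentence scores per paragraph (step 105).
    `score_sentence` is any callable mapping a sentence to a numeric score."""
    scores = []
    for paragraph in text.split("\n\n"):
        sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
        scores.append(sum(score_sentence(s) for s in sentences))
    return scores
```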
The score of a sentence and/or paragraph may comprise a combination of a ranking score for the search concept, and a subject-matter score for the search concept. The ranking score for the search concept ranks the sentences and/or paragraphs in a document according to their relevance to the search concept. For instance, if the search concept is 'artificial intelligence', and only two sentences and/or paragraphs in a document discuss artificial intelligence, then the two sentences and/or paragraphs will be ranked higher than those that do not discuss artificial intelligence.
The subject-matter score scores each sentence and/or paragraph based on whether the main focus of the sentence and/or paragraph is related to the search concept. For instance, if a paragraph only mentions artificial intelligence once, and the majority of the paragraph is about annual profits, then the paragraph will be scored low with respect to the search concept.
By combining these two scoring techniques, the method of the present invention ensures that paragraphs and/or sentences that score highly are ones that discuss the concept of the search and focus mainly on the concept of the search.
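One way to combine the two scoring components, sketched under the assumption of simple token matching; the exact weighting of the ranking and subject-matter parts is not specified in the description.

```python
def combined_score(tokens: list, concept_words: set) -> float:
    """Score a sentence or paragraph for a search concept (step 105).
    Ranking component: how strongly the concept appears (match count).
    Subject-matter component: how much of the text is about the concept."""
    hits = sum(1 for t in tokens if t.lower() in concept_words)
    ranking = float(hits)
    subject_matter = hits / max(len(tokens), 1)
    # A portion scores highly only if the concept is both present and the
    # main focus, mirroring the combination described above.
    return ranking * subject_matter
```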
Step 106 of the method according to Figure 1 will now be discussed in more detail. In step 106 of the method, the ranking or scoring of portions of content from the database is used to inform which portions of the database content are included in the search result document. The search result document is intended to focus on the concept of the user's initial selections of categories and sub-categories according to step 102 of the method, and as such, the content to be included in the search result document should score highly in step 105 of the method.
The search result document may include whole sentences or paragraphs from the database content according to the ranking or scoring of sentences and paragraphs in step 105.
Whether a sentence or paragraph from the database content is included in the search result document may be decided by comparison to a predetermined threshold score. If the score of a sentence or paragraph according to step 105 is higher than the predetermined threshold score, then said sentence or paragraph is to be included in the search result document. For instance, the threshold may be a score of 2. A sentence with 3 matches to words within the retrieved vector cluster may be given a score of 3 according to step 105, provided a score of 1 is given per word match in a sentence. Since the score of the sentence, 3, is greater than the predetermined threshold score, 2, the sentence is designated for being included in the search result document.
Alternatively, whether a sentence or paragraph from the database content is included in the search result document may be decided according to a ranking of the scores for sentences or paragraphs from step 105. In particular, the highest-scoring X sentences or paragraphs from step 105 of the method may be included in the search result document, where X is a positive integer.
The order of sentences and/or paragraphs determined to be in the search result document may also be decided according to this ranking, with the highest ranked sentence or paragraph appearing first, and the lowest ranked sentence or paragraph appearing last. It is to be understood that the ordering of sentences and/or paragraphs in the search result document may however be set by the user, or may be predetermined according to the subject matter of the sentences and/or paragraphs.
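A sketch of the selection and ordering logic for the search result document; the threshold value, the top-X fallback and the data shapes are assumptions.

```python
def select_portions(scored: list, threshold=None, top_x=None) -> list:
    """scored: list of (text, score) pairs from step 105. Select portions by
    threshold or by top-X ranking, then order highest score first (step 106)."""
    if threshold is not None:
        chosen = [(t, s) for t, s in scored if s > threshold]
    else:
        chosen = sorted(scored, key=lambda p: p[1], reverse=True)[:top_x]
    return [t for t, s in sorted(chosen, key=lambda p: p[1], reverse=True)]

# e.g. with threshold 2, a sentence scoring 3 is included, matching the
# worked example above: select_portions([("s1", 3), ("s2", 1)], threshold=2)
```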
Once the database content for inclusion in the search result document is decided, by comparison to the predetermined threshold score or by being in the top X sentences and/or paragraphs, then the search result document is generated. The search result document includes a direct copy of the content for inclusion from the database content.
The user can then access the search result document, in order to obtain information on the area on which they initially intended to focus, according to the selections made in step 102 of the method.
An example of a portion of text to be included in the search result document can be seen in Figure 6. Figure 6 shows highlighted portions of text 601 that have been scored according to their relevance to the category label, through comparison to the set of vectors. Each of the highlighted portions of text is attributable to a source within the database content.
Optionally, once the method has been completed and the result document has been produced, the method according to the present invention may further comprise rebuilding the word model and the vector model according to the user's previous choices. In particular, the selections of the user in step 102 may be recorded and suggested as selectable sub-categories in a future use of the method for the same concept of search. Similarly, sub-categories deselected by the user may not be shown to the user in future uses for the same concept of search. Hence, the method may include a feedback-loop provision which iteratively modifies the weightings related to certain category and sub-category labels. In this case, the method, and in particular the word model according to the present invention, updates itself according to previous iterations of the method.
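The feedback-loop provision might be sketched as a simple weight update over category and sub-category labels; the update amounts and the dictionary storage format are assumptions for illustration.

```python
def update_label_weights(weights: dict, selected: set, deselected: set,
                         step: float = 0.1) -> dict:
    """Rebuild word-model weightings from the user's previous choices:
    labels the user selected gain weight, deselected labels lose weight,
    so future suggestions for the same search concept reflect past use."""
    for label in selected:
        weights[label] = weights.get(label, 0.0) + step
    for label in deselected:
        weights[label] = weights.get(label, 0.0) - step
    return weights
```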
It is to be understood that the method discussed above with reference to Figures 1 to 6 may be implemented as a computer program on a computer device or on a computer readable medium with instructions written thereon for performing the method.
According to a second embodiment of the invention, a system 700 for performing the method of searching for text content and producing a relevance score is discussed with reference to Figure 7. The system 700 may comprise one or more computer terminals. The system 700 includes a user interface 701, a database module 702, a word model module 703, a mathematical model module 704, a database comparison module 705, and a report generator module 706.
The system 700 performs the method according to the first embodiment of the invention discussed with reference to Figure 1. The role of the system 700 in performing the method will now be discussed.
Firstly, a user 707 interacts with the system 700 through the user interface module 701.
The user 707 interacts with the user interface module 701 to connect to the database module 702. The database module includes the database, comprising one or more documents. These documents are in the portable document format (PDF) or the like. The database may initially be empty, in which case the user 707 may upload, save or otherwise input documents to the database through the user interface module 701. Alternatively, the database module 702 may be pre-defined, with one or more documents pre-saved or uploaded to the database. The database within the database module 702 may be updated at any time by the user 707 through the user interface module 701. The database module 702 may be connected to from the user interface module 701 via the internet or similar methods of communication; in this case, the database may not form part of the system 700. Alternatively, the database module 702 may be managed on a server or within the system 700.
The user 707 also interacts with the user interface module 701 to input a selection of at least one category label. The user 707 selects at least one category label according to what topic or area they would like to focus the search on.
Once the user 707 selects a category label, the user interface module 701 communicates with the word model module 703. The word model module 703 contains the word model. The word model is searched according to the category label selected by the user 707.
In particular, the category label is connected to at least one sub-category label in the word model. When a specific category label is selected by the user 707 in the user interface module 701, the sub-category labels that are connected to the category label are retrieved from the word model in the word model module 703. The sub-category labels retrieved from the word model may then be provided back to the user 707 via the user interface module 701.
When the sub-category labels are provided back to the user 707, the user 707 can select the sub-category labels in the user interface module 701 depending on what area they would like to focus the search on. The user 707 may also identify whether the retrieved subcategory labels are relevant, irrelevant, or a secondary interpretation of the focus of the search, by interacting with soft-keys within the user interface module 701.
Once the user 707 selects at least one sub-category label in the user interface module 701, and optionally specifies whether the sub-category label is relevant, irrelevant or of a secondary interpretation, the at least one sub-category label is then identified in the word model of the word model module 703. The selected words in the word model now include the originally selected at least one category label and the selected sub-category labels. The word model may then be searched and expanded so that the set of selected words also includes words that are similar in meaning or otherwise related to the selected category and sub-category labels.
Once the word model is expanded, the selected and related words in the word model are used to form a words list, which is sent to the mathematical model module 704 from the word model module 703. The words list is then translated into a mathematical representation such as numerical vectors in a natural language processing (NLP) vector model, by the mathematical model module 704.
Once the selected words in the word model are translated to vectors in the NLP vector model, a clustering algorithm is used to either expand or contract the number of vectors to a predetermined number of vectors. The cluster produced by the clustering algorithm includes some vectors according to the selected words from the word model. Vectors that do not correspond to the selected words originate from the NLP vector model, and are included in the cluster or disregarded according to similarity to the vectors that correspond to the selected words. The measure of similarity within clusters is determined using the dot product.
The vector cluster produced by the clustering algorithm in the mathematical model module 704 is sent to the database comparison module 705. The database comparison module 705 communicates with both the mathematical model module 704 and the database module 702. The database in the database module 702 is searched and compared with the vectors in the vector cluster of the mathematical model module 704. Two comparisons are performed by the database comparison module 705. Firstly, the database comparison module 705 performs a first search of the database content with common words removed. Common words include words such as 'and', 'it' and 'the', for example. The first search looks for occurrences of words that correspond to the vectors within the vector cluster. In the first search, when a word or group of words is matched to a vector within the vector cluster, the location of the matched word or group of words within the database is recorded. This may include recording the file, paragraph and/or sentence in which a match is observed within the database. This process repeats for every match within the database.
The database comparison module 705 then performs a second search of the database within the database module 702. In the second search, the portions of the database content identified as containing matches in the first search are ranked or otherwise scored according to their similarity with the vector cluster from the mathematical model module 704. The portions may be files, paragraphs or individual sentences of content from the database. In the second search, the common words of the database portions are not removed as they were in the first search.
The second search measures the extent to which the match between the word or group of words in the database content and the vector cluster persists throughout each sentence of the paragraph of the database portions. Sentences with multiple matches, and/or words with a high semantic similarity to the vector cluster, will be ranked higher by the second search than sentences with a single isolated match that is unrelated to the rest of the sentence's subject. In other words, the second search takes semantic consistency into account, so that sentences that consistently focus on the semantics of the vector cluster are ranked higher than sentences that are vague or inconsistent.
Once the second search is completed by the database comparison module 705, and the portions of matches between the vector cluster and the database are ranked or scored, the ranking or scoring of the portions is used to inform which portions are included in a search result document. The rankings and scores of the database portions are provided from the database comparison module 705 to the report generator module 706.
The report generator module 706 determines what portions of the database content are included in the search result document, based on the scoring provided by the database comparison module 705. Once the database content for inclusion in the search result document is decided, the document is generated by the report generator module 706. The search result document includes a direct copy of the portions of the database content.
The user 707 can then access the search result document, in order to obtain information on the area on which they initially intended to focus, according to the selections the user made on the user interface module 701.
The system 700 uses the combination of a word model and vector model to find and reproduce text from a database into a concise document. The use of both a word model and a vector model allows the system 700 to perform this task accurately and efficiently without the need for excessive training data.
It is to be understood that the system 700 may be implemented on a computer device, such as a personal computer, mobile phone, tablet or the like. Furthermore, the modules 701 to 706 of the system 700 may be implemented on the same or separate computer processors.

A detailed flow diagram of the method according to the invention is given in Figure 8. Figure 8 corresponds to Figure 1 described previously. In step 801 of Figure 8, the method includes obtaining a first input from a user according to a topic for a search. In step 802 of Figure 8, the method includes searching a word model for candidate words semantically related to the first input and outputting the candidate words. In step 803 of Figure 8, the method includes obtaining a selection from the user of one or more of the candidate words. In step 804 of Figure 8, the method includes forming a selected words set for inclusion in the search based on the selection from the user of the one or more candidate words. In step 805 of Figure 8, the method includes translating the selected words set into a set of mathematical representations based on the selected words set. In step 806 of Figure 8, the method includes searching a document database for one or more documents having a word or group of words having individual mathematical representations corresponding to members of the set of mathematical representations. Finally, in step 807 of Figure 8, the method includes retrieving from the document database a portion of the one or more documents containing the word or group of words that correspond to members of the set of mathematical representations.

The description above is intended to provide an illustration of an exemplary embodiment of the claimed invention, and not to limit the scope of protection afforded by the attached claims. Various modifications of the embodiment within the wording of the claims will occur to the skilled person.

Claims (18)

1. A computer implemented method of text based searching, the method comprising: obtaining a first input from a user according to a topic for a search; searching, in a predetermined word model, for candidate words semantically related to the first input, and outputting a plurality of candidate words semantically related to the first input; obtaining a selection from the user of one or more of the candidate words; forming a selected words set for inclusion in the search based on the selection from the user of the one or more candidate words; translating the selected words set into a set of mathematical representations based on the selected words set; searching a document database for one or more documents having a word or group of words having individual mathematical representations corresponding to members of the set of mathematical representations; and retrieving from the document database, a portion of the one or more documents containing the word or group of words that correspond to members of the set of mathematical representations.
2. The computer implemented method of claim 1, wherein the set of mathematical representations is a set of vectors in a natural language processing vector model.
3. The computer implemented method of claim 2, wherein forming the selected words set comprises: calculating a weighting value for each candidate word, wherein the weighting value is a measure of semantic similarity between the first input and the candidate word; and if the weighting value for a particular candidate word is greater than a predetermined weighting threshold, including the particular candidate word in the selected words set for inclusion in the search.
4. The computer implemented method of any of claims 2 to 3, wherein forming the selected words set comprises: updating the respective weighting value of each of the one or more candidate words based on the selection from the user.
5. The computer implemented method of claim 4, wherein obtaining a selection from the user of one or more of the candidate words further comprises: obtaining from a user a positive selection for one or more candidate words, denoting that the one or more candidate words are relevant to the topic for the search; or obtaining from a user a negative selection for one or more candidate words, denoting that the one or more candidate words are irrelevant to the topic for the search; and in the case of a positive selection of one or more candidate words, consequently increasing the weighting value for the one or more candidate words and for the words semantically related to the one or more candidate words in the word model; or in the case of a negative selection of one or more candidate words, consequently decreasing the weighting value for the one or more candidate words and for the words semantically related to the one or more candidate words in the word model.
6. The computer implemented method of any of claims 2 to 5, further comprising: performing clustering on the natural language processing vector model to modify the vector set prior to searching the document database; wherein the vector set is modified to: include more vectors from the natural language processing vector model depending on the similarity between the vectors in the vector set and vectors from the vector model; or exclude vectors from the vector set, depending on the similarity between the vectors within the vector set.
7. The computer implemented method of claim 6, wherein the similarity between vectors is determined by calculating the dot product or cosine distance of the vectors.
8. The computer implemented method of any of claims 2 to 7, wherein searching the document database further comprises: performing a first search of the document database, wherein the first search identifies occurrences of words that correspond to words designated by vectors in the vector set; recording the location in the document database of occurrences of the words identified in the first search of the document database; and performing a second search of portions of the document database corresponding to the locations identified in the first search, wherein the second search calculates the similarity between the vectors in the vector set and the vectors in the natural language processing vector model that correspond to the words or groups of words in the identified portion of the document database.
9. The computer implemented method of any of claims 2 to 8, wherein searching a document database further comprises: assigning a score to portions of documents in the document database based on the similarity between the vector set and the vectors of words or group of words appearing in the portion of the document according to the natural language processing vector model, and wherein the retrieving step is performed on the basis of the assigned score.
10. The computer implemented method of claim 9, wherein the retrieving step comprises: outputting one or more portions of documents according to the assigned score, along with, for each retrieved portion of a document, the respective assigned score, wherein the portion of a document is a paragraph of a document and/or a sentence.
11. The method of searching for text according to claim 10, wherein the retrieving step further comprises: either retrieving each portion of the document database that has a score above a predetermined scoring threshold; or retrieving the top X scoring portions of the database, wherein X is a real positive integer.
12. The computer implemented method according to any of claims 9 to 11, wherein the score of a portion of a document is assigned according to at least one of: how many occurrences of words identified in the first search there are in the portion of the document; how similar the vectors from the vector model, corresponding to each word in the portion of the database, are to the vectors of the vector set; and how consistently similar the vectors are in the portion of the document as a whole.
13. The computer implemented method of claim 8, wherein common words are removed from the database prior to the first search of the database.
14. The method of claim 13, wherein, before the second search, the common words are reintroduced to the portions of the database including an occurrence of a word identified in the first search.
15. A non-transitory computer readable storage medium including instructions stored thereon executable by a processor, the instructions configured to carry out the method according to any of claims 1 to 14.
16. A computer system for text based searching, the system comprising: a user interface module configured to obtain a first input from a user according to a topic for a search; a word model module configured to search, in a predetermined word model, for candidate words semantically related to the first input, and outputting a plurality of candidate words semantically related to the first input; wherein the user interface module is configured to obtain a selection from the user of one or more of the candidate words; wherein the word model module is configured to form a selected words set for inclusion in the search based on the selection from the user of the one or more candidate words; a mathematical model module configured to translate the selected words set into a set of mathematical representations based on the selected words set; a document database comparison module configured to search a document database for one or more documents having a word or group of words that have individual mathematical representations corresponding to members of the set of mathematical representations; and wherein the document database comparison module is configured to retrieve from the document database, a portion of the one or more documents containing the word or group of words that correspond to the members of the set of mathematical representations.
17. The computer system for text based searching according to claim 16, wherein the set of mathematical representations is a set of vectors in a natural language processing model.
18. The computer system of claim 17, configured to perform the method of any of claims 2 to 14.