WO2012143839A1 - Computerized system and method for processing and building search strings - Google Patents

Computerized system and method for processing and building search strings

Info

Publication number
WO2012143839A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
signature
database
text
Prior art date
Application number
PCT/IB2012/051870
Other languages
English (en)
Inventor
Abraham Carel GREYLING
Original Assignee
Greyling Abraham Carel
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Greyling Abraham Carel
Publication of WO2012143839A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • This invention relates to query processing, and more specifically relates to the semantic analysis of search query strings to generate multiple alternative strings to facilitate improved computerized search.
  • search engines use text-based input search queries. To return accurate search results, the search engine must be able to apply some form of language interpretation to the search string entered by a user. The search engine must also apply language interpretation when it indexes web pages or other documents, so that the search string can be matched to web pages by a ranking algorithm that only delivers the most relevant results to a user.
  • the "Semantic Web" refers to a structure for the Internet in which machine-readable data (or meta-data) is available that tells a computer unambiguously what a web page, a document or a topic is about. This meta-data enables computers to understand the meaning of information directly, without the interpretation problems that plague current search engines.
  • certain defined domains - for example, airline booking systems - operate in this way.
  • JFK in an airline booking system refers only to John F Kennedy International airport in New York, not to the former US president or other terms that may have these three letters as their acronym.
  • Some search engines, such as Bing™, identify categories based on the search terms, and a user is able to filter out irrelevant results by only selecting certain categories.
  • a search for "chicken" might identify categories of "animals" and "recipes" and allow the user to filter so as to only search within one of the two categories.
  • the goal of the Internet itself being semantic has not yet been realized, despite ongoing efforts to index and associate concepts on the Internet.
  • the main problem is the complexity of the task involved in performing such identification and association on the open Internet, which requires a huge structured database to be built.
  • each unique word identified in the text is stored in the database and is associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
  • processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text;
  • Still further features of the invention provide for the method to include an additional step of, immediately after extracting text and before forming a word relationship database, parsing the text into sentence portions which start and end with sentence delimiters.
  • Still further features of the invention provide for inputting the multiple alternate search strings into a search engine simultaneously or in rapid succession and comparing the results of each separate search so as to rank the overall results and present those results which were obtained in the greatest number of separate searches as the most relevant search results.
  • the language of the text is preferably identified so that separate word relationship databases and signature databases can be built for each separate language.
  • the invention extends to a system for processing an input search string and building multiple alternative search strings, comprising:
  • a processor in the form of a server which is able to access a multitude of web pages or other documents through the Internet and extract text, the text including words;
  • each unique word in the word relationship database being associated with a number of fields which each represent other words which were found to occur adjacent to that word in the text, each field also including a frequency sub-field which indicates how frequently that other word was found to occur adjacent to the associated word;
  • a signature database coupled to the server, the signature database being formed by the server processing the word relationship database so as to determine a forward signature and reverse signature for each word, the forward signature including a ranked list of the words that were found to come after that word in the text, and the reverse signature including a ranked list of the words that were found to come before that word in the text, and combining the forward and reverse signatures of each word to form an ambidextrous signature for each word that is stored in the signature database;
  • the server to be configured to input the multiple alternative search strings into a search engine simultaneously or in rapid succession, and to compare the results of each separate search so as to rank the overall results and present those results which are obtained in the greatest number of separate searches as the most relevant search results.
  • Figure 1 is a flowchart that illustrates the overall steps performed in obtaining improved search results where multiple alternative search strings are generated according to the method of the invention
  • Figure 2 is a schematic diagram showing the system for generating multiple alternative search strings according to the invention.
  • Figure 3 is a flowchart that illustrates the steps performed in creating and updating a word relationship database
  • Figure 4 is a flowchart that illustrates the steps performed in creating a signature database based on the word relationship database; and Figure 5 is a flowchart that illustrates the steps performed in using the signatures in the signature database to create multiple alternative search strings according to the method of the invention.
  • Figure 1 is a flowchart that illustrates the overall steps performed and results obtained by the method and system of the invention.
  • a search string is input into the system of the invention.
  • multiple alternative search strings are generated, with each search string being semantically similar to the original search string and containing correct grammar.
  • the original search string and each alternative search string are then input into a search engine at the next stage (24).
  • the search engine may be any computerized search engine, including a web based internet search engine that facilitates keyword searching.
  • the results of each internet search obtained using the search strings are obtained at the next stage (26). These results are then combined at the next stage (28) so as to identify the most relevant search results and output those results at stage (30).
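  • The combining stage described above can be sketched as a simple vote count. This is a minimal illustration, not the applicant's implementation: the function name `rank_by_agreement` and the example URLs are hypothetical, and the alphabetical tie-break is an assumption added only to make the ordering deterministic.

```python
from collections import Counter

def rank_by_agreement(result_lists):
    """Rank results by how many separate searches returned them."""
    counts = Counter()
    for results in result_lists:
        # Count each result once per search, even if one search lists it twice.
        counts.update(set(results))
    # Most widely returned results first; ties broken alphabetically (assumption).
    return sorted(counts, key=lambda r: (-counts[r], r))

# Hypothetical result lists for the original string and two alternative strings.
searches = [
    ["a.example/1", "b.example/2", "c.example/3"],
    ["b.example/2", "c.example/3"],
    ["b.example/2", "d.example/4"],
]
ranked = rank_by_agreement(searches)
print(ranked)
```

  • Here the result returned by all three searches is presented first, followed by the result returned by two of them.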
  • Figure 2 illustrates a system (100) which enables the multiple alternative search strings to be generated, which was stage (22) in Figure 1.
  • the system includes a processor in the form of a server (102) which is able to access a multitude of web pages (104) or other documents through the Internet (106) by means of web crawling programs (not shown).
  • the server is also coupled to a word relationship database (108) and a signature database (110).
  • the word relationship database is built from content obtained from the Internet by the web crawling programs, and the word relationship database is used to create the signature database as will be explained below.
  • the first stage in the method of the invention is to create a word relationship database where every word in a particular language is associated with words adjacent to it.
  • a flow chart illustrating the steps to create and update a word relationship database is shown in Figure 3.
  • "The Semantic Web is a technology waiting to be actualized. Application areas are experiencing intensified interest due to the rapid growth in the use of the Web. Information content technologies (such as search engines) are constantly being improved, with the hope of the actualization of powerful search technologies."
  • The following ASCII characters are generally regarded as sentence delimiters:
  • The delimiters in the text are "(", ")" and ".". The text can therefore be parsed into the following sentence portions: a) The Semantic Web is a technology waiting to be actualized; b) Application areas are experiencing intensified interest due to the rapid growth in the use of the Web
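  • The parsing step can be sketched as a split on a set of delimiter characters. The delimiter set used here is an assumption (the full list is not reproduced in this text), and `split_sentence_portions` is a hypothetical helper name.

```python
import re

# Assumed delimiter set for this sketch; the source's full list is not available.
DELIMITERS = "().!?"

def split_sentence_portions(text, delimiters=DELIMITERS):
    """Split text into sentence portions at any delimiter character."""
    pattern = "[" + re.escape(delimiters) + "]"
    # Drop empty pieces, e.g. the one after the final full stop.
    return [p.strip() for p in re.split(pattern, text) if p.strip()]

text = (
    "The Semantic Web is a technology waiting to be actualized. "
    "Application areas are experiencing intensified interest due to the "
    "rapid growth in the use of the Web. Information content technologies "
    "(such as search engines) are constantly being improved, with the hope "
    "of the actualization of powerful search technologies."
)
portions = split_sentence_portions(text)
for portion in portions:
    print(portion)
```

  • With these assumed delimiters the example text yields five sentence portions, the parenthesised phrase forming a portion of its own.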
  • a word relationship database can then be formed which shows how adjacent words are related to each other in the body of text analyzed by the web spiders.
  • the word relationship database is formed as a two-dimensional matrix in which each row represents a particular word (the "row word"), and has a number of row fields that represent specific words that were found to occur after the row word in the body of text that was analyzed.
  • Each row field also includes an indication of the frequency, or number of times, that the word was found to occur after the row word in the body of content. This can schematically be illustrated as follows:
  • <RowWord1> <WordAfter1>, <Freq1>
  • <RowWord2> <WordAfter1>, <Freq1>
  • <RowWord3> <WordAfter1>, <Freq1>
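  • The row layout shown schematically above can be sketched with nested counters. This is an illustration under assumptions: words are used directly as keys instead of reference numbers, tokenisation is a simple lowercase letter match, and `build_word_relationships` is a hypothetical name. The two input fragments are taken from the example text.

```python
import re
from collections import Counter, defaultdict

def build_word_relationships(portions):
    """Map each row word to a Counter of the words found directly after it."""
    rows = defaultdict(Counter)
    for portion in portions:
        words = re.findall(r"[a-z']+", portion.lower())
        for word in words:
            rows[word]  # create a row even when no word follows this one
        for word, following in zip(words, words[1:]):
            rows[word][following] += 1
    return rows

portions = [
    "The Semantic Web is a technology waiting to be actualized",
    "with the hope of the actualization of powerful search technologies",
]
rows = build_word_relationships(portions)
# most_common() returns the row fields ranked by frequency.
print(rows["actualization"].most_common())
print(rows["actualized"].most_common())
```

  • Adjacency is counted only within a sentence portion, so the last word of a portion gets a row with no row fields.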
  • Each word is assigned a unique reference number, and within each row the row fields are ranked according to frequency. For example, consider a very small portion of the two-dimensional matrix: only the words that fall alphabetically between "actuality" and "actualizes". From sentence portion (e) above, only the word "of" was found to follow the word "actualization". In sentence portion (a) no word was found after "actualized". If the only text input into the two-dimensional matrix were the sentence portions (a) - (e) above, the matrix portion might look as follows:
  • Each row represents a unique word (a "row word") and the rows are alphabetically sorted with incrementing reference numbers, words 501-504 ("actuality" - "actualizes").
  • Each row word has a number of row fields after it.
  • word 501 "actuality"
  • for words 503 and 504 there are only 2 row fields following each of these words.
  • Each row field includes two items of information, the reference number of a word that was found to come after it in the body of searched text, and a frequency number which shows the number of times that referenced word was found to come after the row word in the body of searched text.
  • This extract from the word relationship database is, of course, greatly simplified for illustrative purposes.
  • the word relationship database increases in size with the frequency numbers growing rapidly and the number of row fields also growing, although not as quickly.
  • a number of techniques can be employed, such as techniques that gradually reduce the frequency fields so that only those field words that are frequently incremented will develop large frequencies.
  • Algorithms for periodically discarding the row fields that have very low frequencies can also be used so as to keep the number of row fields in check, in addition to algorithms that compress the matrix density (the number of row fields multiplied by their frequencies).
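  • One possible form of the size-control techniques mentioned above is a periodic pass that scales every frequency down and discards row fields that fall below a floor. The decay factor, the floor and the function name `decay_and_prune` are all assumptions made for this sketch; the source does not specify them.

```python
def decay_and_prune(rows, decay=0.5, min_freq=1.0):
    """Scale all frequencies by `decay` and drop fields below `min_freq`."""
    pruned = {}
    for row_word, fields in rows.items():
        pruned[row_word] = {
            word: freq * decay
            for word, freq in fields.items()
            if freq * decay >= min_freq
        }
    return pruned

# Illustrative frequencies only.
rows = {"spring": {"water": 40, "break": 6, "zzyx": 1}}
smaller = decay_and_prune(rows)
print(smaller)
```

  • Fields that keep being incremented between passes survive the decay, while rarely seen fields eventually drop out.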
  • Once populated with content from a large number of web pages and other documents, the word relationship database provides an accurate view of the relationship that each word has to the words that come after it in a particular language (such as English), provided of course that the bulk of the content accessed by the web spiders is not garbled or meaningless, which it should not be if ordinary content on the Internet is being accessed.
  • a signature database is created that is based on the word relationship database.
  • Figure 4 illustrates the steps to create a signature database based on the word relationship database.
  • the words in the row field of the word relationship database only indicate words that come after the row word.
  • Each row can therefore be thought of as a signature for the words that follow the row word, where the signature tells you the relationship of the row word to other words following it, ranked according to popularity.
  • the word relationship database is queried to obtain the "reverse signature" of every word, i.e. an indication of the popularity of words that precede the word of interest. This can be done by searching the entire word relationship database for every instance where the word of interest appears in a row field, and identifying the row word associated with that row field as the preceding word.
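  • The forward and reverse lookups can be sketched as follows. The forward signature is read straight from a word's own row, while the reverse signature scans every row for fields that contain the word of interest. The function names and the frequencies are illustrative assumptions.

```python
def forward_signature(rows, word):
    """Words found after `word`, most frequent first."""
    fields = rows.get(word, {})
    return sorted(fields, key=lambda w: -fields[w])

def reverse_signature(rows, word):
    """Words found before `word`, most frequent first."""
    # Scan every row; a row word precedes `word` if `word` is one of its fields.
    preceding = {
        row_word: fields[word]
        for row_word, fields in rows.items()
        if word in fields
    }
    return sorted(preceding, key=lambda w: -preceding[w])

# Illustrative word relationship rows: row word -> {following word: frequency}.
rows = {
    "hard": {"work": 9, "time": 4},
    "to": {"work": 15, "be": 30},
    "work": {"with": 12, "on": 7, "hard": 2},
}
print(forward_signature(rows, "work"))
print(reverse_signature(rows, "work"))
```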
  • the forward and reverse signatures of each word are combined into an "ambidextrous signature".
  • the information about whether the field word came before or after the word of interest is discarded, and the number of times each field word came before or after the word is also discarded, while nevertheless maintaining a ranking based on the number of times each field word came before or after the word.
  • the "forward signature" of "work" given at (4) and the "reverse signature" of "work" given at (6) are combined into the following "ambidextrous signature":
  • the ambidextrous signature (7) is therefore a word relationship signature which shows which words are contextually close to the word "work", in that those words often appear adjacent to the word "work" (either before or after) in the English language.
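  • A sketch of the combination step is given below. The source states that direction and counts are discarded while a frequency-based ranking is kept, but it does not say how a word seen on both sides is ranked; summing its two frequencies is an assumption of this sketch, as are the function name and the numbers.

```python
from collections import Counter

def ambidextrous_signature(forward, reverse):
    """Merge forward and reverse frequencies into one ranked word list."""
    combined = Counter(forward)
    combined.update(reverse)  # a word seen on both sides gets the summed count
    # Keep only the ranked words; direction and counts are discarded.
    return [word for word, _ in combined.most_common()]

# Illustrative frequencies for the word "work".
forward = {"with": 12, "on": 7, "hard": 2}
reverse = {"to": 15, "hard": 9}
signature = ambidextrous_signature(forward, reverse)
print(signature)
```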
  • word relationship signatures only reflect the relationship of specific words to those words that come immediately before or after them, not to more distant word relationships.
  • word relationship database and signature database are two-dimensional matrices, rather than 3-, 4- or higher-order matrices. This simplicity is important because it keeps the size of the word relationship database and signature database manageable and makes it very scalable.
  • the word relationship signatures in the signature database are used to create multiple alternative search strings that are semantically similar to an input search string and grammatically correct. This is the step that was indicated broadly by stage (22) in Figure 1 and which will now be described in detail. The various stages involved in generating the multiple alternative search strings are illustrated in Figure 5.
  • popular words are removed from the input search string.
  • Popular words are identified as those words with a total frequency in the entire word relationship database that is higher than a predetermined threshold - in other words, those words that appear very commonly in the total body of text accessed by the web crawling programs.
  • the search string "Where can I get cool spring water?".
  • the words “where”, “can”, “I” and “get” will likely be identified as popular words, with the remaining words “cool spring water” being non-popular words.
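  • The popular-word filter can be sketched as a threshold test on each word's total frequency. The frequency table, the threshold value and the name `remove_popular_words` are hypothetical; only the example search string comes from the text.

```python
import re

def remove_popular_words(query, total_frequency, threshold):
    """Keep only the words whose total frequency does not exceed the threshold."""
    words = re.findall(r"[a-z']+", query.lower())
    return [w for w in words if total_frequency.get(w, 0) <= threshold]

# Illustrative totals; real values would come from the word relationship database.
total_frequency = {
    "where": 9000, "can": 8000, "i": 9500, "get": 7000,
    "cool": 300, "spring": 450, "water": 900,
}
kept = remove_popular_words(
    "Where can I get cool spring water?", total_frequency, threshold=1000
)
print(kept)
```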
  • the non-popular words are linked in two-word groups from left to right with the last word of any preceding two-word group forming the first word of the next two-word group. In this case, there are two two-word groups, namely "cool spring” and "spring water”.
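  • The linking of overlapping two-word groups amounts to pairing each word with its right-hand neighbour. A minimal sketch, with `two_word_groups` as a hypothetical name:

```python
def two_word_groups(words):
    """Pair each word with the next one, so consecutive groups share a word."""
    return list(zip(words, words[1:]))

groups = two_word_groups(["cool", "spring", "water"])
print(groups)
```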
  • each two word group is analyzed as follows: the reverse signature of the first word and the forward signature of the second word are obtained. Then, at stage (66), the forward and reverse group signatures are combined into a single ambidextrous "word-group" signature.
  • the ambidextrous word relationship signature in (10) gives the forward and reverse relationship of the two words "cool spring" in combination, as if they were a single word.
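  • A word-group signature can be sketched by taking the words that precede the first word and the words that follow the second word, then ranking them together. Merging by summed frequency, the function name and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def word_group_signature(rows, first, second):
    """Rank the words before `first` together with the words after `second`."""
    before_first = Counter(
        {row_word: fields[first] for row_word, fields in rows.items() if first in fields}
    )
    after_second = Counter(rows.get(second, {}))
    combined = before_first + after_second
    return [word for word, _ in combined.most_common()]

# Illustrative word relationship rows: row word -> {following word: frequency}.
rows = {
    "a": {"cool": 8},
    "the": {"cool": 5, "spring": 7},
    "cool": {"spring": 3},
    "spring": {"water": 10, "break": 6},
}
group_signature = word_group_signature(rows, "cool", "spring")
print(group_signature)
```

  • The pair is thus treated as if it were a single word whose left context is that of "cool" and whose right context is that of "spring".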
  • the signature database is searched to look for close signature matches for the ambidextrous "word group" signature (10).
  • This comparison can be done in various ways. One way is to calculate a matching score between the signature (10) and each of the signatures in the signature database by an algorithm that looks for matches between the fields of the signature (10) and the fields of each of the signatures in the signature database.
  • Decreasing weighting factors can be allocated to each of the fields within the signature so that matches between fields that are further to the right count less than matches between fields that are further left.
  • the algorithm can also allocate a higher weighting factor if the word in the signature database that includes matching fields is not a common word, as these words give more information than common words such as prepositions and conjunctions.
  • the word or words that have the highest weighting factor are then identified as the words that are semantically similar to the two-word group.
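  • One way to realise the weighted comparison is sketched below. The geometric position weights, the 1.5 bonus for uncommon candidate words, and the names `match_score` and `best_matches` are all assumptions; the source only says that weights decrease to the right and that uncommon words weigh more.

```python
def match_score(group_signature, candidate_signature, decay=0.8, bonus=1.0):
    """Sum position weights for every group-signature field found in the candidate."""
    candidate = set(candidate_signature)
    score = 0.0
    for position, word in enumerate(group_signature):
        if word in candidate:
            score += decay ** position  # fields further right count less
    return score * bonus

def best_matches(group_signature, signature_db, common_words=frozenset(), top=1):
    """Return the highest-scoring words, favouring uncommon candidate words."""
    scored = []
    for word, signature in signature_db.items():
        bonus = 1.0 if word in common_words else 1.5
        scored.append((match_score(group_signature, signature, bonus=bonus), word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top] if score > 0]

# Illustrative signatures only.
group_signature = ["water", "a", "break", "the"]
signature_db = {
    "refreshing": ["a", "water", "drink"],
    "of": ["the", "a", "water"],
    "tax": ["income", "return"],
}
matches = best_matches(group_signature, signature_db, common_words={"of"})
print(matches)
```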
  • stages (64) to (70) are then repeated for each of the other two-word groups in the search string, which in this example is the second two-word group, "spring water".
  • one or more other words are identified that are semantically similar to "spring water".
  • Combining the results of both iterations yields a number of two-word strings that are each semantically similar to "cool spring water". For example, if one of the words identified as semantically similar to "cool spring" was "refreshing" and one of the words identified as semantically similar to "spring water" was "liquid", then "refreshing liquid" would be identified as semantically similar to "cool spring water".
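  • Combining the substitutes found for each two-word group can be sketched as a Cartesian product. The name `alternative_phrases` and the extra substitute "chilled" are hypothetical; "refreshing" and "liquid" are the examples used in the text.

```python
from itertools import product

def alternative_phrases(substitutes_per_group):
    """Join one substitute per two-word group, in group order."""
    return [" ".join(choice) for choice in product(*substitutes_per_group)]

phrases = alternative_phrases([["refreshing", "chilled"], ["liquid"]])
print(phrases)
```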
  • the invention includes additional steps by means of which grammatically incorrect alternative strings can be excluded. To do this, the substituted words are first substituted back into the original search string at stage (76). Then, at stage (78), each substituted word is analyzed within the original string to see whether the words preceding it and following it are words that are associated with the substituted word by a predefined degree.
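  • The neighbour check can be sketched as a lookup in the word relationship rows. Interpreting the "predefined degree" as a minimum adjacency frequency is an assumption of this sketch, as are the function name `fits_context` and the sample frequencies.

```python
def fits_context(words, index, rows, min_freq=1):
    """Check that the word at `index` is associated with both of its neighbours."""
    word = words[index]
    if index > 0:
        # Frequency with which `word` was found after the preceding word.
        if rows.get(words[index - 1], {}).get(word, 0) < min_freq:
            return False
    if index + 1 < len(words):
        # Frequency with which the following word was found after `word`.
        if rows.get(word, {}).get(words[index + 1], 0) < min_freq:
            return False
    return True

# Illustrative word relationship rows: row word -> {following word: frequency}.
rows = {"get": {"refreshing": 4}, "refreshing": {"liquid": 3}}
accepted = fits_context(["where", "can", "i", "get", "refreshing", "liquid"], 4, rows)
rejected = fits_context(["get", "liquid", "refreshing"], 1, rows)
print(accepted, rejected)
```

  • An alternative string would be excluded when any substituted word fails this check.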
  • the most relevant documents or web pages are then presented to the user first. It will be appreciated that from the perspective of the user of a search engine the invention described above is completely hidden and is carried out in the background. The user interacts with the search engine in exactly the same way as before - by typing in a search string - and the search engine generates the alternative search strings and identifies the most relevant documents to present to the user.
  • the applicant has found that the invention leads to a marked improvement in the quality of the results that are presented to a user. Irrelevant search results are excluded far more often than with existing search engines and complex sentence structures can be handled with more precision. Because multiple alternative search strings are generated based on the search string, the applicant has found that it is no longer necessary to substitute different words or attempt to re-write search strings with different sentence structures in an attempt to locate relevant results. This leads to increased user satisfaction and quicker location of relevant search results.
  • the system of the invention requires no human input to categorize and index content, nor does it have to be programmed with complex morphological or grammatical rules or built-in dictionaries.
  • the invention provides a completely autonomous and extremely scalable system that is able to build a contextual language model of any contextual language so that search strings can be interpreted more accurately by search engines, so as to deliver more relevant and targeted search results without the need to categorize or index existing content.
  • Although the invention may be applied in web-based search engines, it can also be applied in the enterprise search market where companies search their own internal documents and information.
  • any of the software components or functions described in this specification may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C++ or Perl using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions, or commands on a computer readable medium, such as a random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive, or an optical medium such as a CD-ROM.
  • Any such computer readable medium may reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for processing an input search string and building multiple alternative search strings to improve computerized search. The method comprises extracting text from web pages, forming a word relationship database in which each unique word is associated with fields representing other words that occurred adjacent to that word, processing the word relationship database to determine a forward and a reverse signature for each word, and combining the forward and reverse signatures to form a signature database. Two-word groups in the input search string are linked, and forward and reverse signatures are obtained for each two-word group. These signatures are compared with the signature database to find single words whose signatures substantially match the signature of the two-word group, and the words so identified are used as alternative words that are semantically similar to the two-word group, so as to generate alternative search strings.
PCT/IB2012/051870 2011-04-19 2012-04-16 Computerized system and method for processing and building search strings WO2012143839A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201161476917P 2011-04-19 2011-04-19
US61/476,917 2011-04-19

Publications (1)

Publication Number Publication Date
WO2012143839A1 (fr) 2012-10-26

Family

ID=47041106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/051870 WO2012143839A1 (fr) 2011-04-19 2012-04-16 Computerized system and method for processing and building search strings

Country Status (1)

Country Link
WO (1) WO2012143839A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US6850937B1 (en) * 1999-08-25 2005-02-01 Hitachi, Ltd. Word importance calculation method, document retrieving interface, word dictionary making method
US20030171910A1 (en) * 2001-03-16 2003-09-11 Eli Abir Word association method and apparatus
US20060253427A1 (en) * 2005-05-04 2006-11-09 Jun Wu Suggesting and refining user input based on original user input

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853742A (zh) * 2012-11-29 2014-06-11 北大方正集团有限公司 Retrieval device, terminal and retrieval method
CN110678860A (zh) * 2017-03-13 2020-01-10 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 System and method for verbatim text mining
CN110678860B (zh) * 2017-03-13 2023-06-09 里德爱思唯尔股份有限公司雷克萨斯尼克萨斯分公司 System and method for verbatim text mining
US11475015B2 (en) 2020-11-20 2022-10-18 Coupang Corp. Systems and method for generating search terms
CN114897576A (zh) * 2022-05-05 2022-08-12 深圳市极客智能科技有限公司 Commodity pushing method based on data analysis
CN114897576B (zh) * 2022-05-05 2024-04-19 深圳市极客智能科技有限公司 Commodity pushing method based on data analysis

Similar Documents

Publication Publication Date Title
CN106844658B (zh) Method and system for automatically constructing a Chinese text knowledge graph
US8463593B2 (en) Natural language hypernym weighting for word sense disambiguation
KR100546743B1 (ko) Language-analysis-based automatic question/answer indexing method, and question answering method and system therefor
US8229730B2 (en) Indexing role hierarchies for words in a search index
US8429184B2 (en) Generation of refinement terms for search queries
US7756855B2 (en) Search phrase refinement by search term replacement
US7113943B2 (en) Method for document comparison and selection
US20100094835A1 (en) Automatic query concepts identification and drifting for web search
US20160041986A1 (en) Smart Search Engine
US20110258212A1 (en) Automatic query suggestion generation using sub-queries
US20070185831A1 (en) Information retrieval
KR20010108845A (ko) Word cluster management apparatus and method for query processing in information retrieval
Yusuf et al. Query expansion method for quran search using semantic search and lucene ranking
JP5250009B2 (ja) Suggestion query extraction apparatus, method, and program
WO2012143839A1 (fr) Computerized system and method for processing and building search strings
Babekr et al. Personalized semantic retrieval and summarization of web based documents
Bhoir et al. Question answering system: A heuristic approach
CN111737413A (zh) Feedback-model information retrieval method, system and medium based on ConceptNet semantics
Lin et al. Biological question answering with syntactic and semantic feature matching and an improved mean reciprocal ranking measurement
CN111930880A (zh) Method, device and medium for text-encoding retrieval
Raza et al. An improved semantic query expansion approach using incremental user tag profile for efficient information retrieval
Siemiński Fast algorithm for assessing semantic similarity of texts
Kashyapi et al. TREMA-UNH at TREC 2018: Complex Answer Retrieval and News Track.
Gyorodi et al. Full-text search engine using mySQL
CN115618087B (zh) Method and device for storing, searching and displaying multilingual translation corpora

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 12774098; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 12774098; Country of ref document: EP; Kind code of ref document: A1