WO2024075086A1 - System and method for hybrid multilingual search indexing - Google Patents
- Publication number
- WO2024075086A1 (application PCT/IB2023/060075)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language
- fragment
- fragments
- text
- tokens
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
Definitions
- This disclosure relates generally to computerized search systems.
- In particular, this disclosure relates to search systems for searching multilingual objects, and to effective and efficient indexing and searching of such multilingual objects.
- a search engine is a computer program used to index electronically stored information (referred to as a corpus) and search the indexed electronic information to return electronically stored information responsive to a search.
- Items of electronic information that form the corpus may be referred to interchangeably as (electronic) documents, files, objects, items, content, etc. and may include objects such as files of almost any type including document for various editing applications, emails, workflows, etc.
- a user submits a query and the search engine selects a set of results from the corpus based on the terms of the search query.
- the terms of search queries usually specify words, terms, phrases, logical relationships, fields to be searched, synonyms, variations, etc.
- the documents being indexed and searched may include objects having content in multiple languages. Moreover, this multilingual content may be included within the same document. Thus, a single document may include content in multiple languages. Current search systems are both inefficient and ineffective when it comes to such multilingual documents.
- objects being indexed and searched in a search system may include objects having content in multiple languages where a single object may include content in multiple languages. Search systems may be both inefficient and ineffective when it comes to such multilingual documents.
- indexing of objects is usually accomplished by performing language analysis of the object that results in a set of tokens (and associated positional information) to be indexed for an object.
- This language analysis involves the tokenization of the content of the object followed by the actual language processing to determine the tokens of the object actually being indexed (e.g., case conversion, accent folding, normalization, stemming, lemmatization, etc.).
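By way of non-limiting illustration, the tokenization and normalization steps named above can be sketched as follows. This is a minimal sketch using only the Python standard library; the function names `fold_accents` and `analyze` are hypothetical, and the sketch covers only tokenization, case conversion and accent folding (stemming and lemmatization are omitted).

```python
import re
import unicodedata


def fold_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list:
    """Return (token, position) pairs for the text.

    Steps: compose the text (NFC), tokenize on word characters,
    convert case, then fold accents.
    """
    composed = unicodedata.normalize("NFC", text)
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", composed)):
        tokens.append((fold_accents(match.group().lower()), position))
    return tokens
```

Applying the same `analyze` function to both the object and the search query is what keeps index-time and query-time tokens comparable.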
- This same language analysis may be performed on a search query to determine search terms (e.g., the tokens resulting from performing language analysis on the search query) to be utilized when searching objects.
- the index time and the query time language analysis should match to obtain accurate search results.
- this language analysis is heavily dependent on the language being analyzed. The analysis of text in one language using language analysis adapted for a different language will thus result in the production of an improper or non-sensical set of tokens.
- One approach is to maintain separate search systems (or sites) specific to each language.
- objects at each site are indexed and searched according to the associated language.
- a website from a provider at a “.fr” domain may index objects according to French such that those documents can be effectively searched using French.
- a website from the same provider at a “.de” domain may index objects according to German such that those documents can be searched using German, etc.
- such an architecture is extremely inefficient, as it requires the separate indexing of all documents according to each language it is desired to support, maintaining a separate index for each language for all of those documents, and the maintenance of a separate interface/site for each of those languages.
- Another approach to handle multilingual documents is to use a single language per object methodology.
- some language identification methodology is employed to identify the language of the document.
- a language identification methodology is applied to this initial portion to identify the language.
- a language analysis associated with the identified language is then applied to index the object.
- a language identification methodology is applied to a search query and a language analysis associated with a language identified for the search query used to determine search terms to utilize when searching. In this manner, only a single language analysis is applied to index documents and to determine search terms based on a search query.
- these single language approaches are heavily dependent on accurate language detection for both the indexing of the object and the search query (e.g., both precision and recall may be reduced if language is improperly detected).
- the included content that is in any other language than the single language identified to index that document may be mis-analyzed and mis-indexed (e.g., the content in languages other than the identified language is linguistically analyzed using language analysis adapted for a different language (i.e., the identified language) and the tokens from that improper analysis indexed for that object).
- any content of these multilingual documents that is in any language differing from the single identified language for the object cannot be effectively searched, as it has been (mis)indexed using language analysis adapted for another language. Additionally, this ineffectiveness may be exacerbated when the differing languages are linguistically far apart, such as when content from a language using a Latin alphabet is included in a document with content in a language based on other characters (e.g., Chinese, Japanese, or Korean (CJK)).
- multilingual search system may analyze objects (and thus similarly search queries) using a multilingual object analyzer that fragments the text (content) of the object into one or more fragments. While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like.
- Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
- such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system).
- These sentence markers may be sentence markers from, for example, Japanese, Korean, Chinese (CJK), or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence.
- a pre-splitting of the text may be performed by splitting the text (e.g., by inserting whitespace or another character or marker into the text) where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text.
- Such pre-splitting may thus make the fragmentation of the text of the object more effective.
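By way of non-limiting illustration, such a pre-splitting step might be sketched as below. The set of sentence-ending characters (ideographic full stop, fullwidth exclamation mark, fullwidth question mark) and the name `pre_split` are assumptions made for this sketch; the disclosure does not enumerate the characters used.

```python
import re

# Assumed sentence-ending characters: ideographic full stop,
# fullwidth exclamation mark, fullwidth question mark.
CJK_SENTENCE_END = "。！？"

# A terminator that is immediately followed by a non-whitespace character.
_PATTERN = re.compile(rf"([{CJK_SENTENCE_END}])(?=\S)")


def pre_split(text: str, marker: str = " ") -> str:
    """Insert `marker` after each terminator that is not already followed by whitespace."""
    return _PATTERN.sub(lambda m: m.group(1) + marker, text)
```

A downstream fragmenter that relies on whitespace after a sentence terminator can then find boundaries in text such as `pre_split("今日は晴れです。明日は雨です。")` without a CJK-specific model.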
- a language detection may be performed to identify a language associated with each fragment. Again, such language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP.
- the fragment can be provided to a language analyzer for the identified language.
- the tokens for that fragment identified by that language analysis can then be indexed for that object. In this manner, all the tokens identified for each of the fragments by the respective language analyzer used for each fragment may be stored for the object and indexed to create a multilingual index.
- the set of objects of the corpus may be searched by providing a search query to the multilingual object analyzer such that the search query is fragmented (if needed) into one or more fragments such that language detection and analysis can be performed on each of those fragments to determine the search terms. These search terms can then be applied against the multilingual index to determine the search results.
- embodiments may allow the efficient implementation of multilingual search. Specifically, when language detection or analysis fails it may fail only for a particular fragment (e.g., not for the entire document). Thus, the document may be effectively indexed and subsequently searched even when such detection or analysis fails for a fragment.
- certain embodiments may utilize search engines that utilize fields to store various aspects of objects and that analyze objects using analysis chains associated with such fields to create indexes based on such fields. Examples of such search engines are Apache’s Solr and Elasticsearch.
- the implementation of embodiments as disclosed may advantageously allow the storage of multilingual tokens of an object to be associated with a single field for an object and thus may allow a multilingual search index to be built based on this single field associated for the object. This capability thus additionally allows a single analysis chain to be implemented in association with such a field where that single analysis chain can implement multilingual analysis for the content of that object.
- the architecture of such multilingual search systems allows the simple extensibility of such search systems by allowing users or other third parties to easily expand the multilingual search system to process documents including almost any language desired.
- FIGURE 1 is a block diagram depicting a computing environment in which one embodiment of a search system can be implemented.
- FIGURE 2 is a block diagram depicting one embodiment of a search engine.
- FIGURES 3A and 3B are a block diagram depicting one embodiment of an object analyzer.
- FIGURE 4 is a block diagram depicting an index.
- FIGURES 5A and 5B are flow diagrams depicting embodiments of methods for the indexing and search of multilingual documents.
- a search engine is a computer program or a set of programs used to index information (referred to as a corpus) and search for indexed information.
- search engine selects a set of results from the corpus based on the terms of the search query.
- FIGURE 1 depicts a block diagram illustrating one embodiment of a computing environment 100 with object search system 101.
- Computing environment 100 includes an object repository 105 storing a corpus of objects 107 of interest (documents, images, emails or other objects that may be searched).
- Object repository 105 may comprise a file server or database system or other storage mechanism remotely or locally accessible by search system 101.
- search system 101 comprises a server having a central processing unit 112 connected to a memory 114 and storage unit 118 via a bus.
- Central processing unit 112 may represent a single processor, multiple processors, a processor(s) with multiple processing cores and the like.
- Storage unit 118 may include a non-transitory storage medium such as hard disk drives, flash memory devices, optical media and the like.
- Search system 101 may be connected to a data communications network (not shown).
- Storage unit 118 stores computer executable instructions 119, index 124, and value storage 125.
- Computer executable instructions 119 can represent multiple programs and operating system code.
- instructions 119 are executable to provide an object analyzer 120 and search engine 122.
- Object analyzer 120 and search engine 122 may be portions of the same program or may be separate programs. According to one embodiment, for example, object analyzer 120 is a component of one system while search engine 122 is a separate program that interfaces with another system.
- object analyzer 120 and search engine 122 can be implemented on different computing systems and can, themselves, be distributed.
- Index 124 may include text (e.g., tokens) for objects.
- Index 124 can include a single index containing metadata and text, separate metadata and text indices or other arrangements of information. While shown as a single index, index 124 may include multiple indices. Further, index 124 may be associated with a particular field defined for objects such that the index 124 may include tokens determined for that field for each object.
- search engine 122 may utilize fields associated with aspects of objects and may analyze objects using analysis chains associated with such fields to create index 124 based on such fields.
- Examples of such search engines are Apache’s Solr and Elasticsearch.
- object analyzer 120 may comprise an analysis chain associated with a field (e.g., a single field such as a text field) defined for objects 107 of the corpus stored in repository 105.
- Client computer system 130 may include components similar to those of the server of search system 101, such as CPU 138, memory 136, and storage 140. Additionally, client computer system 130 may include executable instructions 132 to provide a user interface 134 that allows a user to enter a search query. The user interface may be provided through a web browser, file system interface or other program.
- object analyzer 120 analyzes objects in object repository 105 to determine information to be indexed in index 124 or to analyze search queries received through the query interface 134.
- Object analyzer 120 can send indexing instructions to search engine 122 to direct search engine 122 to add/modify/delete metadata or text in index 124.
- object 107 being added to search system 101 is a text file
- the text or content of the file is indexed as well as information about the file.
- search engine 122 can search the information in index 124 to identify objects responsive to the search query and return a list or other representation of those objects to client computer 130.
- object analyzer 120 may be a multilingual object analyzer adapted to determine tokens to index (or search terms) for multilingual objects 107.
- object analyzer 120 is an analysis chain associated with a field (e.g., a single field such as a text field) defined for objects 107
- these determined tokens may be multilingual tokens associated with a field for that object 107 and used to create a hybrid index 124 (e.g., an index comprising tokens determined according to analysis of the content of an object according to multiple languages) associated with the field.
- FIGURE 2 depicts a diagrammatic representation of logical blocks for one embodiment of a search engine 122.
- Search engine 122 may provide an indexing interface 200 that receives indexing requests for objects or another source. Such an indexing request may specify an operation to be taken on index 124 for an object and data (e.g., such as tokens for the object being indexed) for that action.
- an application that generates an indexing request may be a document management system, a web site with a search capability such as an online store, or a desktop search program for email.
- an indexing request can take the form of an indexing object that includes a unique identification for an object, an operation, a metadata or text field affected and the metadata or text for the index.
- indexing operations may include adding, replacing, modifying and deleting information in the index, or combinations thereof.
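By way of non-limiting illustration, an indexing request of the kind described (object identification, operation, affected field, and the data for the index) might be modeled as follows. All names here (`Operation`, `IndexingRequest`, `apply_request`) and the in-memory layout of the index are hypothetical; only add, replace and delete are sketched, and a modify operation is omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Operation(Enum):
    ADD = "add"
    REPLACE = "replace"
    DELETE = "delete"


@dataclass
class IndexingRequest:
    object_id: str                      # unique identification of the object
    operation: Operation                # what to do to the index
    field: str                          # metadata or text field affected
    value: Optional[List[str]] = None   # e.g. tokens for the field


# Assumed layout for this sketch: field name -> object id -> list of tokens.
Index = Dict[str, Dict[str, List[str]]]


def apply_request(index: Index, request: IndexingRequest) -> None:
    """Apply one indexing request to the in-memory index."""
    entries = index.setdefault(request.field, {})
    if request.operation is Operation.DELETE:
        entries.pop(request.object_id, None)
    elif request.operation is Operation.ADD:
        entries.setdefault(request.object_id, []).extend(request.value or [])
    else:  # Operation.REPLACE
        entries[request.object_id] = list(request.value or [])
```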
- a distributor module 210 may distribute the indexing requests to indexing engines 220 that act on an indexing request to update index 124.
- This indexing engine 220 can thus determine the tokens to pass to the index 124 for the object using the object analyzer 120 or an indexing request may include the tokens to add to index 124 as determined by the object analyzer 120.
- Search engine 122 may also include a search interface 230 to receive queries.
- Search interface 230 may be configured to receive a search query, such that index 124 can be used to search for objects that meet the criteria set forth in the search query.
- search interface 230 may utilize an object analyzer to actually determine search terms to utilize for the received search query and provide those search terms to a search module 240.
- Search modules 240 are responsible for performing searches on an index 124, and performing tasks such as computing relevance score, sorting results, and retrieving metadata regions to return in a query.
- Federator 245 gathers the results from all search modules together, and generates a response to the query received through search interface 230.
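By way of non-limiting illustration, the gathering of results from several search modules might be sketched as below. The result format (lists of `(object_id, score)` pairs), the rule of keeping the highest score per object, and the name `federate` are assumptions for this sketch; the disclosure does not specify how results are merged or ranked.

```python
def federate(results_per_module, limit=10):
    """Merge per-module result lists into one ranked list.

    Each module is assumed to return (object_id, score) pairs. When an object
    is reported by several modules, the highest score is kept. Ties are broken
    by object id so the ordering is deterministic.
    """
    best = {}
    for results in results_per_module:
        for object_id, score in results:
            if score > best.get(object_id, float("-inf")):
                best[object_id] = score
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```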
- the embodiment of FIGURE 2 is provided by way of example.
- Search engine 122 may include any number of other modules or configurations to update and search an index.
- search modules 240 and indexing engines 220 may be a single module.
- Search engine 122 may be a portion of a larger program, such as a document management program, may be a separate program or may be implemented according to any suitable programming architecture.
- the processes of search engine 122 may be distributed across multiple computer systems.
- index 124 is illustrated as a single index, index 124 may comprise a set of smaller indexes. For example, a separate index could be used by each indexing engine.
- FIGURES 3A and 3B depict a block diagram of one embodiment of a multilingual object analyzer 320 that may be employed to determine tokens to index for an object, or search terms for a search query.
- the object to be indexed e.g., the content of the object
- This object 307 may include text in multiple languages. While embodiments are described with respect to FIGURES 3A and 3B in terms of an object, such descriptions may apply equally well to the analysis of search queries to determine the search terms of such queries.
- Fragmenter 302 fragments the text (content) of the object 307 into one or more fragments 304. While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like. Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
- such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system). These sentence markers may be sentence markers from, for example, Japanese, Korean, Chinese, or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence.
- a pre-splitting of the text may be performed by pre-splitter 306 by splitting the text (e.g., by inserting whitespace or another character or marker into the text) where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text.
- Such pre-splitting may thus make the fragmentation of the text of the object by fragmenter 302 more effective.
- a language detection may be performed by language detection module 308 to identify a language associated with each fragment 304.
- This language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP.
- the fragment can be provided to a language analyzer 314 for the language identified for that fragment 304 (e.g., language analyzer 314a may be for German, analyzer 314b for French, 314c for English, 314n for Greek, etc.).
- each fragment 304 is analyzed by a language analyzer 314 adapted for the language of that fragment 304, so that each portion of the text of the object 307 is analyzed according to a language analyzer 314 appropriate for its language.
- the tokens 312 for that fragment identified by that language analyzer 314 can then be indexed for that object 307.
- all the tokens 312 identified for each of the fragments by the respective language analyzer 314 used for each fragment 304 may be stored for the object 307 and indexed to create a multilingual index (or provided to the search module or interface for conducting a search).
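By way of non-limiting illustration, the flow through the fragmenter, language detection and per-language analyzers can be sketched as follows. This is a toy sketch, not the OpenNLP-based models described above: the punctuation-based fragmenter, the stopword-count language guess, the CJK bigram analyzer, and all names are assumptions chosen so the example runs with the standard library alone.

```python
import re

# Hypothetical stand-ins: tiny stopword lists, used both to guess a language
# and to show that each analyzer applies language-specific rules.
STOPWORDS = {
    "en": {"the", "is", "and", "of"},
    "fr": {"le", "la", "est", "et"},
    "de": {"der", "die", "ist", "und"},
}
CJK_CHAR = r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]"
SENTENCE_END = ".!?。！？"


def fragment(text):
    """Toy fragmenter: one fragment per sentence-like run of text."""
    pieces = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?", text)
    return [p.strip() for p in pieces if p.strip()]


def detect_language(fragment_text):
    """Toy detector: CJK by script, otherwise the best stopword overlap."""
    if re.search(CJK_CHAR, fragment_text):
        return "cjk"
    words = set(re.findall(r"\w+", fragment_text.lower()))
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def analyze_words(text, lang):
    """Lowercase word tokens minus the stopwords of the detected language."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOPWORDS.get(lang, set())]


def analyze_cjk(text):
    """Overlapping character bigrams (single characters if only one exists)."""
    chars = re.findall(CJK_CHAR, text)
    return ["".join(pair) for pair in zip(chars, chars[1:])] or chars


def analyze_object(text):
    """Fragment the text, detect each fragment's language, collect all tokens."""
    tokens = []
    for frag in fragment(text):
        lang = detect_language(frag)
        if lang == "cjk":
            tokens.extend(analyze_cjk(frag))
        else:
            tokens.extend(analyze_words(frag, lang))
    return tokens
```

Because detection runs per fragment, a wrong or "unknown" guess affects only the tokens of that fragment; the remaining fragments of the same object are still analyzed with an analyzer suited to their language.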
- object analyzer may be implemented as (or as part of) an analysis chain for a field associated with objects of a search system.
- the architecture of object analyzer may be extensible by allowing users or other third parties to easily expand the object analyzer to process objects based on almost any language desired.
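By way of non-limiting illustration, one way such extensibility might be realized is a registry into which third parties plug additional language analyzers. The registry, the decorator `register_analyzer`, the language code "el", and the placeholder Greek analyzer are all hypothetical names introduced for this sketch.

```python
import re
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]

# Registry of analyzers keyed by language code (an assumed convention).
_ANALYZERS: Dict[str, Analyzer] = {}


def default_analyzer(text: str) -> List[str]:
    """Fallback used when no analyzer is registered for a language."""
    return re.findall(r"\w+", text.lower())


def register_analyzer(language: str):
    """Decorator that plugs an analyzer into the registry for `language`."""
    def decorator(func: Analyzer) -> Analyzer:
        _ANALYZERS[language] = func
        return func
    return decorator


def get_analyzer(language: str) -> Analyzer:
    """Look up the analyzer for a detected language, with a fallback."""
    return _ANALYZERS.get(language, default_analyzer)


@register_analyzer("el")
def greek_analyzer(text: str) -> List[str]:
    # Placeholder body: a real analyzer would apply Greek-specific rules.
    return re.findall(r"\w+", text.lower())
```

With this arrangement, supporting a new language requires registering one function; the fragmenting and detection stages are unchanged.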
- FIGURE 4 depicts one embodiment of a multilingual index.
- this index 424 (e.g., which may be a portion of a larger index utilized by a search system) may be associated with a single field 440a used for object analysis in a search engine such as Apache Solr, Elasticsearch or the like.
- Each object in the corpus may have an associated object identifier 414 (e.g., from which the object in the corpus can be located or obtained).
- This object identifier 414 can be associated with a set of fields 440, one of which may be a text field 440a associated with tokens 412 determined from the respective object identified by the object identifier 414.
- An analysis chain may be associated with the text field 440a (e.g., such as an analysis chain implemented by an embodiment of an object analyzer as discussed).
- index 424 may include the tokens 412 for that field 440a for each (or a subset) of the objects of the corpus that result from performing the analysis chain associated with the (e.g., single) field 440a on objects in the search system.
- index 424 may include a dictionary 402 and a posting list 404.
- the dictionary 402 may include tokens 412 determined for each object (e.g., by the multilingual object analyzer as described), such that the index 424 includes the tokens 412 determined for all the objects of the corpus that have been indexed.
- a (e.g., single) field 440a and corresponding dictionary 402 includes tokens 412 from multiple languages determined according to their respective associated linguistic analysis for the corresponding language.
- Each of the tokens 412 may be associated with one or more entries in the posting list 404.
- Each entry in the posting list 404 includes an object identifier 414 of an object in which the associated token 412 appears, along with an identifier of each location 416 in that object where that token 412 appears.
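By way of non-limiting illustration, a dictionary and posting list for a single field might be sketched as follows. The class name `FieldIndex` and its methods are hypothetical, and token positions are taken to be ordinal positions in the token stream, which is an assumption of this sketch.

```python
from collections import defaultdict


class FieldIndex:
    """Toy index for one field: token -> {object id -> [positions]}."""

    def __init__(self):
        self._postings = defaultdict(dict)

    def add(self, object_id, tokens):
        """Record every position at which each token occurs in the object."""
        for position, token in enumerate(tokens):
            self._postings[token].setdefault(object_id, []).append(position)

    def dictionary(self):
        """All tokens known to the index, regardless of source language."""
        return sorted(self._postings)

    def lookup(self, token):
        """Posting entries for a token: object id -> list of positions."""
        return dict(self._postings.get(token, {}))
```

Tokens from different languages share the same dictionary, so a single field can serve queries in any of the analyzed languages.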
- FIGURE 5A depicts a flow diagram for one embodiment of a method for creation of a multilingual index for objects of an index.
- the (each) object to be indexed e.g., the content of the object
- This object may (or may not) include text in multiple languages.
- the object can be provided to an object analyzer.
- the object analyzer may, for example, be implemented as an analysis chain for a single field associated with the object in the search system.
- the text (content) of the object can be fragmented into one or more fragments (STEP 504). While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like. Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
- such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system). These sentence markers may be sentence markers from, for example, Japanese, Korean, Chinese, or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence.
- a pre-splitting of the text may be performed by splitting the text (e.g., by inserting whitespace or another character or marker into the text) where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text.
- Such pre-splitting may make the fragmentation of the text of the object more effective.
- the tokens associated with each of those fragments can then be determined (STEP 506). Specifically, for each fragment a language detection may be performed to identify a language associated with that fragment (STEP 508). This language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP.
- a language analyzer corresponding to the identified language may be applied to the text of that fragment to determine the tokens for that fragment (STEP 510).
- each fragment is analyzed by a language analyzer adapted for the language of that fragment, and that fragment of text of the object analyzed according to a language analyzer for that language.
- the tokens determined for each of the fragments of the objects can then be indexed (STEP 512). In this manner, all the tokens identified for each of the fragments by the respective language analyzer used for each fragment may be stored for the object and indexed to create a multilingual index (or provided to the search module or interface for conducting a search).
- FIGURE 5B depicts a flow diagram for one embodiment of a method for searching for objects using a multilingual index.
- a search query may be received (e.g., through a search interface or the like) (STEP 560).
- a similar analysis to that used to determine the tokens for an object may then be applied to determine the terms of the search query to be used to perform the search.
- the search query may be provided to the same (or a similar) object analyzer as that used for the objects of the search system.
- the text of the search query can thus be fragmented into one or more fragments (STEP 514) and the tokens associated with each of those fragments determined (STEP 516) by identifying a language associated with that fragment (STEP 518) using a language detection model.
- a language analyzer corresponding to the identified language may then be applied to the text of that fragment of the search query to determine the tokens for that fragment (STEP 520).
- the tokens determined for each of the fragments of the search query can then be used as search terms to search the objects of the search system using the multilingual index (STEP 522).
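By way of non-limiting illustration, the search flow can be sketched as below: the query passes through the same analyzer as the objects, and the resulting terms are looked up in the index. The stand-in analyzer (lowercased word tokens), the requirement that every term match (AND semantics), and the function names are assumptions of this sketch and are not taken from the disclosure.

```python
import re


def analyze(text):
    """Stand-in analyzer (assumption): lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(objects):
    """Map each token to the set of object ids whose text contains it."""
    index = {}
    for object_id, text in objects.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(object_id)
    return index


def search(query, index):
    """Return ids of objects containing every term of the analyzed query."""
    terms = analyze(query)
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches
```

Because `search` calls the same `analyze` used by `build_index`, a query such as "CHAT" is reduced to the token "chat" that was stored at index time.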
- Embodiments discussed herein can be implemented in a computer communicatively coupled to a network (for example, the Internet), another computer, or in a standalone computer.
- a suitable computer can include a central processing unit (“CPU”), at least one read-only memory (“ROM”), at least one random access memory (“RAM”), at least one hard drive (“HD”), and one or more input/output (“I/O”) device(s).
- the I/O devices can include a keyboard, monitor, printer, electronic pointing device (for example, mouse, trackball, stylus, touch pad, etc.), or the like.
- ROM, RAM, and HD are computer memories for storing computer-executable instructions executable by the CPU or capable of being compiled or interpreted to be executable by the CPU. Suitable computer-executable instructions may reside on a computer readable medium (e.g., ROM, RAM, and/or HD), hardware circuitry or the like, or any combination thereof.
- the term “computer readable medium” is not limited to ROM, RAM, and HD and can include any type of data storage medium that can be read by a processor.
- a computer-readable medium may refer to a data cartridge, a data backup magnetic tape, a floppy diskette, a flash memory drive, an optical data storage drive, a CD-ROM, ROM, RAM, HD, or the like.
- the processes described herein may be implemented in suitable computer-executable instructions that may reside on a computer readable medium (for example, a disk, CD-ROM, a memory, etc.).
- the computer-executable instructions may be stored as software code components on a direct access storage device array, magnetic tape, floppy diskette, optical storage device, or other appropriate computer-readable medium or storage device.
- Any suitable programming language can be used to implement the routines, methods or programs of embodiments of the invention described herein, including C, C++, Java, JavaScript, HTML, or any other programming or scripting code, etc.
- Other software/hardware/network architectures may be used.
- the functions of the disclosed embodiments may be implemented on one computer or shared/distributed among two or more computers in or across a network. Communications between computers implementing embodiments can be accomplished using any electronic, optical, radio frequency signals, or other suitable methods and tools of communication in compliance with known network protocols.
- Any particular routine can execute on a single computer processing device or multiple computer processing devices, a single computer processor or multiple computer processors. Data may be stored in a single storage medium or distributed through multiple storage mediums, and may reside in a single database or multiple databases (or other data storage techniques).
- Although steps, operations, or computations may be presented in a specific order, this order may be changed in different embodiments. In some embodiments, to the extent multiple steps are shown as sequential in this specification, some combination of such steps in alternative embodiments may be performed at the same time.
- the sequence of operations described herein can be interrupted, suspended, or otherwise controlled by another process, such as an operating system, kernel, etc.
- the routines can operate in an operating system environment or as stand-alone routines. Functions, routines, methods, steps and operations described herein can be performed in hardware, software, firmware or any combination thereof.
- Embodiments described herein can be implemented in the form of control logic in software or hardware or a combination of both.
- the control logic may be stored in an information storage medium, such as a computer-readable medium, as a plurality of instructions adapted to direct an information processing device to perform a set of steps disclosed in the various embodiments.
- a person of ordinary skill in the art will appreciate other ways and/or methods to implement the invention.
- a “computer-readable medium” may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, system or device.
- the computer readable medium can be, by way of example only but not by limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, system, device, propagation medium, or computer memory.
- Such computer-readable medium shall generally be machine readable and include software programming or code that can be human readable (e.g., source code) or machine readable (e.g., object code).
- non-transitory computer-readable media can include random access memories, read-only memories, hard drives, data cartridges, magnetic tapes, floppy diskettes, flash memory drives, optical data storage devices, compact-disc read-only memories, and other appropriate computer memories and data storage devices.
- some or all of the software components may reside on a single server computer or on any combination of separate server computers.
- a computer program product implementing an embodiment disclosed herein may comprise one or more non-transitory computer readable media storing computer instructions translatable by one or more processors in a computing environment.
- a “processor” includes any hardware system, mechanism or component that processes data, signals or other information.
- a processor can include a system with a general-purpose central processing unit, multiple processing units, dedicated circuitry for achieving functionality, or other systems. Processing need not be limited to a geographic location, or have temporal limitations. For example, a processor can perform its functions in “real-time,” “offline,” in a “batch mode,” etc. Portions of processing can be performed at different times and at different locations, by different (or the same) processing systems.
- the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variation thereof, are intended to cover a non-exclusive inclusion.
- a process, product, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, product, article, or apparatus.
- a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Systems and methods for the indexing and searching of multilingual documents are disclosed. In some embodiments, these multilingual search systems may analyze objects (and thus similarly search queries) using a multilingual object analyzer that fragments the text (content) of the object into one or more fragments. For each of those fragments, a language detection may be performed to identify a language associated with that fragment. Once the language is identified for that fragment, the fragment can be provided to a language analyzer for the identified language. The tokens for that fragment identified by that language analysis can then be indexed for that object. In this manner, all the tokens identified for each of the fragments by the respective language analyzer used for each fragment may be stored for the object and indexed to create a multilingual index.
Description
SYSTEM AND METHOD FOR HYBRID MULTILINGUAL SEARCH INDEXING
TECHNICAL FIELD
[0001] This disclosure relates generally to computerized search systems. In particular, this disclosure relates to search systems for searching multilingual objects. Even more specifically, this disclosure relates to effective and efficient indexing and searching of multilingual objects.
BACKGROUND
[0002] A search engine is a computer program used to index electronically stored information (referred to as a corpus) and search the indexed electronic information to return electronically stored information responsive to a search. Items of electronic information that form the corpus may be referred to interchangeably as (electronic) documents, files, objects, items, content, etc. and may include objects such as files of almost any type, including documents for various editing applications, emails, workflows, etc. In a conventional search engine, a user submits a query and the search engine selects a set of results from the corpus based on the terms of the search query. The terms of search queries usually specify words, terms, phrases, logical relationships, fields to be searched, synonyms, variations, etc.
[0003] Oftentimes the documents being indexed and searched may include objects having content in multiple languages. Moreover, this multilingual content may be included within the same document. Thus, a single document may include content in multiple languages. Current search systems are both inefficient and ineffective when it comes to such multilingual documents.
SUMMARY
[0004] As discussed, objects being indexed and searched in a search system may include objects having content in multiple languages where a single object may include content in multiple languages. Search systems may be both inefficient and ineffective when it comes to such multilingual documents.
[0005] To illustrate in more detail, indexing of objects is usually accomplished by performing language analysis of the object that results in a set of tokens (and associated positional information) to be indexed for an object. This language analysis involves the tokenization of the content of the object followed by the actual language processing to determine the tokens of the object actually being indexed (e.g., case conversion, accent folding, normalization, stemming, lemmatization, etc.). This same language analysis may be performed on a search query to determine search terms (e.g., the tokens resulting from performing language analysis on the search query) to be utilized when searching objects. Thus, the index time and the query time language analysis should match to obtain accurate search results. As may be imagined, this language analysis is heavily dependent on the language being analyzed. The analysis of text in one language using language analysis adapted for a different language will thus result in the production of an improper or non-sensical set of tokens.
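The language analysis described above can be illustrated with a minimal sketch. It assumes a toy English analyzer: whitespace tokenization, case conversion, accent folding via Unicode decomposition, and a deliberately naive suffix stripper standing in for a real stemmer. The function names (`fold_accents`, `naive_english_stem`, `analyze_english`) are hypothetical and are not taken from the disclosure.

```python
import unicodedata


def fold_accents(token: str) -> str:
    """Remove combining marks so that, e.g., 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_english_stem(token: str) -> str:
    """Toy suffix stripper (assumption); a real analyzer would use a proper stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze_english(text: str) -> list[tuple[str, int]]:
    """Return (token, position) pairs for whitespace-delimited text."""
    tokens = []
    for position, raw in enumerate(text.split()):
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        token = naive_english_stem(fold_accents(word.lower()))
        tokens.append((token, position))
    return tokens


# The same analysis is applied at index time and at query time,
# so the query terms line up with the indexed tokens.
indexed = analyze_english("Searching the cafés.")
query_terms = [token for token, _ in analyze_english("Café search")]
```

Because both calls go through the same chain, the query terms `cafe` and `search` match tokens produced for the indexed text. Applying this English-oriented chain to text in another language would produce tokens that a matching query-time analysis for that language would not generate, which is the mismatch the paragraph describes.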
[0006] Accordingly, to deal with multilingual objects, a number of approaches are used. One approach is to maintain separate search systems (or sites) specific to each language. Thus, objects at each site are indexed and searched according to the associated language. For example, a website from a provider at a “.fr” domain may index objects according to French such that those documents can be effectively searched using French, while a website from the same provider at a “.de” domain may index objects according to German such that those documents can be searched using German, etc. As can be realized, such an architecture is extremely inefficient, as it requires the separate indexing of all documents according to each language it is desired to support, maintaining a separate index for each language for all of those documents, and the maintenance of a separate interface/site for each of those languages.
[0007] Another approach to handle multilingual documents is to use a single language per object methodology. In this approach, when an object is received some language identification methodology is employed to identify the language of the document. Usually an initial portion of the object is sampled and a language identification methodology is applied to this initial portion to identify the language. A language analysis associated with the identified language is then applied to index the object. Similarly, a language identification methodology is applied to a search query and a language analysis associated with a language identified for the search query used to determine search terms to utilize when searching. In this manner, only a single language analysis is applied to index documents and to determine search terms based on a search query.
[0008] As a consequence, these single language approaches are heavily dependent on accurate language detection for both the indexing of the object and the search query (e.g., both precision and recall may be reduced if language is improperly detected). Moreover, for objects that include content in multiple languages, the included content that is in any other language than the single language identified to index that document may be mis-analyzed and mis-indexed (e.g., the content in languages other than the identified language is linguistically analyzed using language analysis adapted for a different language (i.e., the identified language) and the tokens from that improper analysis indexed for that object). Thus, any content of these multilingual documents that is in any language differing from the single identified language for the object cannot be effectively searched, as it has been (mis)indexed using language analysis adapted for another language. Additionally, this ineffectiveness may be exacerbated when the differing languages are linguistically far apart, such as when content from a language using a Latin alphabet is included in a document with content in a language based on other characters (e.g., Chinese, Japanese, or Korean (CJK)).
[0009] As can be seen, these approaches to dealing with multiple language documents are each problematic in their own way. Some approaches sacrifice the effectiveness of the search system (e.g., with respect to precision or recall) in certain contexts while others make inefficient use of storage space or other computer resources by requiring the processing of entire objects according to different language analysis processes and the storage of large indexes (or portions of indexes) that include the results (the extracted tokens) of each of these different language analyses.
[0010] What is desired then are improved search systems that may deal with multilingual objects in an efficient and effective manner.
[0011] To those ends, among others, attention is now directed to embodiments of a multilingual search system as disclosed herein. These multilingual search systems may analyze objects (and thus similarly search queries) using a multilingual object analyzer that fragments the text (content) of the object into one or more fragments. While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like. Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
[0012] In one embodiment, such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system). These sentence markers may be sentence markers from, for example, Chinese, Japanese, Korean (CJK), or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence. Similarly, in some cases, before fragmentation is performed on the text of the object, a pre-splitting of the text may be performed by splitting the text (e.g., by inserting whitespace or another character or marker into the text) where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text. Such pre-splitting may thus make the fragmentation of the text of the object more effective.
[0013] For each of those fragments, a language detection may be performed to identify a language associated with each fragment. Again, such language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP. Once the language is identified for that fragment, the fragment can be provided to a language analyzer for the identified language. The tokens for that fragment identified by that language analysis can then be indexed for that object. In this manner, all the tokens identified for each of the fragments by the respective language analyzer used for each fragment may be stored for the object and indexed to create a multilingual index. Similarly, the set of objects of the corpus may be searched by providing a search query to the multilingual object analyzer such that the search query is fragmented (if needed) into one or more fragments such that language detection and analysis can be performed on each of those fragments to determine the search terms. These search terms can then be applied against the multilingual index to determine the search results.
[0014] Advantageously, embodiments may allow the efficient implementation of multilingual search. Specifically, when language detection or analysis fails it may fail only for a particular fragment (e.g., not for the entire document). Thus, the document may be effectively indexed and subsequently searched even when such detection or analysis fails for a fragment.
[0015] Additionally, certain embodiments may utilize search engines that utilize fields to store various aspects of objects and that analyze objects using analysis chains associated with such fields to create indexes based on such fields. Examples of such search engines are Apache’s Solr and Elasticsearch. The implementation of embodiments as disclosed may advantageously allow the storage of multilingual tokens of an object to be associated with a single field for an object and thus may allow a multilingual search index to be built based on this single field associated with the object. This capability thus additionally allows a single analysis chain to be implemented in association with such a field where that single analysis chain can implement multilingual analysis for the content of that object.
[0016] Moreover, the architecture of such multilingual search systems allows the simple extensibility of such search systems by allowing users or other third parties to easily expand the multilingual search system to process documents including almost any language desired. In particular, third parties (e.g., users of such search system) can provide a language analysis and language identification model for the language desired by including them in (e.g., plugging them into) the identification and analysis phases of the indexing to get support for the desired language without altering the architecture (e.g., the analysis chain) for the search system.
[0017] These, and other, aspects of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. The following description, while indicating various embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many substitutions, modifications, additions or rearrangements may be made within the scope of the invention, and the invention includes all such substitutions, modifications, additions or rearrangements.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The drawings accompanying and forming part of this specification are included to depict certain aspects of the invention. A clearer impression of the invention, and of the components and operation of systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore non-limiting, embodiments illustrated in the drawings, wherein identical reference numerals designate the same components. Note that the features illustrated in the drawings are not necessarily drawn to scale.
[0019] FIGURE 1 is a block diagram depicting a computing environment in which one embodiment of a search system can be implemented.
[0020] FIGURE 2 is a block diagram depicting one embodiment of a search engine.
[0021] FIGURES 3A and 3B are a block diagram depicting one embodiment of an object analyzer.
[0022] FIGURE 4 is a block diagram depicting an index.
[0023] FIGURES 5A and 5B are flow diagrams depicting embodiments of methods for the indexing and search of multilingual documents.
DETAILED DESCRIPTION
[0024] The disclosure and various features and advantageous details thereof are explained more fully with reference to the exemplary, and therefore non-limiting, embodiments illustrated in the accompanying drawings and detailed in the following description. It should be understood, however, that the detailed description and the specific examples are given by way of illustration only and not by way of limitation. Descriptions of known programming techniques, computer software, hardware, operating platforms and protocols may be omitted so as not to unnecessarily obscure the disclosure in detail. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
[0025] Before discussing embodiments in detail, some context may be useful. As discussed, a search engine is a computer program or a set of programs used to index information (referred to as a corpus) and search for indexed information. In a conventional search engine, a user submits a query and the search engine selects a set of results from the corpus based on the terms of the search query.
[0026] FIGURE 1 depicts a block diagram illustrating one embodiment of a computing environment 100 with object search system 101. Computing environment 100 includes an object repository 105 storing a corpus of objects 107 of interest (documents, images, emails or other objects that may be searched). Object repository 105 may comprise a file server or database system or other storage mechanism remotely or locally accessible by search system 101.
[0027] In the embodiment of FIGURE 1, search system 101 comprises a server having a central processing unit 112 connected to a memory 114 and storage unit 118 via a bus. Central processing unit 112 may represent a single processor, multiple processors, a processor(s) with multiple processing cores and the like. Storage unit 118 may include a non-transitory storage medium such as hard disk drives, flash memory devices, optical media and the like. Search system 101 may be connected to a data communications network (not shown).
[0028] Storage unit 118 stores computer executable instructions 119, index 124, and value storage 125. Computer executable instructions 119 can represent multiple programs and operating system code. In one embodiment, instructions 119 are executable to provide an object analyzer 120 and search engine 122. Object analyzer 120 and search engine 122 may be portions of the same program or may be separate programs. According to one embodiment, for example, object analyzer 120 is a component of one system while search engine 122 is a separate program that interfaces with another system. Furthermore, object analyzer 120 and search engine 122 can be implemented on different computing systems and can, themselves, be distributed.
[0029] Index 124 may include text (e.g., tokens) for objects. Index 124 can include a single index containing metadata and text, separate metadata and text indices or other arrangements of information. While shown as a single index, index 124 may include multiple indices. Further, index 124 may be associated with a particular field defined for objects such that the index 124 may include tokens determined for that field for each object.
[0030] In one embodiment, search engine 122 may utilize fields associated with aspects of objects and analyze objects using analysis chains associated with such fields to create index 124 based on such fields. Examples of such search engines are Apache’s Solr and Elasticsearch. Thus, in such an embodiment, object analyzer 120 may comprise an analysis chain associated with a field (e.g., a single field such as a text field) defined for objects 107 of the corpus stored in repository 105.
[0031] Client computer system 130 may include components similar to those of the server of search system 101, such as CPU 138, memory 136, and storage 140. Additionally, client computer system 130 may include executable instructions 132 to provide a user interface 134 that allows a user to enter a search query. The user interface may be provided through a web browser, file system interface or other program.
[0032] In operation, object analyzer 120 analyzes objects in object repository 105 to determine information to be indexed in index 124, or analyzes search queries received through the query interface 134. Object analyzer 120 can send indexing instructions to search engine 122 to direct search engine 122 to add/modify/delete metadata or text in index 124. As an example, if object 107 being added to search system 101 is a text file, the text or content of the file is indexed as well as information about the file. When a search query is received, search engine 122 can search the information in index 124 to identify objects responsive to the search query and return a list or other representation of those objects to client computer 130.
[0033] In particular, as will be discussed in more detail, object analyzer 120 may be a multilingual object analyzer adapted to determine tokens to index (or search terms) for multilingual objects 107. Thus, in instances where object analyzer 120 is an analysis chain associated with a field (e.g., a single field such as a text field) defined for objects 107, these determined tokens may be multilingual tokens associated with a field for that object 107 and used to create a hybrid index 124 (e.g., an index comprising tokens determined according to analysis of the content of an object according to multiple languages) associated with the field.
[0034] FIGURE 2 depicts a diagrammatic representation of logical blocks for one embodiment of a search engine 122. Search engine 122 may provide an indexing interface 200 that receives indexing requests for objects or another source. Such an indexing request may specify an operation to be taken on index 124 for an object and data (e.g., such as tokens for the object being indexed) for that action. For context, an application that generates an indexing request may be a document management system, a web site with a search capability such as an online store, or a desktop search program for email.
[0035] According to one embodiment, for example, an indexing request can take the form of an indexing object that includes a unique identification for an object, an operation, a metadata or text field affected and the metadata or text for the index. By way of example, but not limitation, indexing operations may include adding, replacing, modifying and deleting information in the index, or combinations thereof.
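As a rough illustration of the indexing object described above, the following sketch models a request carrying an object identifier, an operation, the affected field, and the data for that field. The class and function names, and the exact semantics given to each operation (for instance, that "modify" appends tokens), are assumptions made for this example; the disclosure does not specify them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Operation(Enum):
    ADD = "add"
    REPLACE = "replace"
    MODIFY = "modify"
    DELETE = "delete"


@dataclass
class IndexingRequest:
    object_id: str        # unique identification for the object
    operation: Operation  # action to take on the index
    field_name: str       # metadata or text field affected
    tokens: list[str] = field(default_factory=list)  # data for the action


def apply_request(index: dict[str, dict[str, list[str]]], request: IndexingRequest) -> None:
    """Apply one request to a toy index shaped as object_id -> field -> tokens."""
    fields = index.setdefault(request.object_id, {})
    if request.operation in (Operation.ADD, Operation.REPLACE):
        fields[request.field_name] = list(request.tokens)
    elif request.operation is Operation.MODIFY:
        # Assumed semantics: append the supplied tokens to the existing field.
        fields.setdefault(request.field_name, []).extend(request.tokens)
    elif request.operation is Operation.DELETE:
        fields.pop(request.field_name, None)
        if not fields:
            del index[request.object_id]


index: dict[str, dict[str, list[str]]] = {}
apply_request(index, IndexingRequest("doc-1", Operation.ADD, "text", ["chat", "noir"]))
```

A production search engine would route such requests through its own indexing interface; the dictionary here only makes the shape of the request concrete.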
[0036] A distributor module 210 may distribute the indexing requests to an indexing engine 220 that acts on each indexing request to update index 124. This indexing engine 220 can thus determine the tokens to pass to the index 124 for the object using the object analyzer 120, or an indexing request may include the tokens to add to index 124 as determined by the object analyzer 120.
[0037] Search engine 122 may also include a search interface 230 to receive queries. Search interface 230 may be configured to receive a search query, such that index 124 can be used to search for objects that meet the criteria set forth in the search query. Specifically, search interface 230 may utilize an object analyzer to actually determine search terms to utilize for the received search query and provide those search terms to a search module 240. Search modules 240 are responsible for performing searches on an index 124, and performing tasks such as computing relevance score, sorting results, and retrieving metadata regions to return in a query. Federator 245 gathers the results from all search modules together, and generates a response to the query received through search interface 230.
[0038] The embodiment of FIGURE 2 is provided by way of example. Search engine 122 may include any number of other modules or configurations to update and search an index. For example, search modules 240 and indexing engines 220 may be a single module. Search engine 122 may be a portion of a larger program, such as a document management program, may be a separate program or may be implemented according to any suitable programming architecture. In one embodiment, the processes of search engine 122 may be distributed across multiple computer systems. Furthermore, while in FIGURE 2, index 124 is illustrated as a single index, index 124 may comprise a set of smaller indexes. For example, a separate index could be used by each indexing engine.
[0039] Referring now to FIGURES 3A and 3B, a block diagram of one embodiment of a multilingual object analyzer 320 that may be employed to determine tokens to index for an object, or search terms for a search query, is depicted. Here, the object to be indexed (e.g., the content of the object) 307 or the search query is provided to the object analyzer 320. This object 307 may include text in multiple languages. While embodiments described with respect to FIGURES 3A and 3B are described with respect to an object, such descriptions may apply equally well to the analysis of search queries to determine the search terms of such queries.
[0040] Fragmenter 302 fragments the text (content) of the object 307 into one or more fragments 304. While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like. Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
[0041] In one embodiment, such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system). These sentence markers may be sentence markers from, for example, Japanese, Korean, Chinese, or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence.
[0042] Similarly, in some cases, before fragmenter 302 is applied to the text of the object 307, a pre-splitting of the text may be performed by pre-splitter 306 by splitting the text (e.g., by inserting whitespace or another character or marker into the text) where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text. Such pre-splitting may thus make the fragmentation of the text of the object by fragmenter 302 more effective.
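A minimal sketch of such pre-splitting follows. It assumes the sentence-ending characters of interest are the ideographic full stop and the fullwidth exclamation and question marks, and that the inserted marker is a single space; both choices, and the `pre_split` name, are illustrative assumptions rather than details from the disclosure.

```python
import re

# Assumed set of sentence-ending characters: ideographic full stop,
# fullwidth exclamation mark, fullwidth question mark.
CJK_SENTENCE_END = "。！？"


def pre_split(text: str, marker: str = " ") -> str:
    """Insert `marker` after each sentence-ending character that is
    immediately followed by a non-whitespace character."""
    pattern = rf"([{CJK_SENTENCE_END}])(?=\S)"
    # A callable replacement avoids any escape handling of `marker`.
    return re.sub(pattern, lambda m: m.group(1) + marker, text)


prepared = pre_split("Hello. 你好吗？我很好。")
```

After this step a downstream sentence detector that relies partly on whitespace sees a boundary between the two Chinese sentences, while text that already has whitespace after its terminator is left unchanged.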
[0043] For each of those fragments 304, a language detection may be performed by language detection module 308 to identify a language associated with each fragment 304. This language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP.
[0044] Once the language is identified for a fragment 304, the fragment can be provided to a language analyzer 314 for the language identified for that fragment 304 (e.g., language analyzer 314a may be for German, analyzer 314b for French, 314c for English, 314n for Greek, etc.). Thus, each fragment 304 of the text of the object 307 is analyzed by a language analyzer 314 adapted for the language of that fragment 304. The tokens 312 for that fragment identified by that language analyzer 314 can then be indexed for that object 307. In this manner, all the tokens 312 identified for each of the fragments by the respective language analyzer 314 used for each fragment 304 may be stored for the object 307 and indexed to create a multilingual index (or provided to the search module or interface for conducting a search).
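The fragment, detect, and analyze flow can be sketched end to end as follows. Everything here is a stand-in chosen for illustration: the regular-expression fragmenter replaces a trained sentence boundary model, the script and stopword heuristic replaces a language detection model (such as one built with OpenNLP), and the per-language analyzers are toy functions. All names are hypothetical.

```python
import re
from typing import Callable

Analyzer = Callable[[str], list[str]]

# Tiny stopword lists used only by the toy detector below (assumption).
STOPWORDS = {
    "en": {"the", "is", "and", "of"},
    "fr": {"le", "la", "est", "et"},
    "de": {"der", "die", "das", "ist", "und"},
}


def fragment(text: str) -> list[str]:
    """Split after sentence-ending punctuation; stand-in for a trained model."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def detect_language(fragment_text: str) -> str:
    """Toy detector: CJK script check first, then stopword overlap."""
    for ch in fragment_text:
        if "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff" or "\uac00" <= ch <= "\ud7a3":
            return "cjk"
    words = set(re.findall(r"\w+", fragment_text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def word_analyzer(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def french_analyzer(text: str) -> list[str]:
    # Strip elided articles and pronouns such as l' or d' before tokenizing.
    return re.findall(r"\w+", re.sub(r"\b[cdjlmnst]'", "", text.lower()))


def cjk_bigram_analyzer(text: str) -> list[str]:
    chars = [ch for ch in text if ch.isalnum()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Registry of per-language analyzers; adding a language means adding an entry.
ANALYZERS: dict[str, Analyzer] = {
    "en": word_analyzer,
    "fr": french_analyzer,
    "de": word_analyzer,
    "cjk": cjk_bigram_analyzer,
}


def analyze_object(text: str) -> list[str]:
    """Collect tokens from every fragment using the analyzer for its language."""
    tokens: list[str] = []
    for frag in fragment(text):
        analyzer = ANALYZERS.get(detect_language(frag), word_analyzer)
        tokens.extend(analyzer(frag))
    return tokens


tokens = analyze_object("The index is small. Le chat est noir. 検索です。")
```

In this sketch a fragment whose language cannot be determined falls back to the plain word analyzer, so a detection failure affects only that fragment rather than the whole object. The same `analyze_object` function could be applied to a search query to obtain its search terms.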
[0045] In some embodiments, object analyzer 320 may be implemented as (or as part of) an analysis chain for a field associated with objects of a search system. As such, the architecture of object analyzer 320 may be extensible by allowing users or other third parties to easily expand the object analyzer to process objects based on almost any language desired. In particular, third parties (e.g., users of such search system) can provide a language analysis or language identification model for the language desired by including them in (e.g., plugging them into) the identification and analysis phases of the object analyzer to get support for the desired language without altering the architecture of the object analyzer 320 (e.g., the analysis chain).
[0046] FIGURE 4 depicts one embodiment of a multilingual index. In one embodiment, for example, this index 424 (e.g., which may be a portion of a larger index utilized by a search system) may be associated with a single field 440a used for object analysis in a search engine such as Apache Solr, Elasticsearch or the like. Each object in the corpus may have an associated object identifier 414 (e.g., from which the object in the corpus can be located or obtained). This object identifier 414 can be associated with a set of fields 440, one of which may be a text field 440a associated with tokens 412 determined from the respective object identified by the object identifier 414. An analysis chain may be associated with the text field 440a (e.g., such as an analysis chain implemented by an embodiment of an object analyzer as discussed).
[0047] Thus, such a multilingual index 424 may include the tokens 412 for that field 440a for each (or a subset) of the objects of the corpus that result from performing the analysis chain associated with the (e.g., single) field 440a on objects in the search system. Here, index 424 may include a dictionary 402 and a posting list 404. The dictionary 402 may include tokens 412 determined for each object (e.g., by the multilingual object analyzer as described), such that the index 424 includes the tokens 412 determined for all the objects of the corpus that have been indexed. It will be noticed here that a (e.g., single) field 440a and corresponding dictionary 402 include tokens 412 from multiple languages determined according to their respective associated linguistic analysis for the corresponding language. Each of the tokens 412 may be associated with one or more entries in the posting list 404. Each entry in the posting list 404 includes an object identifier 414 of an object in which the associated token 412 appears, along with an identifier of each location 416 in that object where that token 412 appears.
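A minimal sketch of such a dictionary and posting list is given below for illustration only. It assumes that a location 416 is represented by the ordinal position of the token in the object's token stream; the specification does not fix that representation, and the class and method names are hypothetical.

```python
from collections import defaultdict


class MultilingualIndex:
    """Toy single-field index: token -> {object identifier -> [locations]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, object_id, tokens):
        # Record every location at which each token occurs in the object.
        # Tokens may come from fragments in different languages.
        for position, token in enumerate(tokens):
            self.postings[token][object_id].append(position)

    def dictionary(self):
        # The dictionary is the set of all indexed tokens, across languages.
        return sorted(self.postings)

    def posting_list(self, token):
        # Entries for a token: object identifier plus locations in that object.
        entries = self.postings.get(token, {})
        return {oid: list(locations) for oid, locations in entries.items()}


if __name__ == "__main__":
    index = MultilingualIndex()
    index.add("doc1", ["hund", "braun", "cat"])
    index.add("doc2", ["cat", "cat"])
    print(index.dictionary())
    print(index.posting_list("cat"))
```

Because all tokens are stored under one field, a German token and an English token from the same object share one dictionary, which is the property noted in the paragraph above.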
[0048] Looking now at FIGURE 5A, a flow diagram for one embodiment of a method for creation of a multilingual index for objects of an index is depicted. Here, the (each) object to be indexed (e.g., the content of the object) may be received (STEP 502). This object may (or may not) include text in multiple languages. The object can be provided to an object analyzer. The object analyzer may, for example, be implemented as an analysis chain for a single field associated with the object in the search system.
[0049] The text (content) of the object can be fragmented into one or more fragments (STEP 504). While almost any type of fragmentation desired may be utilized to determine fragments of text, in one embodiment this fragmentation can be based on a sentence boundary detection model using Natural Language Processing (NLP) or the like. Such a model may, for example, be based on Apache’s OpenNLP (e.g., may be a pluggable model that can be trained and provided).
[0050] In one embodiment, such a sentence boundary detection model may be trained based on sentence markers for multiple languages (which may be the same as, or a subset or superset of, the languages supported by the multilingual search system). These sentence markers may be sentence markers from, for example, Japanese, Korean, Chinese, or other languages which do not use a period (or whitespace in certain contexts) to delineate the end of a sentence.
[0051] Similarly, in some cases, before fragmenting the text of the object, a pre-splitting of the text may be performed by inserting a split (e.g., whitespace or another character or marker) into the text where certain characters delineating the end of a sentence (e.g., such as in Japanese, Chinese or Korean) occur in the text. Such pre-splitting may make the fragmentation of the text of the object more effective.
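For illustration, pre-splitting followed by fragmentation may be sketched as below. The set of terminator characters (the ideographic full stop and the full-width exclamation and question marks) is an assumption chosen for this sketch, and the regular-expression splitter is merely a stand-in for a trained sentence boundary detection model.

```python
import re

# Assumed sentence terminators: ideographic full stop, full-width "!" and "?".
CJK_TERMINATORS = "\u3002\uff01\uff1f"

# A terminator that is immediately followed by a non-whitespace character.
_PRE_SPLIT = re.compile("([" + CJK_TERMINATORS + "])(?=\\S)")

# Whitespace that directly follows any sentence terminator.
_BOUNDARY = re.compile("(?<=[.!?" + CJK_TERMINATORS + "])\\s+")


def pre_split(text):
    """Insert a space after each terminator that is not already followed by one."""
    return _PRE_SPLIT.sub(r"\1 ", text)


def fragment(text):
    """Naive stand-in for a sentence boundary model: split the pre-split
    text on whitespace that follows a sentence terminator."""
    return [piece for piece in _BOUNDARY.split(pre_split(text)) if piece]


if __name__ == "__main__":
    sample = "A\u3002B\u3002Hello there. Bye."
    for piece in fragment(sample):
        print(piece)
```

Without the pre-splitting step, the whitespace-based splitter in this sketch would treat the two sentences ending in an ideographic full stop and the sentence that follows them as one fragment, since no whitespace separates them.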
[0052] The tokens associated with each of those fragments can then be determined (STEP 506). Specifically, for each fragment a language detection may be performed to identify a language associated with that fragment (STEP 508). This language detection may be performed based on a language detection model, where the language detection model may be a (e.g., pluggable) model based on Apache’s OpenNLP.
[0053] Once the language is determined for the fragment, a language analyzer corresponding to the identified language may be applied to the text of that fragment to determine the tokens for that fragment (STEP 510). Thus, each fragment of the text of the object is analyzed by a language analyzer adapted for the language of that fragment. The tokens determined for each of the fragments of the objects (e.g., by applying appropriate language analysis to each fragment) can then be indexed (STEP 512). In this manner, all the tokens identified for each of the fragments by the respective language analyzer used for each fragment may be stored for the object and indexed to create a multilingual index (or provided to the search module or interface for conducting a search).
[0054] It will be apparent that a similar analysis may be performed for a search query to perform a search for objects using such a multilingual search index. FIGURE 5B depicts a flow diagram for one embodiment of a method for searching for objects using a multilingual index. Initially, a search query may be received (e.g., through a search interface or the like) (STEP 560). A similar analysis to that used to determine the tokens for an object may then be applied to determine the terms of the search query to be used to perform the search. In particular, the search query may be provided to the same (or a similar) object analyzer as that used for the objects of the search system.
[0055] The text of the search query can thus be fragmented into one or more fragments (STEP 514) and the tokens associated with each of those fragments determined (STEP 516) by identifying a language associated with that fragment (STEP 518) using a language detection model. A language analyzer corresponding to the identified language may then be applied to the text of that fragment of the search query to determine the tokens for that fragment (STEP 520). The tokens determined for each of the fragments of the search query can then be used as search terms to search the objects of the search system using the multilingual index (STEP 522).
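The symmetry between indexing and searching may be sketched, without limitation, as follows. The single `analyze` function is a simplified stand-in for the full fragment, detect and analyze chain, and matching an object on any query token is an assumption of this sketch; the specification does not prescribe how the search terms are combined.

```python
import re
from collections import defaultdict


def analyze(text):
    """Stand-in for the analysis chain: here, simply lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(objects):
    """Index each object's tokens: token -> set of object identifiers."""
    index = defaultdict(set)
    for object_id, text in objects.items():
        for token in analyze(text):
            index[token].add(object_id)
    return index


def search(index, query):
    """Analyze the query with the same chain used for the objects, then
    return identifiers of objects containing any of the query tokens."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)


if __name__ == "__main__":
    objects = {
        "doc1": "Der Hund ist braun. The cat is black.",
        "doc2": "Le chat est noir.",
    }
    index = build_index(objects)
    print(search(index, "Hund"))
    print(search(index, "chat cat"))
```

Using the same analysis for the query and for the objects helps the search terms take the same form as the indexed tokens, whichever language each fragment was written in.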
[0056] Although the invention has been described with respect to specific embodiments thereof, these embodiments are merely illustrative, and not restrictive of the invention. The description herein of illustrated embodiments of the invention is not intended to be exhaustive or to limit the invention to the precise forms disclosed herein (and in particular, the inclusion of any particular embodiment, feature or function is not intended to limit the scope of the invention to such embodiment, feature or function). Rather, the description is intended to describe illustrative embodiments, features and functions in order to provide a person of ordinary skill in the art context to understand the invention without limiting the invention to any particularly described embodiment, feature or function. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes only, various equivalent modifications are possible within the spirit and scope of the invention, as those skilled in the relevant art will recognize and appreciate. As indicated, these modifications may be made to the invention in light of the foregoing description of illustrated embodiments of the invention and are to be included within the spirit and scope of the invention.
[0057] Thus, while the invention has been described herein with reference to particular embodiments thereof, a latitude of modification, various changes and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of embodiments of the invention will be employed without a corresponding use of other features without departing from the scope and spirit of the invention as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit of the invention.
[0058] Reference throughout this specification to “one embodiment”, “an embodiment”, or “a specific embodiment” or similar terminology means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment and may not necessarily be present in all embodiments. Thus, respective appearances of the phrases “in one embodiment”, “in an embodiment”, or “in a specific embodiment” or similar terminology in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics of any particular embodiment may be combined in any suitable manner with one or more other embodiments. It is to be understood that other variations and modifications of the embodiments described and illustrated herein are possible in light of the teachings herein and are to be considered as part of the spirit and scope of the invention.
[0059] In the description herein, numerous specific details are provided, such as examples of components and/or methods, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that an embodiment may be able to be practiced without one or more of the specific details, or with other apparatus, systems, assemblies, methods, components, materials, parts, and/or the like. In other instances, well-known structures, components, systems, materials, or operations are not specifically shown or described in detail to avoid obscuring aspects of embodiments of the invention. While the invention may be illustrated by using a particular embodiment, this is not and does not limit the invention to any particular embodiment and a person of ordinary skill in the art will recognize that additional embodiments are readily understandable and are a part of this invention.
[0060] Embodiments discussed herein can be implemented in a computer communicatively coupled to a network (for example, the Internet), another computer, or in a standalone computer. As is known to those skilled in the art, a suitable computer can include a central processing unit (“CPU”), at least one read-only memory (“ROM”), at least one random access memory (“RAM”), at least one hard drive (“HD”), and one or more input/output (“I/O”) device(s). The I/O devices can include a keyboard, monitor, printer, electronic pointing device (for example, mouse, trackball, stylus, touch pad, etc.), or the like.
[0061] ROM, RAM, and HD are computer memories for storing computer-executable instructions executable by the CPU or capable of being compiled or interpreted to be executable by the CPU. Suitable computer-executable instructions may reside on a computer readable medium (e.g., ROM, RAM, and/or HD), hardware circuitry or the like, or any combination thereof. Within this disclosure, the term “computer readable medium” is not limited to ROM, RAM, and HD and can include any type of data storage medium that can be read by a processor. For example, a computer-readable medium may refer to a data cartridge, a data backup magnetic tape, a floppy diskette, a flash memory drive, an optical data storage drive, a CD-ROM, ROM, RAM, HD, or the like. The processes described herein may be implemented in suitable computer-executable instructions that may reside on a computer readable medium (for example, a disk, CD-ROM, a memory, etc.). Alternatively, the computer-executable instructions may be stored as software code components on a direct access storage device array, magnetic tape, floppy diskette, optical storage device, or other appropriate computer-readable medium or storage device.
[0062] Any suitable programming language can be used to implement the routines, methods or programs of embodiments of the invention described herein, including C, C++, Java, JavaScript, HTML, or any other programming or scripting code, etc. Other software/hardware/network architectures may be used. For example, the functions of the disclosed embodiments may be implemented on one computer or shared/distributed among two or more computers in or across a network. Communications between computers implementing embodiments can be accomplished using any electronic, optical, radio frequency signals, or other suitable methods and tools of communication in compliance with known network protocols.
[0063] Different programming techniques can be employed such as procedural or object oriented. Any particular routine can execute on a single computer processing device or multiple computer processing devices, a single computer processor or multiple computer processors. Data may be stored in a single storage medium or distributed through multiple storage mediums, and may reside in a single database or multiple databases (or other data storage techniques). Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different embodiments. In some embodiments, to the extent multiple steps are shown as sequential in this specification, some combination of such steps in alternative embodiments may be performed at the same time. The sequence of operations described herein can be interrupted, suspended, or otherwise controlled by another process, such as an operating system, kernel, etc. The routines can operate in an operating system environment or as stand-alone routines. Functions, routines, methods, steps and operations described herein can be performed in hardware, software, firmware or any combination thereof.
[0064] Embodiments described herein can be implemented in the form of control logic in software or hardware or a combination of both. The control logic may be stored in an information storage medium, such as a computer-readable medium, as a plurality of instructions adapted to direct an information processing device to perform a set of steps disclosed in the various embodiments. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the invention.
[0065] It is also within the spirit and scope of the invention to implement in software programming or code any of the steps, operations, methods, routines or portions thereof described herein, where such software programming or code can be stored in a computer-readable medium and can be operated on by a processor to permit a computer to perform any of the steps, operations, methods, routines or portions thereof described herein. The invention may be implemented by using software programming or code in one or more general purpose digital computers, or by using application specific integrated circuits, programmable logic devices, or field programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used. In general, the functions of the invention can be achieved by any means as is known in the art. For example, distributed, or networked systems, components and circuits can be used. In another example, communication or transfer (or otherwise moving from one place to another) of data may be wired, wireless, or by any other means.
[0066] A “computer-readable medium” may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, system or device. The computer readable medium can be, by way of example only but not by limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, system, device, propagation medium, or computer memory. Such computer-readable medium shall generally be machine readable and include software programming or code that can be human readable (e.g., source code) or machine readable (e.g., object code). Examples of non-transitory computer-readable media can include random access memories, read-only memories, hard drives, data cartridges, magnetic tapes, floppy diskettes, flash memory drives, optical data storage devices, compact-disc read-only memories, and other appropriate computer memories and data storage devices. In an illustrative embodiment, some or all of the software components may reside on a single server computer or on any combination of separate server computers. As one skilled in the art can appreciate, a computer program product implementing an embodiment disclosed herein may comprise one or more non-transitory computer readable media storing computer instructions translatable by one or more processors in a computing environment.
[0067] A “processor” includes any hardware system, mechanism or component that processes data, signals or other information. A processor can include a system with a general-purpose central processing unit, multiple processing units, dedicated circuitry for achieving functionality, or other systems. Processing need not be limited to a geographic location, or have temporal limitations. For example, a processor can perform its functions in “real-time,” “offline,” in a “batch mode,” etc. Portions of processing can be performed at different times and at different locations, by different (or the same) processing systems.
[0068] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. Additionally, any signal arrows in the drawings/Figures should be considered only as exemplary, and not limiting, unless otherwise specifically noted.
[0069] As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, product, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, product, article, or apparatus.
[0070] Furthermore, the term “or” as used herein is generally intended to mean “and/or” unless otherwise indicated. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
[0071] Although the foregoing specification describes specific embodiments, numerous changes in the details of the embodiments disclosed herein and additional embodiments will be apparent to, and may be made by, persons of ordinary skill in the art having reference to this disclosure. In this context, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of this disclosure.
Claims
1. A system, comprising: a processor; and a computer readable medium storing instructions translatable by the processor to implement a multilingual search engine, comprising instructions for: receiving a multilingual object; determining a set of fragments of the text of the object, where each of the set of fragments comprises a portion of the text of the object; determining a language associated with each fragment of the determined set of fragments of the object; determining a set of tokens for each fragment of the determined set of fragments, where determining the set of tokens for a fragment comprises analyzing the portion of the text of that fragment based on the language associated with that fragment; and indexing the set of tokens determined for each fragment in an index in association with the object.
2. The system of claim 1, wherein the set of tokens are associated with a single field of the object.
3. The system of claim 1, wherein the indexed set of tokens are associated with a single field of the object.
4. The system of claim 1, where each of the set of fragments is a sentence.
5. The system of claim 4, wherein determining the set of fragments is done using a machine learning model trained on sentence markers for multiple languages.
6. The system of claim 1, further comprising presplitting the text of the object based upon one or more markers before determining the set of fragments.
7. The system of claim 1, wherein analyzing the portion of the text of that fragment based on the language associated with that fragment comprises selecting a language model of a set of pluggable language models.
8. A method, comprising: receiving a multilingual object; determining a set of fragments of the text of the object, where each of the set of fragments comprises a portion of the text of the object; determining a language associated with each fragment of the determined set of fragments of the object; determining a set of tokens for each fragment of the determined set of fragments, where determining the set of tokens for a fragment comprises analyzing the portion of the text of that fragment based on the language associated with that fragment; and indexing the set of tokens determined for each fragment in an index in association with the object.
9. The method of claim 8, wherein the set of tokens are associated with a single field of the object.
10. The method of claim 8, wherein the indexed set of tokens are associated with a single field of the object.
11. The method of claim 8, where each of the set of fragments is a sentence.
12. The method of claim 11, wherein determining the set of fragments is done using a machine learning model trained on sentence markers for multiple languages.
13. The method of claim 8, further comprising presplitting the text of the object based upon one or more markers before determining the set of fragments.
14. The method of claim 8, wherein analyzing the portion of the text of that fragment based on the language associated with that fragment comprises selecting a language model of a set of pluggable language models.
15. A non-transitory computer readable medium, comprising instructions for: receiving a multilingual object; determining a set of fragments of the text of the object, where each of the set of fragments comprises a portion of the text of the object;
determining a language associated with each fragment of the determined set of fragments of the object; determining a set of tokens for each fragment of the determined set of fragments, where determining the set of tokens for a fragment comprises analyzing the portion of the text of that fragment based on the language associated with that fragment; and indexing the set of tokens determined for each fragment in an index in association with the object.
16. The non-transitory computer readable medium of claim 15, wherein the set of tokens are associated with a single field of the object.
17. The non-transitory computer readable medium of claim 15, wherein the indexed set of tokens are associated with a single field of the object.
18. The non-transitory computer readable medium of claim 15, where each of the set of fragments is a sentence.
19. The non-transitory computer readable medium of claim 18, wherein determining the set of fragments is done using a machine learning model trained on sentence markers for multiple languages.
20. The non-transitory computer readable medium of claim 15, further comprising presplitting the text of the object based upon one or more markers before determining the set of fragments.
21. The non-transitory computer readable medium of claim 15, wherein analyzing the portion of the text of that fragment based on the language associated with that fragment comprises selecting a language model of a set of pluggable language models.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/962,157 | 2022-10-07 | ||
US17/962,157 US20240119076A1 (en) | 2022-10-07 | 2022-10-07 | System and method for hybrid multilingual search indexing |
US17/962,177 US20240119070A1 (en) | 2022-10-07 | 2022-10-07 | System and method for hybrid multilingual search indexing |
US17/962,177 | 2022-10-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024075086A1 (en) | 2024-04-11 |
Family
ID=90607647
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2023/060075 WO2024075086A1 (en) | 2022-10-07 | 2023-10-06 | System and method for hybrid multilingual search indexing |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024075086A1 (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6658377B1 (en) * | 2000-06-13 | 2003-12-02 | Perspectus, Inc. | Method and system for text analysis based on the tagging, processing, and/or reformatting of the input text |
US6842730B1 (en) * | 2000-06-22 | 2005-01-11 | Hapax Limited | Method and system for information extraction |
US20120095748A1 (en) * | 2010-10-14 | 2012-04-19 | Microsoft Corporation | Language Identification in Multilingual Text |
EP2807535A1 (en) * | 2012-01-27 | 2014-12-03 | Touchtype Limited | User data input prediction |
US20170364510A1 (en) * | 2016-06-21 | 2017-12-21 | EMC IP Holding Company LLC | Method and device for processing a multi-language text |
US20190108276A1 (en) * | 2017-10-10 | 2019-04-11 | NEGENTROPICS Mesterséges Intelligencia Kutató és Fejlesztõ Kft | Methods and system for semantic search in large databases |
US20210334299A1 (en) * | 2020-04-24 | 2021-10-28 | Roblox Corporation | Language detection of user input text for online gaming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23874424; Country of ref document: EP; Kind code of ref document: A1 |