WO2000049526A1 - Similarity searching by combination of different data-types - Google Patents

Similarity searching by combination of different data-types

Publication number
WO2000049526A1
Authority
WO
WIPO (PCT)
Prior art keywords
elements
document
data type
data types
searching
Application number
PCT/GB2000/000489
Other languages
French (fr)
Inventor
William Sharpe
Roland John Burns
Original Assignee
Hewlett-Packard Company
Application filed by Hewlett-Packard Company
Priority to EP00903814A (patent EP1072001A1)
Priority to JP2000600197A (patent JP2002537604A)
Publication of WO2000049526A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Data Mining & Analysis
  • Databases & Information Systems
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Library & Information Science
  • Computational Linguistics
  • Information Retrieval, Db Structures And Fs Structures Therefor
  • Image Analysis

Abstract

This invention relates to managing multiple web servers, a web service system and method that allows a system operator to distribute content to each web server in the web service system and notifying a computer, such as a cache server, of content changes. In one embodiment, a method for notifying a computer of changed files includes identifying changes in a source file set, storing the identified changes in a modification list and transmitting the modification list to a computer. In one embodiment, a method for replicating changes in a source file set on a destination file system and for notifying a computer of the changes includes identifying changes in a source file set, storing the changes in a first modification list, and transmitting the first modification list to an agent having access to a destination file system.

Description

Similarity searching by combination of different data-types
Field of Invention
The present invention relates to a method and means for searching to find similar documents in response to a query. The invention is particularly relevant to the use of one document as a query for a search to obtain similar documents.
Description of Prior Art
Similarity searching in databases of electronically stored documents is an important area of practical application. Such searching is well known for text. Typically, the input for such searching would be a text string, and the engine would then search the database matching entries against the text string and return entries with an acceptable similarity threshold. Similar searching is available for images - an example is the IBM Corporation QBIC (Query by Image Content) package, described at and available from http://wwwqbic.almaden.ibm.com/.
Research has also been done on using structural analysis of a document in searching, particularly at the German Research Center for Artificial Intelligence GmbH (DFKI) in systems such as Office Maid and SALT. These systems are further described at http://www.dfld.uni-kl.de.
Existing techniques are effective when the query is of essentially one data type: a text string only, or an image only. In general, however, an electronic document will consist of a combination of a number of data types: a typical document might contain one or more text passages, one or more images, and line art. The text passages may also be readily sub-dividable into different types, such as headings, legends, and bulk text. Using existing techniques as indicated above, similarity searching will involve extraction of one element in a particular data type followed by similarity searching appropriate to that data type. An example of such a sequential approach is found in US Patent No. 6,002,798. This provides for an initial structural analysis of a document into areas of different type: not simply into image plus text, but also into areas of different functional significance (e.g. title, heading, text block). This structural information is then used to allow user searching and text indexing in chosen functional elements of the document. This mechanism is particularly useful for making the problem of text searching in complex documents more tractable - it is not, however, effective to allow searching for documents which are as a whole similar to a query document.
It is desirable to provide methods of similarity searching which allow the features of the document to be used appropriately in a search that is properly representative of the full document.
Summary of Invention
Accordingly, in a first aspect the invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; for one or more of the elements in a first data type, conducting a first data type similarity search to return match results from the database for the one or more elements in the first data type; for one or more of the elements in a second data type, conducting a second data type similarity search to return match results from the database for the one or more elements in the second data type; combining the match results from the first data type similarity search and the second data type similarity search to provide query document match results.
Advantageously, results from each query document match may be combined to allow progressive refinement of queries using any of the data types either singly or in further combination.
In a second aspect, the invention provides a method of searching a database to find documents similar to a query document, comprising: decomposing the query document into elements of different data types; determining a layout element in a layout datatype from the spatial arrangement of the elements in the document; for the layout element, conducting a layout similarity search to return match results from the database for the layout element.
Brief Description of Figures
Specific embodiments of the invention are described below, by way of example, with reference to the accompanying drawings, of which:
Figure 1 shows a typical document page containing different data types;
Figure 2 shows steps in a method according to an embodiment of a first aspect of the invention for conducting a similarity search for the document shown in Figure 1;
Figure 3 shows the representation of the document shown in Figure 1 as a layout of datatypes, and indicates a search step usable in a further embodiment of the method of the invention; and
Figure 4 shows steps in a method according to an embodiment of the second aspect of the invention for conducting a similarity search for layout information.
Description of Embodiments
A typical document contains a plurality of data types. The most basic data types are text and images. Document 1 shown in Figure 1 contains a text block 12 - this text block is data in a first data type. Document 1 also contains two different kinds of image. One kind, image block 13, is a photographic image, typically consisting of an array of pixels in which each pixel has a colour value. The other kind, line art block 11, is also an image but a "drawn" one, readily representable as a combination of geometric or formulaic elements - and as such, typically readily scalable. Photographic images and line art images (hereafter "pictures" and "graphics") respond differently to different image processing and analysis techniques, and are most effectively treated as different data types. Moreover, pictures and graphics will generally serve a different purpose in a document, so it is also practical for the purpose of similarity searching to treat pictures and graphics separately. The steps involved in similarity searching for the document of Figure 1 according to an embodiment of the first aspect of the invention are shown in Figure 2.
Firstly the document 1 is selected in step 21. For an electronic document, this could be achieved through any appropriate application capable of supporting the file type or file types of the document. For a physical document, this could be achieved by scanning the document using a scanner.
Secondly in step 22, the document is decomposed into separate elements: in the case of document 1, these elements are graphic block 11, text block 12, and picture block 13. In the case of text block 12, it is desirable for optical character recognition to be carried out at this point so that the text block element resulting from decomposition consists of ASCII text. Decomposition of the document is achieved by an analysis and recognition process through which the different parts of the document are recognised as being text, pictures or graphics. Decomposition of a document into separate data types in this way is known, using for example techniques identified in "Block Segmentation and Text Extraction in Mixed Text/Image Documents" by FM Wahl, KY Wong and RG Casey, Computer Graphics and Image Processing, Vol. 20 (1982) (a further example is provided in US Patent No. 6,002,798). Software adapted for use with proprietary scanners to decompose the elements of a scanned page into separate data types (in order to optimise the scanning process for each data type) is provided by Hewlett-Packard Company as "HP PrecisionScan". The output of HP PrecisionScan is a set of elements each in a single data type, each of which can be selected for further processing.
The result of decomposition is a set of elements, each element having a single data type. For a particular data type, such as text, then either all text is determined to be part of a single element, or else physically distinct areas of text are considered as separate elements, depending on how the decomposition is carried out. In one version of the embodiment all the elements of the document are used in similarity searching: in other versions one or more of the elements are selected for use in similarity searching (or the user is even allowed an opportunity to select part of an element for such further processing). Separate elements are then used in similarity searching 23, 24 against a database, for example a database representing content available on the World Wide Web. Should all the elements be of one data type, this reduces to a conventional similarity searching problem addressable with a single search engine for the relevant data type. However, if elements are of different data types, then separate search engines are used for each data type. Appropriate search engines for similarity searching for different data types are known. For example, for text, appropriate linguistic matching toolkits are available from Teragram Corporation (http://www.teragram.com) and Inxight Software, Inc. (http://www.inxight.com/). In each case an appropriate preconditioning step 23 is desirable before the matching step 24, as will be discussed briefly in relation to the main data types below.
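The routing of decomposed elements to per-data-type search engines might be sketched as follows. This is only an illustration under stated assumptions: the `Element` class, the `search_by_type` function and the engine signature (an element in, a mapping of document identifier to raw score out) are hypothetical names introduced here, not part of any toolkit named above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Element:
    """One decomposed element (hypothetical structure for illustration)."""
    data_type: str   # e.g. "text", "picture", "graphic"
    content: object  # payload produced by the decomposition step


# Assumed engine interface: one element in, {document_id: raw_score} out.
SearchEngine = Callable[[Element], Dict[str, float]]


def search_by_type(elements: List[Element],
                   engines: Dict[str, SearchEngine]) -> Dict[str, List[Dict[str, float]]]:
    """Send each element to the engine registered for its data type."""
    results = defaultdict(list)
    for element in elements:
        engine = engines.get(element.data_type)
        if engine is None:
            continue  # no engine for this data type, so the element is skipped
        results[element.data_type].append(engine(element))
    return dict(results)
```

If every element has the same data type, only one engine is ever called, which corresponds to the conventional single-engine case described above.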
For example, Inxight Summarizer is a software component technology that summarises a document by extracting key sentences from the document. This is the preconditioning step 23. These summaries can then be matched against each other in the matching step 24. Inxight Summarizer generates indicative summaries that contain key sentence elements from a document. The essence of the text is extracted by stemming and text normalisation technology to obtain a concise and canonical synopsis of the text. "Stemming" is the replacement of a word by its root and part-of-speech (e.g. "I had wanted" -> "to want/first person/pluperfect"), whereas "normalisation" involves replacement of one of several forms with a "concept" (e.g. "2/3/99", "Feb 2, 1999" and "2nd February" are all alternate forms of the same concept). The matching step 24 can then be carried out on the stemmed and normalised results of the preconditioning step 23 with confidence that text content which is genuinely similar will be matched without adverse influence from unwanted syntax considerations.
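A toy version of preconditioning followed by matching is sketched below. It does not reproduce the behaviour of Inxight Summarizer or any other product: the suffix list, the date pattern, the concept token `DATE_1999_02_02` and the use of Jaccard overlap as the matching measure are all assumptions chosen for brevity, and the suffix stripping is far cruder than the root and part-of-speech stemming described above.

```python
import re

# Hypothetical toy tables; real toolkits rely on full linguistic resources.
SUFFIXES = ("ing", "ed", "s")
DATE_FORMS = re.compile(r"\b(2/3/99|2nd february)\b")


def precondition(text: str) -> set:
    """Lower-case, map known date forms to one concept token, strip suffixes."""
    text = DATE_FORMS.sub("DATE_1999_02_02", text.lower())
    stems = set()
    for token in re.findall(r"[A-Za-z0-9_]+", text):
        for suffix in SUFFIXES:
            # Strip at most one suffix, and only from reasonably long words.
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stems.add(token)
    return stems


def jaccard(a: set, b: set) -> float:
    """Overlap of two preconditioned term sets, in the range 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

With these tables, "She wanted it on 2/3/99" and "He wants it on 2nd February" share the stem "want" and the same date concept, so they match more closely than a comparison of the raw strings would suggest.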
An example of an image searching tool is the IBM QBIC package, as indicated above. QBIC is further described at http://wwwqbic.almaden.ibm.com/. This package is adapted to precondition the images by analysing for a number of different criteria, such as colour percentages, colour layout, and textures occurring in the images. These criteria are then used in combination in a matching step 24. There are many other known applications of searching a 'new' image for known objects, from robot vision (a robot searching for parts in a bin) through to traffic monitoring systems (automatic detection of car license plates) - the present matching problem is essentially the inverse of these known problems.
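One of the criteria mentioned above, colour percentages, can be illustrated with a coarse colour histogram. The sketch below is not QBIC's algorithm: the function names, the quantisation into four levels per channel and the use of histogram intersection as the comparison are assumptions made for the example. An image is taken to be a flat list of (r, g, b) tuples with channel values from 0 to 255.

```python
from collections import Counter


def colour_percentages(pixels, levels=4):
    """Fraction of pixels in each coarse RGB bin (levels**3 possible bins)."""
    if not pixels:
        return {}
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """1.0 for identical colour distributions, 0.0 for disjoint ones."""
    return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in set(h1) | set(h2))
```

Preconditioning here is the call to `colour_percentages`, which would be done once per database image; matching is the cheap `histogram_intersection` between the query histogram and each stored histogram.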
It can be appreciated also that a serial approach could be used effectively: for example, first using a "straight edge" histogram to enable differentiation between natural and artificial scenes; then using an "edge length" histogram (a shortage of long edges probably indicates a natural scene); testing for a large area of blue tone at the top of the image (indicating an outdoor scene); and testing for significant elements of "flesh tones", indicating that there is an image containing representations of people - which can be followed by a face matching analysis to find the same faces. Clearly a combination of serial and parallel steps can be employed.
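One stage of such a serial chain, the test for a large area of blue tone at the top of the image, might look like the sketch below. The image representation (a list of rows of (r, g, b) tuples), the definition of "blue" by a channel margin, the choice of the top third of the rows and the 0.6 threshold are all assumptions for illustration; the text above does not specify any of them.

```python
def is_blue(pixel, margin=40):
    """True if the blue channel clearly dominates red and green (assumed rule)."""
    r, g, b = pixel
    return b > r + margin and b > g + margin


def blue_top_fraction(image):
    """Fraction of blue-dominant pixels in the top third of the rows."""
    rows = image[: max(1, len(image) // 3)]
    pixels = [p for row in rows for p in row]
    if not pixels:
        return 0.0
    return sum(is_blue(p) for p in pixels) / len(pixels)


def looks_outdoor(image, threshold=0.6):
    """Crude outdoor-scene indicator; a later stage could then test for people."""
    return blue_top_fraction(image) >= threshold
```

In a serial arrangement, a result from `looks_outdoor` would decide which further tests are worth running on the image.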
The result of the similarity searching is a set of series of matching scores for documents in the database, such a set existing for each element searched. Each of these search scores needs to be normalised 25 for combination 26 to achieve a combined search result 27. The normalisation step 25 is to ensure that a correct balance is given to the results of the different searching steps 24. This can either be to weight each element of the document equally, to weight each element of the document according to its perceived importance in the document, or according to a user assessment of the relative importance of the different elements of the document.
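The normalisation step 25 and combination step 26 can be sketched as min-max scaling followed by a weighted mean. Both choices are assumptions: the text above requires only that the results be balanced before combination, and the function names and the treatment of documents missing from one result set (they contribute zero for that element) are likewise illustrative.

```python
def normalise(scores):
    """Min-max scale raw scores to the range 0 to 1; constant sets map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(per_element_scores, weights=None):
    """Weighted mean of normalised scores, returned best match first."""
    names = list(per_element_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weighting by default
    total_weight = sum(weights[name] for name in names)
    combined = {}
    for name in names:
        for doc, score in normalise(per_element_scores[name]).items():
            combined[doc] = combined.get(doc, 0.0) + weights[name] * score / total_weight
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Passing `weights` corresponds to weighting by perceived or user-assessed importance; omitting it corresponds to weighting each element equally.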
A preferred solution may involve a mixture of automatic and manual weighting. A particularly effective approach is to use synopsis generation techniques on the textual part to produce a set of textual search criteria and also to present a set of possible criteria based on the non-textual parts. These criteria are then presented to the user for verification. Such a user based approach is easy to use (and it is also easy for a user to tell when it is ineffective). For example, a user may be asked if he/she wanted to search for things that matched the textual synopses, or, for the image and drawing parts, whether he/she wanted "this person", "scenes like this", "pictures containing this object"... or "pages that look like this one". The combined result 27 is as for conventional similarity searching: a series of matching scores (generally expressed as percentages) listing documents in the database from best towards worst matches.
Generally, most effective user querying will be achieved where it is possible for the user to achieve successive refinement of the user query - using the results of one round of querying as a basis for constructing the next round of querying - so in practice the combined result 27 will frequently be fed back to a later selection step to allow effective iterative searching.
Further use can be made of information derived from page decomposition in similarity searching. In addition to the separate elements provided by page decomposition (graphic 11, text block 12, and picture 13), further information is provided in the arrangement of the different elements within the document. As is shown in Figure 3, a further output available from page decomposition is a data type plan 31 representing the document as a line art block, a text block, and an image block, arranged vertically in sequence - decomposition into layouts is discussed in US Patent No. 6,002,798. However, the present inventors have appreciated that this data type plan can itself be used as a layout data type. This allows yet another element - the layout data type element - to be used in searching 32 of a database (provided that layout information is available in or derivable from the database entries). The results of similarity searching for such a layout element can be combined with similarity searches for other elements exactly as described in Figure 2, with layout data type 31 emerging from the decomposition step 22 and then being used in a searching step 32 equivalent and parallel to searching steps 23 and 24 (followed by a normalisation step before combination in step 26 with results from other data types).
In an embodiment according to the second aspect of the invention, similarity searching is conducted using the layout data type alone. The steps to be followed are essentially as in conventional similarity searching - this is shown in Figure 4, with elements common to the first aspect of the invention given the same reference numbers as in Figure 2. Layout similarity searching, whether used on its own or as one of the elements in a combined search as described in the first aspect of the invention, is more powerful if a number of different data types are used for text and for overall document type. Using a rule-based approach, different text blocks and whole documents, especially in the case of formal workflow documents, can be assigned particular functions with relatively high confidence. For example, it is well known that isolated text blocks at the top of a page and handwriting at the bottom are suggestive of a letter, and so different spatial regions of the document can be assigned to appropriate functional fields (address, letter text etc) - likewise, table and currency totals in a document can be identified as a discrete element, and their presence limits the document to another group (bill, quote or invoice). Layout searching can thus involve matching to templates representing different workflow document types (thus promoting matching of a document determined to be a letter against other letters). An appropriate mechanism is to normalise a layout for size, orientation and skew, and then carry out an "exclusive or" operation on the query element and the layout records in the database - this will be effective provided that all records involved have a broadly common format.
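The "exclusive or" comparison of layouts can be sketched on a coarse grid. The sketch assumes that normalisation for size, orientation and skew has already been done, so that each layout is a grid of the same dimensions in which every cell holds a data type label; the labels "G", "T" and "P" for graphic, text and picture, and the function name, are hypothetical.

```python
def layout_similarity(plan_a, plan_b):
    """Share of grid cells whose data type labels agree (1 minus XOR mismatch rate)."""
    same_shape = len(plan_a) == len(plan_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(plan_a, plan_b)
    )
    if not same_shape:
        raise ValueError("layouts must be normalised to the same grid size first")
    cells = [(a, b) for row_a, row_b in zip(plan_a, plan_b) for a, b in zip(row_a, row_b)]
    if not cells:
        raise ValueError("layouts must contain at least one cell")
    mismatches = sum(a != b for a, b in cells)  # cells where the XOR is "true"
    return 1.0 - mismatches / len(cells)
```

For a query plan of `["GG", "TT", "TT", "PP"]`, a database record with the same plan scores 1.0, and a record whose top row is text instead of a graphic scores 0.75.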
The difficulty of this problem depends on the nature and type of documents that are to be considered for matching. If the "universe" of documents is well defined, then there are tools available that can do an accurate job of classifying and labelling within that universe (e.g. OfficeMaid from DFKI). What is required in this case is classification according to a set of conventions laid down for the various classes of documents available for consideration. Conventions are here essentially rules that need not be closely followed: consequently an appropriate approach to this problem is rule-based (most conveniently using fuzzy rules). Training of a neural network would also be an effective approach to adopt. The skilled person will appreciate how conventional fuzzy rule or neural network approaches could be adapted for use in a solution to this problem.
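As a rough sketch of how fuzzy rules might express such conventions, the example below scores two document classes using the cues mentioned above. The feature names, the use of `min` as the fuzzy AND, and the function `classify` are illustrative assumptions; they are not taken from OfficeMaid or from any particular fuzzy-logic toolkit.

```python
from typing import Dict, Tuple


def classify(features: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the best-supported document class and the support for each class.

    `features` maps hypothetical cue names to degrees of truth in [0, 1];
    missing cues are treated as 0.0 (not observed).
    """
    def f(name: str) -> float:
        return features.get(name, 0.0)

    # Each rule is a fuzzy AND (minimum) of the cues that suggest the class.
    support = {
        "letter": min(f("isolated_text_at_top"), f("handwriting_at_bottom")),
        "bill_quote_or_invoice": min(f("table_present"), f("currency_total_present")),
    }
    best = max(support, key=lambda label: support[label])
    return best, support


if __name__ == "__main__":
    label, support = classify({
        "isolated_text_at_top": 0.9,
        "handwriting_at_bottom": 0.7,
        "table_present": 0.1,
    })
    print(label, support)
```

Because each rule yields a degree of support rather than a hard decision, a document that only loosely follows a convention can still be assigned to the most plausible class.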
The skilled man will appreciate that modifications of the embodiments described above can readily be carried out without departing from the invention as defined in the claims.

Claims

1. A method of searching a database to find documents similar to a query document, comprising:
decomposing the query document into elements of different data types;
for one or more of the elements in a first data type, conducting a first data type similarity search to return match results from the database for the one or more elements in the first data type;
for one or more of the elements in a second data type, conducting a second data type similarity search to return match results from the database for the one or more elements in the second data type;
combining the match results from the first data type similarity search and the second data type similarity search to provide query document match results.
2. A method as claimed in claim 1, wherein one of the data types is representative of text.
3. A method as claimed in claim 2, wherein a plurality of the data types are representative of text, separate data types of the plurality being representative of different functional blocks of text.
4. A method as claimed in any preceding claim, wherein one of the data types is representative of pictorial images.
5. A method as claimed in any preceding claim, wherein one of the data types is representative of graphical images.
6. A method as claimed in any preceding claim, wherein one of the data types is representative of the arrangement of other data types within the document.
7. A method as claimed in any preceding claim, wherein the step of similarity searching to return match results is carried out, separately, for a plurality of elements having between them more than two data types.
8. A method as claimed in any preceding claim, where all features of a common data type in the document are treated as one element.
9. A method as claimed in any of claims 1 to 7, where spatially distinct features of a common data type in the document are treated as separate elements.
10. A method as claimed in any preceding claim, wherein elements are user selectable or deselectable for the step of similarity searching.
11. A method as claimed in any preceding claim, wherein the similarity searching results for separate elements are weighted before combination.
12. A method as claimed in claim 11, wherein said weighting is user selected.
13. A method as claimed in claim 11, wherein said weighting is attributed according to a determined significance of each relevant element in the document.
14. A method of searching a database to find documents similar to a query document, comprising:
decomposing the query document into elements of different data types;
determining a layout element in a layout data type from the spatial arrangement of the elements in the document;
for the layout element, conducting a layout similarity search to return match results from the database for the layout element.
15. A method as claimed in claim 14, wherein the layout similarity search involves searching against templates representative of different document types.
16. A method as claimed in claim 14, wherein the elements include elements of separate data types representative of different functional blocks of text.
17. A method as claimed in claim 14 or claim 16, wherein the elements include elements of data types representative of images.
PCT/GB2000/000489 1999-02-16 2000-02-15 Similarity searching by combination of different data-types WO2000049526A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP00903814A EP1072001A1 (en) 1999-02-16 2000-02-15 Similarity searching by combination of different data-types
JP2000600197A JP2002537604A (en) 1999-02-16 2000-02-15 Document similarity search

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB9903451.4A GB9903451D0 (en) 1999-02-16 1999-02-16 Similarity searching for documents
GB9903451.4 1999-02-16

Publications (1)

Publication Number Publication Date
WO2000049526A1 (en) 2000-08-24

Family

ID=10847827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2000/000489 WO2000049526A1 (en) 1999-02-16 2000-02-15 Similarity searching by combination of different data-types

Country Status (4)

Country Link
EP (1) EP1072001A1 (en)
JP (1) JP2002537604A (en)
GB (1) GB9903451D0 (en)
WO (1) WO2000049526A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19708265A1 (en) * 1996-03-01 1997-09-04 Ricoh Kk Search process for document-image database

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAMANO T: "A SIMILARITY RETRIEVAL METHOD FOR IMAGE DATABASES USING SIMPLE GRAPHICS", PROCEEDINGS OF WORKSHOP ON LANGUAGES FOR AUTOMATION,US,WASHINGTON, IEEE COMP. SOC. PRESS, vol. -, 1988, pages 149 - 154, XP000118740, ISBN: 0-8186-0890-0 *
MUKHERJEA S ET AL: "Towards a multimedia World-Wide Web information retrieval engine", COMPUTER NETWORKS AND ISDN SYSTEMS,NL,NORTH HOLLAND PUBLISHING. AMSTERDAM, vol. 29, no. 8-13, 1 September 1997 (1997-09-01), pages 1181 - 1191, XP004095315, ISSN: 0169-7552 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002108936A (en) * 2000-10-03 2002-04-12 Canon Inc Information retrieving device, control method therefor and computer readable storage medium
WO2002077872A2 (en) * 2001-03-02 2002-10-03 United States Of America As Represented By The Administrator Of The National Aeronotics And Space Administration System, method and apparatus for discovering phrases in a database
WO2002077872A3 (en) * 2001-03-02 2004-01-29 Nasa System, method and apparatus for discovering phrases in a database
EP1353278A2 (en) * 2002-04-10 2003-10-15 Software Engineering GmbH Comparison of source files
EP1353278A3 (en) * 2002-04-10 2005-01-12 Software Engineering GmbH Comparison of source files
CN100458773C (en) * 2003-04-30 2009-02-04 佳能株式会社 Information processing apparatus, method, storage medium and program
EP1473642A3 (en) * 2003-04-30 2005-11-02 Canon Kabushiki Kaisha Information processing apparatus, method, storage medium and program
US7593961B2 (en) 2003-04-30 2009-09-22 Canon Kabushiki Kaisha Information processing apparatus for retrieving image data similar to an entered image
US8650175B2 (en) 2005-03-31 2014-02-11 Google Inc. User interface for facts query engine with snippets from information sources that include query terms and answer terms
US8065290B2 (en) 2005-03-31 2011-11-22 Google Inc. User interface for facts query engine with snippets from information sources that include query terms and answer terms
US8224802B2 (en) 2005-03-31 2012-07-17 Google Inc. User interface for facts query engine with snippets from information sources that include query terms and answer terms
US7953720B1 (en) 2005-03-31 2011-05-31 Google Inc. Selecting the best answer to a fact query from among a set of potential answers
CN100414550C (en) * 2005-08-09 2008-08-27 佳能株式会社 Image processing apparatus for image retrieval and control method therefor
EP1752895A1 (en) * 2005-08-09 2007-02-14 Canon Kabushiki Kaisha Image processing apparatus for image retrieval and control method therefor
US7746507B2 (en) 2005-08-09 2010-06-29 Canon Kabushiki Kaisha Image processing apparatus for image retrieval and control method therefor
US9530229B2 (en) 2006-01-27 2016-12-27 Google Inc. Data object visualization using graphs
US7925676B2 (en) 2006-01-27 2011-04-12 Google Inc. Data object visualization using maps
US8954426B2 (en) 2006-02-17 2015-02-10 Google Inc. Query language
US8055674B2 (en) 2006-02-17 2011-11-08 Google Inc. Annotation framework
US7620721B2 (en) 2006-02-28 2009-11-17 Microsoft Corporation Pre-existing content replication
US9892132B2 (en) 2007-03-14 2018-02-13 Google Llc Determining geographic locations for place names in a fact repository
WO2008129373A2 (en) * 2007-04-24 2008-10-30 Nokia Corporation Method, device and computer program product for integrating code-based and optical character recognition technologies into a mobile visual search
WO2008129373A3 (en) * 2007-04-24 2008-12-18 Nokia Corp Method, device and computer program product for integrating code-based and optical character recognition technologies into a mobile visual search
US8670597B2 (en) 2009-08-07 2014-03-11 Google Inc. Facial recognition with social network aiding
US10515114B2 (en) 2009-08-07 2019-12-24 Google Llc Facial recognition with social network aiding
US10534808B2 (en) 2009-08-07 2020-01-14 Google Llc Architecture for responding to visual query
US10031927B2 (en) 2009-08-07 2018-07-24 Google Llc Facial recognition with social network aiding
US9208177B2 (en) 2009-08-07 2015-12-08 Google Inc. Facial recognition with social network aiding
US9087059B2 (en) 2009-08-07 2015-07-21 Google Inc. User interface for presenting search results for multiple regions of a visual query
US9135277B2 (en) 2009-08-07 2015-09-15 Google Inc. Architecture for responding to a visual query
WO2011017557A1 (en) 2009-08-07 2011-02-10 Google Inc. Architecture for responding to a visual query
US8811742B2 (en) 2009-12-02 2014-08-19 Google Inc. Identifying matching canonical documents consistent with visual query structural information
US9405772B2 (en) 2009-12-02 2016-08-02 Google Inc. Actionable search results for street view visual queries
US9183224B2 (en) 2009-12-02 2015-11-10 Google Inc. Identifying matching canonical documents in response to a visual query
US9087235B2 (en) 2009-12-02 2015-07-21 Google Inc. Identifying matching canonical documents consistent with visual query structural information
US8977639B2 (en) 2009-12-02 2015-03-10 Google Inc. Actionable search results for visual queries
US8805079B2 (en) 2009-12-02 2014-08-12 Google Inc. Identifying matching canonical documents in response to a visual query and in accordance with geographic information
US9852156B2 (en) 2009-12-03 2017-12-26 Google Inc. Hybrid use of location sensor data and visual query to return local listings for visual query
US10346463B2 (en) 2009-12-03 2019-07-09 Google Llc Hybrid use of location sensor data and visual query to return local listings for visual query
US9372920B2 (en) 2012-08-08 2016-06-21 Google Inc. Identifying textual terms in response to a visual query
US8935246B2 (en) 2012-08-08 2015-01-13 Google Inc. Identifying textual terms in response to a visual query

Also Published As

Publication number Publication date
JP2002537604A (en) 2002-11-05
EP1072001A1 (en) 2001-01-31
GB9903451D0 (en) 1999-04-07

Similar Documents

Publication Publication Date Title
WO2000049526A1 (en) Similarity searching by combination of different data-types
US6029167A (en) Method and apparatus for retrieving text using document signatures
Lesk Practical digital libraries: Books, bytes, and bucks
US6826576B2 (en) Very-large-scale automatic categorizer for web content
US7809695B2 (en) Information retrieval systems with duplicate document detection and presentation functions
US8296295B2 (en) Relevance ranked faceted metadata search method
US6741743B2 (en) Imaged document optical correlation and conversion system
US7801893B2 (en) Similarity detection and clustering of images
EP1585073B1 (en) Method for duplicate detection and suppression
US8566305B2 (en) Method and apparatus to define the scope of a search for information from a tabular data source
US5926565A (en) Computer method for processing records with images and multiple fonts
US6178417B1 (en) Method and means of matching documents based on text genre
US20080010263A1 (en) Search engine
US20090327250A1 (en) Method and apparatus for searching and resource discovery in a distributed enterprise system
US20080147642A1 (en) System for discovering data artifacts in an on-line data object
Shin et al. Document Image Retrieval Based on Layout Structural Similarity.
Aslandogan et al. Evaluating strategies and systems for content based indexing of person images on the Web
WO2008005493A2 (en) Relevance ranked faceted metadata search method and search engine
CN116450913A (en) Retrieval method, retrieval device, server and computer readable storage medium
JP3841318B2 (en) Icon generation method, document search method, and document server
JP2000231560A (en) Automatic document classification system
Van Der Merwe The integration of document image processing and text retrieval principles
Gelfand et al. Discovering concepts in raw text: Building semantic relationship graphs
CN116226333A (en) Auxiliary recitation method, device and equipment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

WWE Wipo information: entry into national phase

Ref document number: 2000903814

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office

Ref document number: 2000903814

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 09647266

Country of ref document: US

WWW Wipo information: withdrawn in national office

Ref document number: 2000903814

Country of ref document: EP