WO2008130501A1 - Processing and searching of unstructured or semi-structured documents and generation of value-based information - Google Patents

Info

Publication number
WO2008130501A1
WO2008130501A1 PCT/US2008/004545 US2008004545W WO2008130501A1 WO 2008130501 A1 WO2008130501 A1 WO 2008130501A1 US 2008004545 W US2008004545 W US 2008004545W WO 2008130501 A1 WO2008130501 A1 WO 2008130501A1
Authority
WO
WIPO (PCT)
Prior art keywords
product
document
products
recited
feature
Prior art date
Application number
PCT/US2008/004545
Other languages
English (en)
Inventor
Aditya Vailaya
Jiang Wu
Manish Rathi
Kirk Chen
Original Assignee
Retrevo, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/737,660 (US8504553B2)
Priority claimed from US11/737,668 (US8290967B2)
Priority claimed from US11/737,684 (US7917493B2)
Priority claimed from US11/963,684 (US20080255925A1)
Application filed by Retrevo, Inc.
Publication of WO2008130501A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Definitions

  • the present invention relates to document processing and/or searching and/or presentation, and more particularly, this invention relates to document preparation, indexing, and/or searching and/or presentation of information.
  • large documents such as product related documents
  • the result is that users must first locate a document, and then open the document in a specific document reader, e.g. a PDF reader, and then manually search again within the document to find the right section and page for the answers.
  • a method for analyzing and indexing an unstructured or semistructured document includes receiving an unstructured or semistructured document; converting the document to one or more text streams; analyzing the one or more text streams for identifying textual contents of the document; analyzing the one or more text streams for identifying logical sections of the document; associating the textual contents with the logical sections; indexing the textual contents and their association with the logical sections; and saving a result of the indexing in a data storage device.
  • the unstructured or semistructured document may be in a printer format, may be a binary representation of dark and light areas of a scanned document, may not contain format markers, etc.
  • a context of the unstructured document is identified and meta data representing a context of at least some of the sections is generated.
  • the meta data may be indexed.
  • the context information may be based on a word extracted from the document and matched to a term in a context-related dictionary.
  • the textual contents include words in the document, but could include other objects, markings, symbols, etc.
  • the sections include groups of paragraphs of the document, each paragraph being individually detected by analyzing the one or more text streams.
  • Taxonomy-related information may be associated with the textual contents, and indexed.
  • the indexing includes assigning a weight to the textual contents. For example, an index of the document may be analyzed, and a higher weight may be assigned to a textual content matching a term in the index of the document and present on a page of the document pointed to in the index in association with the term.
  • a table of contents and/or page numbers may also be extracted from the unstructured or semistructured document.
  • the physical page numbers in the unstructured or semistructured document may be identified, and logical page numbers mapped to the physical page numbers.
  • a method for analyzing an unstructured or semistructured document includes receiving an unstructured or semistructured document; converting the document to one or more text streams; analyzing the one or more text streams for identifying paragraphs of the document; grouping the paragraphs into sections; and outputting the sections, or derivative thereof, to at least one of a user, another system, and another process.
  • Page numbers may be extracted from the document, and the sections associated with the page numbers.
  • the paragraphs are identified by: determining geometry information about lines of the text streams; placing the lines into blocks based on a proximity of the lines relative to each other; and analyzing the blocks for joining lines into paragraphs.
  • Boundaries of the sections may be determined at least in part based on an analysis of a table of contents of the document.
  • a method for analyzing and indexing an unstructured or semistructured document includes receiving an unstructured or semistructured document; converting the document to one or more text streams; analyzing the one or more text streams for identifying textual contents of the document; identifying logical sections of the document; associating the textual contents with the sections; analyzing the one or more text streams for identifying context information about each section; indexing the textual contents, the context information, and the association of the textual contents and context information with the logical sections; and saving a result of the indexing in a data storage device.
  • a method for processing a search query includes receiving a search query containing terms; looking up at least some of the terms in a search index for identifying sections of documents containing the at least some of the terms; generating a content score for each of the documents based at least in part on a number of keywords found in the sections of each document; looking up at least some of the terms in the search index for attempting to match one or more of the terms to context information in the search index, the context information being associated with at least one of the documents; generating a context score based at least in part on the matching of terms to the context information; generating a document score for each of the documents based at least in part on the content score and the context score; and outputting an indicator of at least one of the documents, or portion thereof, for the at least one of the documents having a higher document score relative to other of the documents.
  • the looking up at least some of the terms in a search index for identifying sections of a document containing the at least some of the terms may include identifying paragraphs in the sections containing the at least some of the terms.
  • a paragraph score is calculated for each of the paragraphs based at least in part on a number of the terms appearing in each of the paragraphs, wherein a section score is calculated based on the paragraph scores of the paragraphs in the section.
  • the document score is calculated based at least in part on the section scores of the sections of the document.
  • An indicator of at least one of the sections having a higher paragraph score relative to other of the sections may be output.
  • a weighting is applied to each keyword found in the sections of the documents, the weighting affecting the content score. Also preferably, a weighting is applied to each keyword matching the context information, the weighting affecting the context score.
  • the indicator may be of a section of the at least one of the documents.
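The paragraph, section, and document scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `score_documents`, the input shapes, the use of plain sums to roll paragraph scores up into section and document scores, and the two weight parameters are all assumptions made for the example.

```python
from collections import defaultdict

def score_documents(query_terms, paragraphs, context,
                    content_weight=1.0, context_weight=1.0):
    """Rank documents by a combined content and context score.

    paragraphs: iterable of (doc_id, section_id, list_of_words)
    context:    dict mapping doc_id -> set of context terms (hypothetical shape)
    """
    terms = {t.lower() for t in query_terms}

    # Paragraph score: number of query-term occurrences in the paragraph.
    # Section score: sum of its paragraph scores (assumed aggregation).
    section_scores = defaultdict(float)
    for doc_id, section_id, words in paragraphs:
        paragraph_score = sum(1 for w in words if w.lower() in terms)
        section_scores[(doc_id, section_id)] += paragraph_score

    # Content score per document: sum of its section scores (assumed).
    content_scores = defaultdict(float)
    for (doc_id, _section_id), score in section_scores.items():
        content_scores[doc_id] += score

    # Context score: number of query terms matching the document's context terms.
    doc_scores = {}
    for doc_id, content in content_scores.items():
        doc_context = {c.lower() for c in context.get(doc_id, ())}
        context_score = len(terms & doc_context)
        doc_scores[doc_id] = content_weight * content + context_weight * context_score

    # Highest document score first.
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
```

Keeping the per-section scores in their own dictionary makes it straightforward to return the best section of a document as the indicator instead of the document as a whole.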
  • a method for processing a search query includes receiving a search query containing terms; looking up at least some of the terms in a search index for identifying sections of documents containing the at least some of the terms; generating a content score for each of the sections based at least in part on a number of keywords found in the sections of each document; and selecting and outputting an indicator of at least one of the sections, or portion thereof, based at least in part on the content score.
  • At least some of the terms in the search index may be looked up for attempting to match one or more of the terms to context information in the search index, the context information being associated with at least one of the documents; and generating a context score based at least in part on the matching of terms to the context information; wherein the selection of the at least one of the sections, or portion thereof, is also based at least in part on the context score.
  • an indicator of a paragraph of at least one of the sections is selected and output.
  • the selecting and outputting of the indicator of the at least one of the sections, or portion thereof may be further based at least in part on types of the documents.
  • the method may further comprise determining whether the search query includes a product identifier or portion thereof associated with a product, and if so, filtering out documents not relating to the product associated with the product identifier.
  • the method may further comprise determining whether the search query is directed to an entire document rather than one or more sections thereof, and if so, selecting and outputting an indicator of the document instead of the at least one of the sections, or portion thereof, of the document.
  • An index structure for keyword searches is embodied on a computer readable medium.
  • the index structure includes a plurality of content words; for each of the content words, at least one document identifier containing information about a document containing the content word; and for each of the document identifiers, at least one position identifier containing information about a section in the document containing the content word.
  • At least some of the position identifiers further contain information about a paragraph in the section of the document containing the content word. In another approach, at least some of the position identifiers further contain information about a page in the document containing the content word.
  • At least some of the position identifiers include a weighting value of the content word.
  • the weighting value may be based at least in part on a position of the content word in the document.
  • the index may also include context meta data associated with at least some of the documents, the context meta data indicating a context of the documents associated therewith.
  • the context meta data is weighted.
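One way to picture this index structure is as a nested mapping from content word to document identifier to a list of position identifiers. The sketch below is illustrative only; the class names `Position` and `SectionIndex`, the field names, and the in-memory dictionary layout are assumptions rather than the format described in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Position:
    """Position identifier: where a content word occurs inside a document."""
    section: int
    paragraph: Optional[int] = None   # optional paragraph within the section
    page: Optional[int] = None        # optional page in the document
    weight: float = 1.0               # e.g. higher for words in headings (assumed)

@dataclass
class SectionIndex:
    # content word -> document id -> position identifiers
    postings: Dict[str, Dict[str, List[Position]]] = field(default_factory=dict)
    # document id -> weighted context meta data (context term -> weight)
    context: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def add(self, word: str, doc_id: str, position: Position) -> None:
        """Record one occurrence of a content word."""
        self.postings.setdefault(word.lower(), {}).setdefault(doc_id, []).append(position)

    def lookup(self, word: str) -> Dict[str, List[Position]]:
        """Return all documents and positions for a content word."""
        return self.postings.get(word.lower(), {})
```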
  • a method for indexing a product identifier and logical parts thereof includes receiving a product identifier; splitting the product identifier into logical parts; indexing the product identifier and the individual logical parts in association with a particular document or portion thereof in an index; and storing the index.
  • a method for indexing a product identifier and variations thereof according to another embodiment of the present invention includes receiving a product identifier; splitting the product identifier into logical parts; indexing the product identifier and alternate combinations of the logical parts in association with a particular document or portion thereof in an index; and storing the index.
  • Where the product identifier comprises multiple logical parts separated by a space, at least some of the logical parts may be indexed as a single consecutive character string.
  • Where the product identifier comprises multiple logical parts separated by a punctuation mark, at least some of the logical parts may be indexed as a single consecutive character string.
  • the product identifier is indexed in a field for full terms, wherein the logical parts are indexed in a field for partial terms.
  • the logical parts may include an alphabetic part and a numeric part of the alphanumeric character string.
  • the alphabetic part and the numeric part may each be indexed in a field for partial strings.
  • the alphabetic part of an alphanumeric character string may also or alternatively be indexed in a field for alphabetic strings.
  • a method for processing a search query includes receiving a search query containing one or more terms; searching a search index containing complete product identifiers and variations thereof for attempting to match the one or more terms to the product identifiers or the variations thereof; and if one or more of the terms matches a complete product identifier or variation thereof, selecting and outputting an indicator of a document, or portion thereof, associated with the matching product identifier.
  • the variations of the product identifiers may include at least one of: parts of the product identifiers, continuous character strings, reordered logical parts of the product identifiers, alphabetical characters only, and numerical characters only.
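The splitting and variation steps above can be illustrated with a short sketch. The helper names `logical_parts` and `index_fields`, the field names in the returned dictionary, the set of separator characters, and the cap on the number of parts that are reordered are all assumptions made for this example.

```python
import re
from itertools import permutations

def logical_parts(product_id):
    """Split on spaces and punctuation, then at letter/digit boundaries."""
    parts = []
    for chunk in re.split(r"[\s\-_/.,]+", product_id.strip()):
        parts.extend(re.findall(r"[A-Za-z]+|\d+", chunk))
    return [p.lower() for p in parts]

def index_fields(product_id, max_parts=4):
    """Build the index fields for a product identifier and its variations."""
    parts = logical_parts(product_id)
    joined = "".join(parts)
    reordered = set()
    # Reordering is capped because permutations grow factorially (assumed limit).
    if 1 < len(parts) <= max_parts:
        reordered = {"".join(p) for p in permutations(parts)} - {joined}
    return {
        "full": product_id.strip().lower(),            # field for full terms
        "partial": parts,                              # field for partial terms
        "joined": joined,                              # single consecutive string
        "alpha": [p for p in parts if p.isalpha()],    # alphabetic parts only
        "numeric": [p for p in parts if p.isdigit()],  # numeric parts only
        "reordered": sorted(reordered),                # reordered logical parts
    }
```

With these fields, a query such as "t100" or "dsct100" can match a document indexed under "DSC-T100" even though the user omitted the separator or part of the identifier.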
  • a method is provided for generating value-based information. In use, statistical data is generated for particular features of a plurality of products based on prices of the products. Additionally, a base score for each of the features is generated based on the statistical data. Further, for each of at least some of the products, a product feature score is computed for the product based on the base scores of the features that the product has. Further still, for the at least some of the products, a representation of a value of each of the at least some of the products in relation to each other is output, where the representation of the value is based on the product feature score and the price for each of the products.
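As a rough sketch of this value computation: the text does not say which statistic is used at this point, so the example below assumes the base score of a feature is the mean price of the products that have it, the product feature score is the sum of the base scores of its features, and the value is the ratio of feature score to price. The function names and the input shape are hypothetical, and prices are assumed to be positive.

```python
from statistics import mean

def base_scores(products):
    """products: list of dicts with 'name', 'price', and 'features' (a set)."""
    prices_by_feature = {}
    for product in products:
        for feature in product["features"]:
            prices_by_feature.setdefault(feature, []).append(product["price"])
    # Assumed statistic: mean price of the products that have the feature.
    return {f: mean(prices) for f, prices in prices_by_feature.items()}

def value_report(products):
    """Return (name, feature_score, price, value) tuples, best value first."""
    scores = base_scores(products)
    report = []
    for product in products:
        feature_score = sum(scores[f] for f in product["features"])
        # Value relative to other products: features obtained per unit of price.
        value = feature_score / product["price"]
        report.append((product["name"], feature_score, product["price"], value))
    return sorted(report, key=lambda row: row[3], reverse=True)
```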
  • a method for displaying product information.
  • a feature to price distribution is approximated for each of a plurality of features of a plurality of products.
  • a product feature score is calculated for each of at least a subset of the products.
  • data corresponding to a visual representation of the at least a subset of the products in relation to each other is output based on the product feature scores and prices of each of the at least a subset of the products.
  • a method for displaying product information, in accordance with another embodiment.
  • a value is assigned to each of a plurality of features of a plurality of products.
  • a product feature score is calculated for each of at least a subset of the products.
  • data corresponding to a visual representation of the at least a subset of the products in relation to each other is output based on the product feature scores and prices of each of the at least a subset of the products.
  • a method for displaying product information in accordance with yet another embodiment.
  • a value of each of a plurality of products relative to the other products is determined, where the values are based on features and prices of the products.
  • data corresponding to a visual representation of the products in relation to each other is output based on the value of the products in relation to each other.
  • a method for displaying product information in accordance with still yet another embodiment.
  • a value of each of a plurality of products relative to the other products is determined, where the values are based on features and prices of the products.
  • data corresponding to a visual representation of the products in relation to each other is output on a plot of features vs. product price based on the value of the products in relation to each other.
  • user input specifying a subset of the products is received, and data corresponding to a visual representation of the subset of products is output.
  • data corresponding to an additional visual representation listing a select number of the features of the product is output, and data corresponding to a link to additional information about the product is output, wherein a subset of the visual representations are highlighted based on defined criteria.
  • FIG. 1 illustrates a flow diagram of a method for analyzing and indexing an unstructured or semistructured document in accordance with one embodiment of the present invention.
  • FIG. 2 illustrates a preferred embodiment for processing, indexing, and searching an unstructured or semistructured document in accordance with one embodiment of the present invention.
  • FIG. 3 illustrates a preferred embodiment of the data model used for processing product documents in accordance with one embodiment of the present invention.
  • FIG. 4 illustrates a data model utilized in the conversion of a PDF to an
  • FIG. 5 illustrates an internal record organization in accordance with one embodiment of the present invention.
  • FIG. 6 demonstrates the relationship between a PDF char stream and a logical char stream in accordance with one embodiment of the present invention.
  • FIG. 7 illustrates a flow diagram of a method for analyzing and indexing an unstructured or semistructured document in accordance with another embodiment of the present invention.
  • FIG. 8 illustrates a flow diagram of a method for processing a search query in accordance with one embodiment of the present invention.
  • FIG. 9 illustrates one embodiment of the present invention.
  • FIG. 10 illustrates an example of a traditional lookup, merge, and sort which may be implemented in an embodiment of the current invention.
  • FIG. 11 illustrates an example of a preferred embodiment for content lookup, merge, and sort in accordance with one embodiment of the present invention.
  • FIG. 12 illustrates a flow diagram of a method for processing a search query in accordance with one embodiment of the present invention.
  • FIG. 13 illustrates a landing page in accordance with one embodiment of the present invention.
  • FIG. 14 illustrates a landing page implementation in accordance with one embodiment of the present invention.
  • FIG. 15 illustrates a search result page in accordance with one embodiment of the present invention.
  • FIG. 16 illustrates an implementation of the search results page displaying a PDF document page in accordance with one embodiment of the present invention.
  • FIG. 17 illustrates an example of the search results page displaying content from a web site in accordance with one embodiment of the present invention.
  • FIG. 18 illustrates an example of the process of submitting a query and displaying results between the client browser and the server in accordance with one embodiment of the present invention.
  • FIG. 19 is a flow diagram of a method for indexing a product identifier and logical parts thereof in accordance with one embodiment of the present invention.
  • FIG. 20 is a flow diagram of a process for indexing a product identifier and variations thereof in accordance with one embodiment of the present invention.
  • FIG. 21 illustrates a network architecture in accordance with one embodiment of the present invention.
  • FIG. 22 shows a representative hardware environment, in accordance with one embodiment of the present invention.
  • Figure 23 illustrates a network architecture, in accordance with one embodiment.
  • Figure 24 shows a representative hardware environment that may be associated with the servers and/or clients of Figure 23, in accordance with one embodiment.
  • Figure 25 shows a method for generating value-based information, in accordance with one embodiment.
  • Figure 26 shows a method for displaying product information, in accordance with another embodiment.
  • Figure 27 shows a method for displaying product information, in accordance with yet another embodiment.
  • Figure 28 shows a method for displaying product information, in accordance with still another embodiment.
  • Figure 29 shows a method for displaying product information, in accordance with still yet another embodiment.
  • Figure 30 shows an exemplary embodiment of a box diagram, in accordance with another embodiment.
  • Figure 31 shows a "baseline" feature level graph, in accordance with one embodiment.
  • Figure 32 shows a "baseline" feature level graph with overlap, in accordance with one embodiment.
  • Figure 33 shows an example of a graph of a feature with a large standard deviation, in accordance with one embodiment.
  • Figure 34 shows an example of a graph of a feature with a small standard deviation, in accordance with one embodiment.
  • Figure 35 shows a feature-price chart, in accordance with one embodiment.
  • Figure 36 shows an analysis of the significance of product location on a feature-price chart, in accordance with one embodiment.
  • Figure 37 shows a second display which may accompany a feature-price chart, in accordance with one embodiment.
  • Figure 38 shows a description including a simple list, in accordance with one embodiment.
  • Figure 39 shows a product snapshot, in accordance with another embodiment.
  • Figure 40 shows a product snapshot manufacturer grouping, in accordance with one embodiment.
  • Figure 41 shows an example of filtering by attributes, in accordance with one embodiment.
  • Figure 42 shows an example of drilling down, in accordance with one embodiment.
  • FIG. 43 shows an example of advanced product navigation, in accordance with one embodiment.
  • PDF Portable Document Format
  • FIG. 1 illustrates a flow diagram of a method 100 for analyzing and indexing an unstructured or semistructured document in accordance with one embodiment of the present invention.
  • a document is received in step 102.
  • the document may be unstructured or semistructured.
  • the unstructured or semistructured document may be in a printer format, such as Portable Document Format (PDF), or PostScript (PS) format, etc.
  • PS PostScript
  • the unstructured or semistructured document may also be a binary representation of dark and light areas of a scanned document. Further, the unstructured or semistructured document may not contain format markers. No information may be known about these documents, e.g. how lines of text fit together into paragraphs and sections, etc. Examples of unstructured or semistructured documents may include user manuals for electronic devices, product specification sheets, etc.
  • In step 104, the document is converted to one or more text streams.
  • the one or more text streams are analyzed for identifying textual contents of the document.
  • the textual contents may include words in the document.
  • the one or more text streams are analyzed for identifying logical sections of the document.
  • the sections may include groups of paragraphs of the document, each paragraph being individually detected by analyzing the one or more text streams. An extraction process may be performed in order to assist in this identification.
  • the textual contents are associated with the logical sections.
  • the textual contents and their association with the logical sections are indexed. Further still, the indexing may include assigning a weight to the textual contents.
  • a result of the indexing is saved in a data storage device, for example a nonvolatile memory device, e.g., hard disk drive; volatile memory; etc.
  • the content of the document is stored inside an index.
  • Each word from the content may be further tagged with the section and paragraph from which the word comes.
  • one or more text streams may be analyzed for identifying context information about each section, and the context information and the association of the textual contents and context information may be indexed with the logical sections.
  • One preferred embodiment for processing, indexing, and searching an unstructured or semistructured document is shown in FIG. 2.
  • a PDF document 202 in this example is converted to an Extensible Markup Language (XML) format document 206 in step 204.
  • XML Extensible Markup Language
  • the conversion process extracts the text elements and the bookmark information from the PDF file. Bookmark information is used later to create or assist in the creation of Table of Contents (TOC) entries.
  • TOC Table of Contents
  • the vendor module 302 comprises a vendor id and name of the products produced by the vendor. No two vendors will typically have the same vendor id.
  • the product module 304 comprises a product id, a model number, a Universal Product Code (UPC), and a description. No two products may have the same product id.
  • the product module 304 may also contain information regarding one or more vendors that produced the product (may be more than one due to possible merger or acquisition) as well as information on the documents that are written for this product.
  • the document module 306 comprises a document id, URL and other meta data about the document, and the products that this document is written for (a document can be written for multiple products, typically when the multiple products are variations of a single product model). No two documents may have the same document id.
  • the document module 306 may also contain Table of Contents (TOC) entries, index entries, sections, and pages.
  • the index entry module 308 contains document information and an index id, where no two index entries may have the same index id and document information.
  • the index entry module 308 also contains the text of the index entry, among other information.
  • the TOC entry module 310 contains document information and an entry id, where no two TOC entries may have the same entry id and document information.
  • the TOC entry module 310 further comprises the title of the TOC entry, subsections under this TOC entry, and a parent TOC entry that contains the current TOC entry.
  • the section module 312 contains document information and a section id, where no two section entries may have the same section id and document information.
  • the section module 312 also contains a TOC entry for the section as well as paragraphs belonging to the section.
  • the paragraph module 314 contains section information and a paragraph id, where no two paragraph entries may have the same paragraph id and section information.
  • the paragraph module 314 further comprises the text of the paragraph and the starting page for the paragraph. Additionally, the page module 316 contains document information and a page id, where no two page entries may have the same page id and document information. The page module 316 also contains a local page number and the paragraphs that start on the page.
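Part of this data model can be sketched with Python dataclasses. The class and field names below are illustrative assumptions; the vendor, product, index entry, and page modules are omitted for brevity, and the uniqueness rules are expressed as a composite key rather than enforced by a database.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Paragraph:
    section_id: int
    paragraph_id: int        # unique together with section_id
    text: str
    start_page: int          # page on which the paragraph starts

@dataclass
class TocEntry:
    document_id: int
    entry_id: int            # unique together with document_id
    title: str
    # The parent link is excluded from repr and comparison to avoid cycles.
    parent: Optional["TocEntry"] = field(default=None, repr=False, compare=False)
    children: List["TocEntry"] = field(default_factory=list)

@dataclass
class Section:
    document_id: int
    section_id: int          # unique together with document_id
    toc_entry: Optional[TocEntry] = None
    paragraphs: List[Paragraph] = field(default_factory=list)

@dataclass
class Document:
    document_id: int         # unique across documents
    url: str
    product_ids: List[int] = field(default_factory=list)
    toc: List[TocEntry] = field(default_factory=list)
    sections: List[Section] = field(default_factory=list)

def paragraph_key(doc: Document, section: Section, paragraph: Paragraph):
    """Composite key identifying one paragraph across all documents."""
    return (doc.document_id, section.section_id, paragraph.paragraph_id)
```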
  • the data model used in converting a PDF to an XML document is shown in FIG. 4.
  • the data model contains data regarding one or more of pagelabel 402, fontspec 404, page 406, and link 408.
  • the data model further contains data regarding one or more of block 410, line 412, and text 414.
  • Table 1 defines the aforementioned data model and illustrates one example of the output of the conversion to an XML file.
  • Software tools such as xpdf (available from http://www.foolabs.com/xpdf) use a similar output format for the XML file.
  • a PDF format document only contains layout information, for example text geometry and font size.
  • the document does not contain logical information such as section, paragraph, and sentences.
  • For each text segment extracted, the segment's geometry and font information is saved.
  • a text segment is a single sequence of characters of a particular font family, size and style. Then, the text segments that are close together are combined to form lines. Each line is made of multiple text segments along the line's writing axis. Finally, multiple lines that are close together are placed inside a text block.
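The segment-to-line-to-block grouping can be sketched as below. This is a simplified illustration under several assumptions: text is upright, segments on the same line share nearly the same baseline, and the tolerances (`y_tol`, `gap_tol`, `leading`, all expressed in multiples of the font size) are made-up values rather than figures from the source.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    x: float          # left edge
    y: float          # baseline position, increasing down the page
    width: float
    font_size: float

def group_into_lines(segments, y_tol=0.3, gap_tol=1.0):
    """Join segments whose baselines and horizontal gaps are close together."""
    lines = []
    for seg in sorted(segments, key=lambda s: (s.y, s.x)):
        last = lines[-1][-1] if lines else None
        if (last is not None
                and abs(seg.y - last.y) <= y_tol * seg.font_size
                and seg.x - (last.x + last.width) <= gap_tol * seg.font_size):
            lines[-1].append(seg)      # continues the current line
        else:
            lines.append([seg])        # starts a new line
    return lines

def group_into_blocks(lines, leading=1.5):
    """Place consecutive lines into one block when their vertical distance is small."""
    blocks = []
    for line in lines:
        if blocks and line[0].y - blocks[-1][-1][0].y <= leading * line[0].font_size:
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks
```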
  • the page numbers may be extracted from the document, resulting in meta data 216.
  • Document pages have a physical page index as well as a logical page number.
  • Page numbers can be numeric such as 1, 2, 3, etc. or have prefixes such as 1-1, 1-2, 2-1, 2-2, or non-numeric such as i, ii, iii, a, b, aa, bb, etc.
  • Files such as PDF files can embed page number information. When embedded, the PDF to XML conversion may issue pagelabel attributes.
  • pattern extraction may be used to determine the page numbering.
  • the lines on each page are sorted by the primary rotation.
  • the primary rotation indicates whether the text on the page is mostly facing up, facing right, facing down, or facing left. From this point on, the words "top" and "bottom" are used with respect to the primary rotation. For example, if the primary rotation is facing right, then "top" means "right-most" and "bottom" means "left-most".
  • the lines at the bottom-most edge of the page are selected. Then each line is placed inside one of multiple locations along the bottom edge.
  • the example, illustrated in Table 2, uses 6 locations, though more or fewer may be implemented.
  • numeric page numbers may be any text in one of the following formats, as shown in Table 3.
  • <prefix>-<number> E.g. 1-1
  • A-I <text> <number> E.g. 1, 20, 2 Getting Started, Getting Started 2
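Recognizing candidate page-number text in formats like these can be sketched with regular expressions. The pattern set below is an assumption reconstructed from the examples given (a bare number, a prefix and number joined by a hyphen, a number followed by text, and text followed by a number); the names `PAGE_LABEL_PATTERNS` and `parse_page_label` are hypothetical.

```python
import re

# Ordered from most to least specific; each pattern captures the numeric part.
PAGE_LABEL_PATTERNS = [
    ("prefix-number", re.compile(r"^(?P<prefix>[A-Za-z0-9]+)\s*-\s*(?P<number>\d+)$")),
    ("number", re.compile(r"^(?P<number>\d+)$")),
    ("number text", re.compile(r"^(?P<number>\d+)\s+(?P<text>\D.*)$")),
    ("text number", re.compile(r"^(?P<text>.*\D)\s+(?P<number>\d+)$")),
]

def parse_page_label(line):
    """Return (format_name, number) if the line looks like a page label, else None."""
    line = line.strip()
    for kind, pattern in PAGE_LABEL_PATTERNS:
        match = pattern.match(line)
        if match:
            return kind, int(match.group("number"))
    return None
```

Non-numeric labels such as i, ii, iii would need a separate pattern and are not handled in this sketch.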
  • the page numbers for the documents can be determined using the following tests:
  • the page number is extracted from the corresponding text located at the right location from each page to produce a list of logical page numbers. If the offset test fails for all 6 scenarios, then the "top-most" line is used to see if the page numbering occurs at the top of the page. For detection at the top of the page, a similar set of steps is followed: duplicate elimination, place text into multiple separate locations, compute offsets, and look for offset patterns.
  • paragraphs may be extracted from the document.
  • PDF documents do not contain information about logical paragraphs.
  • the geometry information about lines may be output. Lines are then placed inside blocks based on the line proximity to each other with respect to their font size. These blocks may be analyzed for joining lines into paragraphs.
  • the primary rotation of a page is determined. The primary rotation is the direction in which most of the text is written. From this point on, all directions are with respect to the primary rotation. For example, if most of the text is facing right, then the primary rotation is right.
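Determining the primary rotation can be sketched as a majority vote over the rotation of each line, followed by a lookup of which physical page edge plays the role of "top" and "bottom". The function names, the string labels for rotations, and the default of "up" for an empty page are assumptions made for this example.

```python
from collections import Counter

# Physical page edge that acts as the logical "top" for each primary rotation.
EDGE_FOR_ROTATION = {"up": "top", "right": "right", "down": "bottom", "left": "left"}
OPPOSITE_EDGE = {"top": "bottom", "bottom": "top", "left": "right", "right": "left"}

def primary_rotation(line_rotations):
    """Return the most common rotation among a page's lines.

    line_rotations: iterable of 'up', 'right', 'down', or 'left', one per line.
    """
    counts = Counter(line_rotations)
    if not counts:
        return "up"   # assumed default for a page without text
    return counts.most_common(1)[0][0]

def logical_edges(rotation):
    """Return the physical edges that serve as logical (top, bottom)."""
    top = EDGE_FOR_ROTATION[rotation]
    return top, OPPOSITE_EDGE[top]
```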
  • each page is split into multiple logical pages, and each logical page is divided into columns.
  • the blocks are ordered yx, meaning order by ascending primary rotation dimension, followed by left-to-right for the perpendicular dimension. Further, iterations are performed through all blocks, for block N:
  • the line style is bold or italic if all text within the line is bold or italic.
  • the font size of a line is the font size of the first text segment inside the line.
  • the Table of Contents may be extracted from the document, resulting in meta data 216.
  • the TOC is extracted in various ways: by reading the bookmarks from a PDF file, by analyzing the text in the PDF file, or both.
  • Some PDF documents have embedded bookmark information. Each bookmark is a link with a title and a physical page index pointing to a physical coordinate on a page. This information, when present, is outputted to the XML file during the PDF to XML conversion.
  • TOC extraction is performed after a page is divided into logical pages and each of the logical pages is divided into columns. TOC extraction then operates on each of the columns.
  • To create the TOC, starting on the first page, a search is performed for a block with the text "Contents" or "Table of Contents" within the columns of the page. Once found, it represents the start of the TOC. Then a search continues from that point to look for lines that contain the text "Index" or that have a font size greater than or equal to the TOC heading size. The lines between the starting and the ending points are then processed for TOC entries.
  • each line is analyzed to determine if it ends with a number. If it does, then the text before the number represents the entry title with the number representing the logical page number. If a line does not end with a number, then the next line is analyzed to see if it has the same font and size. If it does, then the entry is made of two lines with the second line being a continuation of the first line. The text is joined between the two lines and checking continues to see if there is a page number for this entry. Otherwise, the line is not used as a TOC entry because it has no page number associated with it.
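  • The line-by-line rule above can be sketched as follows. This is a minimal illustration, not the implementation described in this document: the names `Line`, `TocEntry`, `split_trailing_number`, and `parse_toc_lines` are hypothetical, and stripping dot leaders before the page number is an added assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Line:
    text: str
    font: str
    size: float


@dataclass
class TocEntry:
    title: str
    page: int  # logical page number as printed in the TOC
    font: str
    size: float


# Title, then optional whitespace or dot leaders (assumption), then a trailing number.
_TRAILING_NUMBER = re.compile(r"^(?P<title>.*?)[\s.]*(?P<page>\d+)\s*$")


def split_trailing_number(text: str) -> Optional[Tuple[str, int]]:
    """Return (title, page) if the text ends with a number, else None."""
    m = _TRAILING_NUMBER.match(text)
    if not m or not m.group("title").strip():
        return None
    return m.group("title").strip(), int(m.group("page"))


def parse_toc_lines(lines: List[Line]) -> List[TocEntry]:
    entries: List[TocEntry] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        parsed = split_trailing_number(line.text)
        if parsed is None and i + 1 < len(lines):
            nxt = lines[i + 1]
            if (nxt.font, nxt.size) == (line.font, line.size):
                # Same font and size: treat the next line as a continuation
                # and look for the page number on the joined text.
                parsed = split_trailing_number(line.text + " " + nxt.text)
                if parsed is not None:
                    i += 1  # the continuation line has been consumed
        if parsed is not None:
            entries.append(TocEntry(parsed[0], parsed[1], line.font, line.size))
        # A line that never yields a page number is not used as a TOC entry.
        i += 1
    return entries
```

  • For instance, a two-line entry such as "Troubleshooting and" followed by "Maintenance 27" in the same font is joined into one entry for logical page 27, while a line with no page number and no matching continuation is skipped.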
  • TOC entries of the same font family and size are considered to be at the same level.
  • Each TOC entry can be a member of another TOC entry. This is accomplished by adding the current TOC entry to the last TOC entry that has a different font family and size in the next location on the stack. When no parent entry is found, then this TOC entry is at the root level. At the end of this process, the TOC entries form a hierarchical tree.
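  • One possible reading of the stack-based nesting described above is sketched below. The exact stack discipline is not spelled out in the text, so this is an assumption: an entry whose (font family, size) already appears on the current path is treated as a sibling at that level, and any other entry becomes a child of the most recent entry. `TocNode` and `build_toc_tree` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TocNode:
    title: str
    style: Tuple[str, float]  # (font family, font size)
    children: List["TocNode"] = field(default_factory=list)


def build_toc_tree(entries: List[Tuple[str, str, float]]) -> List[TocNode]:
    """entries: (title, font family, font size) tuples in document order."""
    roots: List[TocNode] = []
    stack: List[TocNode] = []  # path from a root entry to the latest entry
    for title, font, size in entries:
        node = TocNode(title, (font, size))
        styles = [n.style for n in stack]
        if node.style in styles:
            # Same style as an entry on the path means same level: drop that
            # entry and everything nested below it from the path.
            del stack[styles.index(node.style):]
        # The remaining top of the stack, if any, has a different style and
        # becomes the parent; with an empty stack the entry is at root level.
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots
```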
  • a TOC tree produced through text pattern analysis contains logical page numbers, while a TOC tree produced through embedded bookmarks contains physical page indexes. Hence, a TOC tree produced through text pattern analysis may require page number detection as well so that these logical page numbers can be mapped to physical page indexes.
  • In step 212, section extraction is performed on the document, resulting in section data 214.
  • Section extraction is performed using the extracted TOC of the document to detect section boundaries. If the TOC is extracted via text analysis, then the page number for each TOC entry is first mapped to a physical page index using the extracted page numbers.
  • Section extraction starts at the first paragraph after the end of the TOC.
  • the first paragraph after the TOC is used to compare against each TOC entry to find the first matching entry. This anchors the first section after the TOC section to a particular TOC entry in the TOC hierarchy.
  • each paragraph on the subsequent pages is scanned against the next several (e.g. 2, 3, 4, ...) TOC entries. Assume the system uses 3 TOC entries. During this scanning, a particular page may be jumped to using the page index associated with the TOC entries. If any one of the 3 TOC entries is found to match a paragraph exactly, it is then determined whether a fuzzy match needs to be performed. A fuzzy match is required when no exact match is found for the first or the first two TOC entries. In the case of a fuzzy match, the first TOC entry is fuzzy compared against each paragraph between the previous exact TOC match and the current exact TOC match.
  • the fuzzy match is continued using the 2nd TOC entry against the paragraphs between the first fuzzy-matched paragraph and the current exact TOC-matched paragraph.
  • the fuzzy compare used may be based on any string similarity algorithm, e.g. Hamming, Levenshtein, etc.
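  • As an illustration of the fuzzy compare, the sketch below uses the Levenshtein edit distance, one of the algorithms named above. The function names, the case-insensitive comparison, and the `max_ratio` acceptance threshold are assumptions made for this example only.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(toc_title: str, paragraphs: List[str],
                max_ratio: float = 0.2) -> Optional[int]:
    """Return the index of the closest paragraph, or None if none is close.

    max_ratio is a hypothetical cut-off: edit distance divided by the
    length of the longer string must not exceed it.
    """
    target = toc_title.casefold().strip()
    best_index, best_ratio = None, None
    for index, text in enumerate(paragraphs):
        candidate = text.casefold().strip()
        longest = max(len(target), len(candidate)) or 1
        ratio = levenshtein(target, candidate) / longest
        if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
            best_index, best_ratio = index, ratio
    return best_index
```

  • In the scheme described above, `paragraphs` would hold only the paragraphs between the previous exact TOC match and the current exact TOC match.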
  • the section file is a binary file containing one record for each section.
  • the section file is named <PDF filename>.section. Records are packed next to each other without any gaps. Integers stored in the record are 32 bit in Big Endian byte order.
  • the section offset represents an offset into the paragraph file, in units of 256 bytes.
  • the parent index represents the parent section index.
  • the UTF length represents the byte length of the title field, and the title represents the title of the section.
  • a corresponding file called <PDF filename>.section.txt is also generated that contains the text version of the file. This file is used for debugging.
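  • A section record of this kind could be written and read with Python's `struct` module, as sketched below. Several details are assumptions rather than facts from this document: the field order (section offset, parent index, UTF length, title), the use of signed integers, the value -1 for a root section's parent, and UTF-8 as the title encoding.

```python
import struct
from typing import List, Tuple

# Three big-endian 32-bit integers: section offset, parent index, UTF length.
_HEADER = struct.Struct(">iii")


def pack_section(section_offset: int, parent_index: int, title: str) -> bytes:
    raw_title = title.encode("utf-8")  # assumed encoding
    return _HEADER.pack(section_offset, parent_index, len(raw_title)) + raw_title


def unpack_sections(data: bytes) -> List[Tuple[int, int, str]]:
    """Read back-to-back records (no gaps) until the data is exhausted."""
    records = []
    pos = 0
    while pos < len(data):
        section_offset, parent_index, length = _HEADER.unpack_from(data, pos)
        pos += _HEADER.size
        title = data[pos:pos + length].decode("utf-8")
        pos += length
        records.append((section_offset, parent_index, title))
    return records
```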
  • the paragraph file is a binary file containing records for each paragraph.
  • the paragraph file is named <PDF filename>.para.
  • a record in the paragraph file may start on the next available 256-byte boundary. Integers stored in the record are 32 bit in Big Endian byte order. Strings stored in the record are in UTF-8 format. As a result, there may be gaps between two records. For each paragraph record, the following information is recorded, as shown in Table 5.
  • the section index represents the section this paragraph belongs to, and the paragraph offset represents the offset, in units of 256 bytes, relative to the start of the section.
  • This field and the section index are stored together inside the first 32 bit int.
  • the page index represents the page index on which this paragraph starts, and the UTF length represents the byte length of the paragraph text.
  • the paragraph text represents the text of the paragraph in UTF-8 format, and the padding represents the bytes used to pad the record to the next 256 byte boundary.
  • the section index and the paragraph offset are encoded into a single 32 bit integer. See the section on Index creation on the format of this integer.
  • Table 6 illustrates an example of how the padding is computed.
  • the section offset is stored as part of the section binary file, as this storage format helps to optimize skipping to a specific paragraph of a specific section. For example, to read Para 2 of Section 1 in the file above, one can compute the byte offset by performing the following computation, as shown in Table 7:
  • FIG. 5 depicts the internal record organization and the relationship between the records from the section file 502 and the paragraph file 504, as well as the PDF file 506.
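  • The padding and offset arithmetic described above can be illustrated as follows. The 12-byte header size (three 32-bit integers: the packed section index and paragraph offset, the page index, and the UTF length) is an assumption inferred from the field descriptions, and the function names are hypothetical.

```python
CHUNK = 256        # records are aligned to 256-byte boundaries
HEADER_BYTES = 12  # assumption: three 32-bit integers precede the text


def padding_length(record_length: int) -> int:
    """Bytes needed to pad a record up to the next 256-byte boundary."""
    return -record_length % CHUNK


def padded_record_size(paragraph_text: str) -> int:
    length = HEADER_BYTES + len(paragraph_text.encode("utf-8"))
    return length + padding_length(length)


def paragraph_byte_offset(section_offset: int, paragraph_offset: int) -> int:
    """Byte position of a paragraph record inside the paragraph file.

    Both offsets are counted in 256-byte units: the section offset comes
    from the section file and the paragraph offset is relative to the
    start of that section.
    """
    return (section_offset + paragraph_offset) * CHUNK
```

  • Because both offsets are stored in 256-byte units, seeking to a paragraph needs only one addition and one multiplication, with no scan of the preceding records.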
  • Document Index Extraction
  • Referring again to FIG. 2, in step 208 the document index may be extracted from the document, resulting in meta data 216. Many documents have an Index section toward the end. The index contains useful terms for the document and where in the document each term appears. This index information may be extracted, and the words from an index entry may be boosted on the page that the index entry points to. Because the index entry points to logical page numbers, the extracted page numbers may be used to map each logical page number to the corresponding physical page index.
  • Taxonomy Extraction
  • taxonomy-related information may be extracted from the document, resulting in taxonomy information 232.
  • Taxonomy-related information may include identification information such as product vendor, product identifier (product model number, product name), etc., and may be associated with the textual contents of the document. The context of the document may also be correlated to taxonomy-related information. Additionally, the taxonomy information may be indexed.
  • a char offset index may be extracted from the document for keyword highlighting.
  • it may be desirable to highlight keywords that are entered as a part of the query syntax. It may therefore be desirable to extract the PDF char offset information for the matching keyword in order to perform the keyword highlighting.
  • the PDF char offset is then sent to the PDF renderer, which paints a rectangle with a background color before painting the character, producing the highlighting effect.
  • chunk 2 is not in sequential order after chunk 1 inside the PDF file. If the search query keyword is "lazy", then its corresponding PDF char offset 55 may need to be determined in order to instruct the PDF renderer to highlight the word "lazy" in the above sentence. Also note that the <space> characters between "jumps over" and "over a" are not present in the PDF file. These <space> characters are artificially introduced in the logical paragraph.
  • a char offset index is constructed to map between the logical char offset to PDF character offset.
  • the PDF to XML generator may first produce the PDF char offset for every text chunk from the PDF file. Then during the construction of the logical paragraphs, a char offset index may be generated.
  • the char index file is an ordered list of tuples representing key value pairs, as shown in Table 9.
  • the (paragraph offset, logical char offset) is the key.
  • the logical char offset is the character position relative to the first character of the chunk within the paragraph.
  • the first character of a paragraph has the logical char offset of 0. See previous sections on how the paragraph offset is computed.
  • the corresponding part of the char offset index file may contain the following, shown in Table 10.
  • FIG. 6 demonstrates the relationship between a PDF char stream 602 and a logical char stream 604.
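  • A lookup against such a char offset index could work as sketched below. The sample offsets in the usage note are invented for illustration, `to_pdf_char_offset` is a hypothetical name, and the sketch assumes that characters within one chunk are contiguous in the PDF char stream. Artificially introduced <space> characters have no PDF counterpart and are not handled here.

```python
from bisect import bisect_right
from typing import List, Tuple

# Each entry: ((paragraph offset, logical char offset of chunk start), PDF char offset)
IndexEntry = Tuple[Tuple[int, int], int]


def to_pdf_char_offset(index: List[IndexEntry], paragraph_offset: int,
                       logical_offset: int) -> int:
    """Map a logical char offset within a paragraph to a PDF char offset.

    index must be sorted by its (paragraph offset, logical char offset) key.
    """
    keys = [key for key, _ in index]
    # Find the last chunk that starts at or before the requested position.
    i = bisect_right(keys, (paragraph_offset, logical_offset)) - 1
    if i < 0 or keys[i][0] != paragraph_offset:
        raise KeyError((paragraph_offset, logical_offset))
    (_, chunk_start), pdf_start = index[i]
    return pdf_start + (logical_offset - chunk_start)
```

  • With a hypothetical index `[((0, 0), 100), ((0, 11), 40), ((0, 16), 55)]`, logical offset 16 of paragraph 0 falls at the start of the third chunk and maps to PDF char offset 55, even though that chunk is stored before the first chunk in the PDF stream.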
  • the context of a document may be extracted from the document, resulting in context meta data 220, which may be indexed.
  • the context for a document may include the list of products this document is written for. This context can be provided manually by the person who first obtained this document or extracted from the site map where the document was downloaded from. Alternatively, words inside the title page of the document or the URL link of the document can be taken and looked up against a well-defined dictionary of vendor name and product model numbers. For each product identified, the following information is saved as shown in Table 11.
  • vendor - the vendor (or vendors, if multiple) who created this product, e.g. Sony, Nikon
  • model # - model number of this product e.g. dc4800, e995
  • the UPC and description may be obtained from a separate product information catalog after the product is identified by vendor and model number.
  • the context information obtained is then stored as a set of metadata associated with this document.
  • a searchable index 224 is created and updated on both the content and the context utilizing an update index function 226 and index updater function 228.
  • the indexer may be adapted from Open Source Lucene.
  • the content index is stored inside a Lucene field called "content" while the context information is stored in various other fields.
  • the doc id points to the documents that contain the given word. Then, for each document, the position id is traditionally used to point to the offsets within the document at which this word occurs.
  • each word's position information may be manipulated such that it is used to encode the section and the paragraph location of the given word.
  • the position id is a 32 bit integer.
  • the 32 bits are divided into 3 bit sets, as shown in Table 13.
  • the most significant bit, the sign bit, is not used.
  • the next 13 bits are used to store the id of the section from which a word comes.
  • the next 16 bits are used to store the paragraph chunk offset into the paragraph file.
  • a paragraph file stores each paragraph on a 256 byte chunk alignment.
  • each section can have a maximum of 65536 paragraphs.
  • the size of each paragraph is unlimited. However, the minimum amount of space taken up by a paragraph is 256 bytes inside the paragraph file.
  • the paragraph file layout is shown in Table 14.
  • the least significant 2 bits are used to store a priority value associated with the given word.
  • the value 0 is the default.
  • Other values are used to encode the importance of the word.
  • bi-words can be configured to have a priority value of 1. During scoring, a different weight is associated to bi-words by checking if a matched word has a priority value of 1.
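  • The bit layout described above (an unused sign bit, a 13-bit section id, a 16-bit paragraph chunk offset, and a 2-bit priority) can be packed and unpacked with shifts and masks. The function names below are hypothetical; the field widths follow the text.

```python
from typing import Tuple

SECTION_BITS = 13
PARAGRAPH_BITS = 16
PRIORITY_BITS = 2


def encode_position(section_id: int, paragraph_offset: int, priority: int = 0) -> int:
    """Pack the three fields into a non-negative 32-bit position id."""
    if not 0 <= section_id < (1 << SECTION_BITS):
        raise ValueError("section id does not fit in 13 bits")
    if not 0 <= paragraph_offset < (1 << PARAGRAPH_BITS):
        raise ValueError("paragraph offset does not fit in 16 bits")
    if not 0 <= priority < (1 << PRIORITY_BITS):
        raise ValueError("priority does not fit in 2 bits")
    return ((section_id << (PARAGRAPH_BITS + PRIORITY_BITS))
            | (paragraph_offset << PRIORITY_BITS)
            | priority)


def decode_position(position_id: int) -> Tuple[int, int, int]:
    """Return (section id, paragraph offset, priority)."""
    priority = position_id & ((1 << PRIORITY_BITS) - 1)
    paragraph_offset = (position_id >> PRIORITY_BITS) & ((1 << PARAGRAPH_BITS) - 1)
    section_id = (position_id >> (PARAGRAPH_BITS + PRIORITY_BITS)) & ((1 << SECTION_BITS) - 1)
    return section_id, paragraph_offset, priority
```

  • Since the sign bit is never set, the largest encodable value is 2^31 - 1, so the position id remains a valid non-negative signed 32-bit integer.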
  • the document context is one or more products this PDF file is written for, the type of the document, and other meta information.
  • Each product can be described by one or more identifiers, such as UPC, vendor name, and model number, and a product description.
  • a context index is created by encoding these context meta data into a special section 0 of the document.
  • FIG. 7 is a flow diagram of a method 700 for analyzing and indexing an unstructured or semistructured document in accordance with another embodiment of the present invention.
  • the process 700 may be implemented in the context of the architecture and environment of FIGs. 1-6. Of course, however, the process 700 may be carried out in any desired environment.
  • an unstructured or semistructured document is received in step 702. Additionally, in step 704 the document is converted to one or more text streams. Further, in step 706 the one or more text streams is analyzed for identifying paragraphs of the document, and in step 708 the paragraphs are grouped into sections. Further still, in step 710 the sections, or a derivative thereof, are output to at least one of a user, another system, and another process.
  • page numbers may be extracted from the document, and the sections may be associated with the page numbers. Also, the boundaries of the sections may be determined at least in part based on an analysis of a table of contents of the document.
  • FIG. 8 is a flow diagram of a method 800 for processing a search query in accordance with one embodiment of the present invention.
  • the process 800 may be implemented in the context of the architecture and environment of FIGs. 1-7. Of course, however, the process 800 may be carried out in any desired environment.
  • a search query containing terms is received in step
  • In step 804, at least some of the terms are looked up in a search index for identifying sections of documents containing the at least some of the terms. This may include identifying paragraphs in the sections containing the at least some of the terms and calculating a paragraph score for each of the paragraphs based at least in part on a number of the terms appearing in each of the paragraphs, wherein a section score is calculated based on the paragraph scores of the paragraphs in the section.
  • Further, in step 806 a content score is generated for each of the documents based at least in part on a number of keywords found in the sections of each document. The content score may reflect all matches in the document, or the highest section score or scores in one or more of the sections.
  • Weighting may be applied to each keyword found in the sections of the documents, where the weighting affects the content score.
  • at least some of the terms in the search index are looked up for attempting to match one or more of the terms to context information in the search index, the context information being associated with at least one of the documents. Weighting may be applied to each keyword matching the context information, where the weighting affects the context score.
  • a context score is generated based at least in part on the matching of terms to the context information. This may include the case where the context score is zero if none of the terms match context information.
  • a document score is generated for each of the documents based at least in part on the content score and the context score. The document score may be calculated based at least in part on the section scores of the sections of the documents.
  • an indicator of at least one of the documents, or portion thereof is output for the at least one of the documents having a higher document score relative to other of the documents. Additionally, an indicator of at least one of the sections having a higher paragraph score relative to other of the sections may be output. The indicator may be of a section of the at least one of the documents.
  • step 226 the system can submit user search queries to locate the right document and the sections within the document.
  • the application server would construct a query based on the given terms.
  • For example, the query "a brown fox jumps over the lazy dog." is processed through the following steps:
  • a query recommender may be utilized. When a user makes mistakes in entering the query, the user may not get the expected results. The mistake may be a result of misspelled words or imprecise model numbers. A query recommender tries to find good alternatives in these circumstances. For example, the query recommender may be used to correct product model numbers.
  • In one embodiment, the query recommender may correct a single unmatched term. When a single query term does not match any document, the query recommender shall find alternatives to that term in product model numbers, for example suggesting "canon A40" for "canon A45". In another embodiment, the query recommender may find a closer model number. When all terms match some document, the query recommender shall take a content term and find product model alternatives.
  • the query recommender may suggest alternatives for queries that produce too many results. For example, queries like "sony 100" may produce many matches. The query recommender shall suggest alternatives so that the user can submit better queries to get more relevant results. Further still, the query recommender may correct misspelled queries in recommendations and should return a recommendation in a reasonable amount of time because it adds to the duration of the search.
  • recommended queries may be close to the original query. They should constitute an improvement to be displayed. For example, a recommended query should not be a duplicate of the original query, nor should it appear entirely unrelated to the original. Other embodiments may also address integration and priority issues.
  • the query recommendation process may begin with examining the PDF search results for the original query 902. The terms for replacement are identified. Based on rest of the matched terms in the query, a dictionary is constructed. Additionally, words from the dictionary are sought that are closest to the replacement terms. If the proximity of the candidates passes certain thresholds, the model numbers corresponding to these candidates are returned. Further, the query is reconstructed by replacing old terms with suggestions, and displaying the end results to the user.
  • the query may be sent to a spelling suggestion web service 924, e.g. Yahoo! Spelling Suggestion web service. This mainly fixes spelling errors, but also includes commonly used vendor or family names and other phrases.
  • if the process does yield one or more results in step 920, then in step 922 the top three results are chosen to return to the user.
  • the top result from position search 904 is used to determine whether query recommendation is kicked off. From the top result's match masks, it is determined in step 906 which query term matches the vendor or the family, which term matches a product model number, and which term, if any, does not match any document.
  • the top result's unmatch mask may identify the unmatched terms.
  • Counting these occurrences, if the count is 1, it can be determined in step 908 that a single term does not match any document. This term is added as a term to be replaced in step 912.
  • If the top result's unmatch mask is 0 in step 910, all terms have matched some document. Matched terms are then placed into two groups in step 914: (1) product terms, i.e. terms that match vendor, family, and/or models, and (2) content terms, i.e. terms that match the content of the PDF document. This may be done by looking at the vendor, family, full, partial, and alpha match masks of the top result. If a term is not matched according to any of these masks, it is a content term. Content terms to be replaced are added.
  • Neighboring terms (or biwords) in a query often offer stronger contextual semantics. The terms to replace may be decided as follows:
  • a dictionary provides a collection of words from which candidates are selected for recommendations.
  • the dictionary may be formed in optional step 916 based on the following constraints, whichever is known:
  • the dictionary may be pre-defined or pre-constructed.
  • For each term to replace, in step 918 an alternative is determined from the dictionary based on a proximity algorithm.
  • the algorithm assumes as input a list of dictionary terms (known model names that may consist of full model name, alphanumeric or alpha only model parts, etc.), and the query term that needs a recommendation.
  • the output is a sorted list of recommended terms, the models each recommendation represents, and a score (lower the better) for each recommendation.
  • the steps of the algorithm are as follows:
  • each dictionary term is converted into two feature vectors: (i) a histogram of alphanumeric character counts (counting the number of a, b, ..., z, 0, 1, ..., 9); and (ii) a bi-character and tri-character histogram represented as a hashmap (referred to as the multi-character histogram).
  • each bucket of the histogram is converted into an integer value and its count is stored in the hashmap.
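  • The two feature vectors could be computed as in the sketch below. The text does not say how a bucket is converted into an integer, so the base-37 encoding in `ngram_code` is an assumption, as are the function names and the choice to lowercase terms and drop non-alphanumeric characters.

```python
import string
from collections import Counter
from typing import Dict, List

ALPHANUMERIC = string.ascii_lowercase + string.digits  # a..z then 0..9


def char_histogram(term: str) -> List[int]:
    """36-bucket count of a, b, ..., z, 0, 1, ..., 9 in the term."""
    counts = Counter(ch for ch in term.lower() if ch in ALPHANUMERIC)
    return [counts[ch] for ch in ALPHANUMERIC]


def ngram_code(ngram: str) -> int:
    """Hypothetical integer key: base-37 digits, with 0 reserved for 'no char'.

    Reserving 0 keeps the codes of 2-character and 3-character buckets disjoint.
    """
    code = 0
    for ch in ngram:
        code = code * (len(ALPHANUMERIC) + 1) + ALPHANUMERIC.index(ch) + 1
    return code


def multi_char_histogram(term: str) -> Dict[int, int]:
    """Counts of every bi-character and tri-character run, keyed by integer."""
    cleaned = "".join(ch for ch in term.lower() if ch in ALPHANUMERIC)
    histogram: Dict[int, int] = {}
    for n in (2, 3):
        for i in range(len(cleaned) - n + 1):
            key = ngram_code(cleaned[i:i + n])
            histogram[key] = histogram.get(key, 0) + 1
    return histogram
```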
  • the distance consists of a distance score weighting the following:
  • the query string as the user enters is parsed for performing the search.
  • the main transformations are:
  • stop words such as "a" are removed;
  • words separated by punctuation are broken up, e.g. "dcr-hc20" becomes "dcr hc20";
  • neighboring words are concatenated to form biwords and are appended.
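  • A minimal sketch of these query transformations follows. The stop word list, the lowercasing, and the choice to join biwords without a separator are assumptions, and `parse_query` is a hypothetical name. For simplicity the sketch splits on punctuation before removing stop words.

```python
import re
from typing import List

STOP_WORDS = {"a", "an", "the"}  # hypothetical list; the text only names "a"


def parse_query(query: str) -> List[str]:
    # Break the query on whitespace and punctuation: "dcr-hc20" -> "dcr", "hc20".
    words = [w for w in re.split(r"[^0-9a-z]+", query.lower()) if w]
    # Drop stop words.
    words = [w for w in words if w not in STOP_WORDS]
    # Concatenate each pair of neighboring words into a biword and append.
    biwords = [first + second for first, second in zip(words, words[1:])]
    return words + biwords
```

  • For the query "a sony dcr-hc20 manual" this yields the single words "sony", "dcr", "hc20", and "manual", followed by the biwords "sonydcr", "dcrhc20", and "hc20manual".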
  • the right term(s) may be replaced in the original query and other terms kept untouched. This may be achieved as follows: get a list of query term tokens. These tokens are saved during query string parsing.
  • the token's term position in the parsed query is -1 or it is not at the position to be replaced; the token is not part of the suggestion; the token is outside the char position range found in step 2; the token has not been previously added. Otherwise, add the suggestion to the output if the following are all true: the token's term position in the parsed query is the position to be replaced; the token is within the char position range found in step 2; the replacement has not been previously added.
  • the content of the document is stored inside the index 224.
  • Each word from the content is further tagged with the section and paragraph from which the word comes.
  • the query is processed to retrieve the matching documents from the search index 224.
  • FIG. 10 illustrates an example of a traditional lookup 1002, merge 1004, and sort 1006 which may be implemented in some embodiments.
  • a search engine may perform a look up 1002 for the term given in the query in the index, and then return a list of document ids in ascending order for that term. Then a merge process 1004 is used to combine the terms matched for each document together to form a score based on how many terms matched for each document as well as other information such as the term frequency in that document and the overall term frequency across all documents.
  • An example of a preferred embodiment for content lookup, merge, and sort is shown in FIG. 11.
  • a unique lookup 1102, merge 1104, and sort process 1106 that takes into consideration the section and paragraph information may be used.
  • each term is looked up against the search index to find the list of documents that contains the term.
  • the lookup process returns the "section id, paragraph number" for each term as well. Since the indexing process encodes the section id and paragraph number into a 32-bit position id value, a list of <document id, position id> integer pairs is returned in ascending order.
  • in the merge process 1104, all terms appearing in the same paragraph are combined to form a local paragraph score, and then all paragraphs from the same section are combined to form a section score. Finally, the section scores from the same document are used to produce a document score.
  • the search result is still sorted using sort process 1106 by the final document score as before. But for each document, not only is the score for that document produced, but also the list of sections and for each section, a section score and the top 3 paragraphs (or more or less) that have the best match for that section are stored.
  • flags indicating which terms have matched for this document in the result are returned. These flags can be used by the application to further refine ranking, create query recommendation, and control display.
  • a score is generated for each matching document during the merge process. This score may be built up piece by piece using the following illustrative process or variations thereof.
  • Create a section score by counting the total number of different keywords matched from all paragraphs of that section, and then adjust it using the scores from the top 3 paragraphs. Also adjust the score by taking into account any matches from paragraph 0. Paragraph 0 indicates that the title of the section matched. Also, the score can be adjusted by counting how many bi-words matched inside this section as well.
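  • The paragraph, section, and document roll-up can be sketched as below. The numeric weights are purely hypothetical, since no formula is given: the paragraph score is taken as the number of distinct query terms in the paragraph, the section score as ten times the distinct terms in the section plus the top paragraph scores, and the document score as the best section score. The paragraph 0 (title) and bi-word adjustments are left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def position_id(section_id: int, paragraph: int, priority: int = 0) -> int:
    # 13-bit section id, 16-bit paragraph offset, 2-bit priority.
    return (section_id << 18) | (paragraph << 2) | priority


def merge(postings: Dict[str, List[Tuple[int, int]]], top_n: int = 3):
    """postings maps each query term to its (doc id, position id) pairs.

    Returns a list of (document score, doc id, sections) sorted best first,
    where sections is a list of (section score, section id, top paragraphs).
    """
    # doc id -> section id -> paragraph -> set of matched terms
    matches = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for term, pairs in postings.items():
        for doc_id, pos in pairs:
            section_id = (pos >> 18) & 0x1FFF
            paragraph = (pos >> 2) & 0xFFFF
            matches[doc_id][section_id][paragraph].add(term)

    results = []
    for doc_id, sections in matches.items():
        scored_sections = []
        for section_id, paragraphs in sections.items():
            # Local paragraph score: distinct query terms in the paragraph.
            para_score = {p: len(terms) for p, terms in paragraphs.items()}
            top = sorted(para_score, key=lambda p: (-para_score[p], p))[:top_n]
            distinct = len(set().union(*paragraphs.values()))
            # Hypothetical weighting of distinct keywords vs. top paragraphs.
            score = 10 * distinct + sum(para_score[p] for p in top)
            scored_sections.append((score, section_id, top))
        scored_sections.sort(key=lambda s: (-s[0], s[1]))
        # Assumption: the document score is its best section score.
        results.append((scored_sections[0][0], doc_id, scored_sections))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results
```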
  • the result of the document scoring is a set of objects containing the following information for each document score:
  • a document id; a document overall score; and a list of section scores. For each section score:
  • context matching is done at the same time as content matching because the context information is stored inside the same index, with the term position set to section 0. There is no additional logic required to figure out if a term matched inside the context.
  • context scoring is done by first determining if a match is for the context. This is easily implemented by checking the section number of the match for a document. If a match results in section 0, then it is for the context. Then, based on the matching paragraph id, it can be determined which one of the meta data fields the term matched in. For example, if the term "sony" produces a match on section 0 and paragraph 0, then it is known that "sony" is a vendor term for this document. However, if the term "sony" produces a match on section 12, paragraph 3, then the document is not about Sony, but the word "sony" is mentioned inside the content of the document.
  • Further, during context scoring, score values for the following meta data fields are produced:
  • taxonomy matching may be performed as part of or separate from context scoring utilizing taxonomy information 232.
  • the value given to a term match in a meta field is generally greater than the value given to the same match found in the content field. For example, if the term "manual" matched the "Document Type" field, then this document may get a higher score than another document that has this term matched only in its content.
  • the meta fields contain special words that have strong semantics for a document. By leveraging these special terms inside the meta field, not only is a better and semantically more relevant ranking created across documents, but better ranking is also produced within the sections of the same document.
  • One of the query terms is the word "_len".
  • "_len" is a special term that can only match inside the context meta data. There is only one "_len" term for a document. This term exists in section 0, paragraph 10 or above. During the context scoring, the paragraph id of the match for "_len" is taken and 10 is subtracted from it. The resulting number is the encoded full model number length. The full model length is used to assist in computing the score value for the full and partial model match.
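  • The section 0 convention and the "_len" decoding can be illustrated together. The function name and the returned labels are hypothetical; only the section 0 test and the offset of 10 come from the text.

```python
from typing import Optional, Tuple

LEN_BASE_PARAGRAPH = 10  # "_len" matches land on paragraph 10 or above


def classify_match(section_id: int, paragraph_id: int,
                   term: str) -> Tuple[str, Optional[int]]:
    """Classify one term match by where in the index it was found.

    Returns ("content", None) for matches outside section 0,
    ("model_length", n) for the special "_len" term, where n is the
    encoded full model number length, and ("context", paragraph_id)
    for other context matches; the paragraph id then selects the
    meta data field (paragraph 0 is the vendor in the example above).
    """
    if section_id != 0:
        return ("content", None)
    if term == "_len" and paragraph_id >= LEN_BASE_PARAGRAPH:
        return ("model_length", paragraph_id - LEN_BASE_PARAGRAPH)
    return ("context", paragraph_id)
```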
  • results of a search 234 may be post processed to improve the results.
  • a multi-stage post processing may be employed to efficiently and effectively filter out poor results or boost more relevant results.
  • a poor result is defined as a result without a good product match resulting from either a low product score or an unintended product match for a very generic alphabetic query.
  • the PDF search results are post processed by filtering out results with poor product matches, and re-ranking results based on document type.
  • the result 234 is a list of documents ordered by the search result score.
  • the search result score is a combination of the content score and context score.
  • Content score is the score given to the document based on keyword matches inside the sections and paragraphs of the document content.
  • Context score is based on the keyword matches inside the meta data about the document. Meta data includes items such as vendor, model, family, title, and document type, and may include taxonomy-related information.
  • results with poor product matches are filtered out.
  • one of the assumptions concerning the PDF search results is that a PDF document should not be returned unless the product model is relevant to the user query. Irrelevance of the PDF document can occur either due to a mismatch (e.g., all query terms match well in the content, but don't match a particular product) or due to a generic, non- product-specific user query (certain words in the query match a product, but these are not specific enough).
  • the former case may be handled via a threshold on the difference between consecutive document scores. If the product score difference falls below a threshold then all documents below the current doc are filtered out.
  • in the latter case, e.g. a generic query that matches a product such as "digital camera solution disk", further checks may be employed. For instance, if there isn't a vendor or family match, then there must be an alphanumeric product model match for the product model to be considered relevant to the query.
  • results may be re-ranked based on doctype.
  • a specific weight is added to the document types. This boost is referred to as the docTypeBoost.
  • section summary reconstruction may be performed. For example, when a document is returned as a match, the section summary is displayed from that document.
  • the user query may be about selecting the document as a whole, rather than searching for items within the document. For example, if the user query is "Sony dvd101 user guide", then the user is probably searching for the entire document. If the user query is "Sony dvd101 focus settings", then the user is probably searching for the section in the document about focus settings.
  • although the search engine may return sections within a matched document, these sections may not be relevant for display if the user query is about the entire document. Rather, each document in the result set is preferably post processed with the following logic to detect this situation.
  • the match masks are used to see if all terms of the query appear in vendor, family, model, title, and document type. If they do, the document's matching section is changed to include section 1, which is the first chapter, and optionally a section with the title including the keyword "Specification.”
  • query results for searches for product documents display the title page of the document as the first section and the specification section (if found) as the second section. The original matched sections from the search engine are ignored.
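  • The whole-document check could be expressed with bit masks as sketched below. Representing each match mask as an integer whose bit i is set when query term i matched that field is an assumption about the mask format, and the field names and function name are hypothetical.

```python
from typing import Dict

CONTEXT_FIELDS = ("vendor", "family", "model", "title", "doctype")


def is_whole_document_query(term_count: int, masks: Dict[str, int]) -> bool:
    """True if every query term matched one of the context fields.

    masks maps a field name to an integer bit mask; bit i is set when
    query term i matched that field (assumed representation).
    """
    if term_count <= 0:
        return False
    all_terms = (1 << term_count) - 1
    matched = 0
    for name in CONTEXT_FIELDS:
        matched |= masks.get(name, 0)
    return (matched & all_terms) == all_terms
```

  • For a four-term query such as "Sony dvd101 user guide", masks of 0b0001 for vendor, 0b0010 for model, and 0b1100 for document type cover all four terms, so the title page and specification section would be shown instead of the matched sections.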
  • FIG. 12 is a flow diagram of a method 1200 for processing a search query in accordance with one embodiment of the present invention.
  • the process 1200 may be implemented in the context of the architecture and environment of FIGs. 1-11. Of course, however, the process 1200 may be carried out in any desired environment.
  • step 1204 at least some of the terms are looked up in a search index for identifying sections of documents containing the at least some of the terms. Additionally, in step 1206, a content score is generated for each of the sections based at least in part on a number of keywords found in the sections of each document. It may also be determined whether the search query includes a product identifier or portion thereof associated with a product, and if so, the documents not relating to the product associated with the product identifier may be filtered out. Further, in step 1208 an indicator of at least one of the sections, or portion thereof, is selected and output based at least in part on the content score.
  • An indicator of a paragraph of at least one of the sections may also be selected and output, where the selecting and outputting of the indicator of the at least one of the sections, or portion thereof, may be based at least in part on types of the documents. It may also be determined whether the search query is directed to an entire document rather than one or more sections thereof, and if so, an indicator of the document may be selected and output instead of the at least one of the sections, or portion thereof, of the document.
  • Additionally, a search may be performed for at least some of the terms in the search index in order to attempt to match one or more of the terms to context information in the search index, where the context information is associated with at least one of the documents.
  • a context score may also be generated based at least in part on the matching of the terms of the context information, where the selection of the at least one of the sections, or portion thereof, is also based at least in part on the context score.
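  • as an illustration only, the selection of a section based on a content score and a context score may be sketched as follows. The names Section, content_score, context_score and best_section are hypothetical, and scoring by simple word counts and adding the two scores are assumptions rather than the scoring actually used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Section:
    doc_id: str
    section_id: int
    text: str
    context: str = ""  # hypothetical context information of the parent document


def content_score(section, terms):
    """Count occurrences of the query terms in the section text."""
    counts = Counter(section.text.lower().split())
    return sum(counts[t.lower()] for t in terms)


def context_score(section, terms):
    """Count query terms that match the document's context information."""
    context_words = set(section.context.lower().split())
    return sum(1 for t in terms if t.lower() in context_words)


def best_section(sections, terms):
    """Select the section with the highest combined score (assumed: plain sum)."""
    return max(
        sections,
        key=lambda s: content_score(s, terms) + context_score(s, terms),
    )
```

  • in this sketch a section whose text repeats a query term, and whose document context also matches a term, outranks a section that matches only once.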
  • an index structure for keyword searches is presented, the index structure being embodied on a computer readable medium, e.g. a hard disk, a magnetic tape, ROM, RAM, optical media, etc.
  • the index structure comprises a plurality of content words.
  • the index structure comprises, for each of the content words, at least one document identifier, e.g. an id, containing information about a document containing the content word.
  • the index structure further comprises at least one position identifier containing information about a section in the document containing the content word.
  • at least some of the position identifiers may further contain information about a paragraph in the section of the document containing the content word.
  • at least some of the position identifiers may include a weighting value of the content word. Further still, the weighting value may be based at least in part on a position of the content word in the document.
  • the index structure may further comprise context meta data associated with at least some of the documents, where the context meta data indicates a context of the documents associated therewith. Additionally, at least some of the context meta data may be weighted.
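  • as an illustration only, an index structure of the kind described above may be sketched as follows. The class names Position and KeywordIndex, the field names, and the use of in-memory dictionaries are assumptions; the disclosure does not specify a storage layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Position:
    """Position identifier: section, optional paragraph, optional weight."""
    section: int
    paragraph: Optional[int] = None
    weight: float = 1.0  # e.g. higher for words found in a title (assumption)


@dataclass
class KeywordIndex:
    # content word -> document identifier -> list of position identifiers
    postings: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(list))
    )
    # document identifier -> {context term: weight} (weighted context meta data)
    context: dict = field(default_factory=dict)

    def add(self, word, doc_id, position):
        self.postings[word.lower()][doc_id].append(position)

    def lookup(self, word):
        """Return {doc_id: [Position, ...]} for a content word, or {}."""
        found = self.postings.get(word.lower(), {})
        return {doc_id: list(positions) for doc_id, positions in found.items()}
```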
  • a search WEB portal may provide an interface for users to enter product queries in a Web browser. After the query is entered, the search results are displayed. Users can navigate the result pages using various hyperlinks to see more results, preview site, as well as submit additional queries.
  • the portal provides unique features such as quick preview, dynamic navigation, and persistent states. Further, the portal provides simple query input control, like other search engines, displays the title, url, and summary of search results, and displays search results in channels. Also, the portal provides channel drill down to see more results, enables users to quickly preview selected search results, provides reasonable "fast" response time, and allows customization of the display.
  • the portal may support any web browser, for example, Internet Explorer 6+ and Firefox 1.5+ on WINDOWS ® 2000/XP and Safari 1.2+ on Mac OS X 10.2+.
  • Fig. 13 is a landing page 1300 in accordance with one embodiment of the present invention.
  • the landing page 1300 may be implemented in the context of the architecture and environment of FIGs. 1-12.
  • the landing page 1300 may be implemented in any desired environment.
  • the logo 1302 displays the company logo.
  • the tagline 1304 displays the company tagline.
  • the tagline can change dynamically by editing a template file without restarting the server.
  • the user input element 1306 is an entry box used by the user to enter the query for the search.
  • the examples element 1308 is an area which contains example queries to educate the user on how to use the system. Like the tagline, this area is dynamic and can change without restarting the server.
  • the other information element 1310 is an informational area used to communicate with the user. This area can also be dynamically updated without restarting the server.
  • the footer element 1312 contains a list of hyperlinks to pages such as about us, terms of use, privacy policy, and feedback.
  • a possible landing page implementation 1400 is shown in FIG. 14, the various portions of which are self-explanatory.
  • Search Result Page
  • FIG. 15 is a search result page 1500 in accordance with one embodiment of the present invention.
  • the search result page 1500 may be implemented in the context of the architecture and environment of FIGs. 1-14. Of course, however, the search result page 1500 may be carried out in any desired environment.
  • the logo element 1502 is a smaller version of the company logo. Clicking on the logo brings the user back to the landing page.
  • the user input element 1504 allows the user to enter another query to search without going to the landing page.
  • the query recommendation element 1506 displays the query recommendation after a search. If there is no query recommendation, then this area is left blank.
  • the other controls element 1508 displays control buttons, for example, "invite friends," "submit feedback,” etc.
  • the main area of the display is divided into a left hand side (LHS) and a right hand side (RHS).
  • the two sides are resizable with a splitter in the middle. Additional controls are also available to close the RHS or expand the RHS.
  • the channel listing element 1510 displays the channels under which the data is displayed. For example, channels may be labeled “Top Results,” “Product Documents,” “Forums & Blogs,” “Reviews & Articles,” “Manufacturer Info,” “Stores,” and “Other.”
  • the search results element 1512 is the main display of search results.
  • Each result is made of a title, a summary, and a URL link to the full data. Pressing in the body of the summary brings up a preview of the full data in the RHS.
  • a search result may be for a Web page or for a section of a PDF document.
  • the search results change based on the selected channel in the channel listing element 1510.
  • the footer element 1514 contains hyperlinks to web pages, for example, "about us,” “terms of use,” “privacy statement,” and "feedback.”
  • the RHS preview element 1516 displays the selected search result from the search results area.
  • the displayed page can be either a PDF document page or the content from a Web site.
  • the preview area is a great way to quickly review the search results without losing the left hand side results.
  • as wide aspect ratio monitors become more common, there is enough horizontal space on the screen to show both the search result and the preview. Users who prefer a traditional view of the search results without the preview can close the preview area entirely.
  • An example of an implementation of the search results page 1500 displaying a PDF document page 1602 is shown in FIG. 16. Another example of an implementation of the search results page 1500 displaying content from a web site is shown in FIG. 17.
  • the user interface may be implemented using Java
  • One preferred embodiment of the process of submitting a query and displaying results between the client browser and the server is shown in FIG. 18.
  • a user enters a query, and a PDF search and query recommendation are performed in step 1804.
  • the PDF results are rendered in step 1806, and in step 1808 a request is sent to AJAX for web results.
  • a web search is performed in step 1810. Further, the web results are rendered in step 1812.
  • a request is sent to AJAX for a preview.
  • Preview content is constructed in step 1816, and in step 1818 it is determined if the preview is a PDF page preview. If it is, in step 1820 a request is sent for the PDF page, and in step 1822 the PDF page is sent and is rendered in the preview RHS in step 1828. If the preview is not a PDF page preview, in step 1824 a request is sent to a web site, and in step 1826 the web page is sent and is rendered in the preview RHS in step 1828.
  • Displaying PDF Page Preview
  • the PDF page preview may be a single PDF page downloaded from our server for display inside the RHS preview pane. Since this page is not HTML, the browser may use a PDF plugin to display the PDF page. Browsers that do not have a PDF plugin may not be able to preview the PDF page. One potential way of resolving that issue may be to generate a graphical image of the page on the server and only serve the resulting image file to the browser. Since most browsers support image display, the latter approach may provide broad compatibility.
  • Displaying Web Site Preview
  • the Web site preview is rendered entirely by the web browser.
  • the browser submits an HTTP request directly to the web site referenced in a search result.
  • the web site is then displayed inside the RHS area in an internal frame.
  • the internal frame is further adjusted such that a zoom factor is applied. As the user moves the slider to expand and shrink the RHS window, the rendered web site content zooms in and out accordingly.
  • Displaying a web site inside an internal frame has a side effect in that some web sites use JavaScript to detect whether they are being rendered inside an internal frame. If so, such a site redirects the browser to the site and displays the site content inside the root window. The user interface code does its best to detect this behavior. Once detected, the client side JavaScript reports the potential problem site to the server. Later, it is verified that the site does have this behavior. If it does, the site is added to a blacklist.
  • This situation may be addressed by deploying a Web browser plugin.
  • the plugin may render the given web site in the RHS internal frame. Because the rendering is done by the plugin, the web site is shown in a "top" level window.
  • a plugin for the WINDOWS ® platform can be easily created by using ActiveX and loading a WebBrowser control that is built into the operating system. Using the WebBrowser control can also provide zoom in/out capabilities. For other platforms, it may be determined how the plugin can be easily implemented.
  • a combination approach may be taken.
  • the existing method may be used with certain sites blacklisted.
  • web preview of all sites may be provided.
  • Having the plugin may also allow the implementation of keyword highlighting inside the web page for the user.
  • the user interface may track the preview result location, the show/hide of the RHS preview window 1516, and the left to right split ratio.
  • Preview result location is maintained as the user navigates away from the search result page and then uses the browser's back button to come back.
  • the page automatically selects the last previewed result.
  • the show/hide and left-to-right split ratio are remembered persistently for the user's browser.
  • Server side persistence may also be implemented for user interface states.
  • server side persistence allows the user interface preferences to be transferred across different browsers. Server based persistence would require the user to sign up for an account.
  • AJAX may be used in the user interface to dynamically load data into various frames. Using AJAX gives the user a feeling of faster response time.
  • the results for the PDF portion of the search are displayed first and quickly, and then the Web search results are displayed.
  • some web browsers may not support AJAX. Examples of such browsers include cellphone/PDA browsers, older versions of desktop browsers, and search engine crawlers. In these situations, a combination of techniques may be used. For browsers supporting AJAX, asynchronous data loading may still be used. For other browsers, a traditional technique of constructing the entire search result content, which includes PDF results and Web results, on the server, and then sending that data to the browser, may be used.
  • FIG. 19 is a flow diagram of a method 1900 for indexing a product identifier and logical parts thereof in accordance with one embodiment of the present invention.
  • the method 1900 may be implemented in the context of the architecture and environment of FIGs. 1-18. Of course, however, the method 1900 may be carried out in any desired environment.
  • a product identifier is received in step 1902.
  • the product identifier is split into logical parts in step 1904. If the product identifier is an alphanumeric character string, the logical parts may include an alphabetic part and a numeric part of the alphanumeric character string. Further, in step 1906 the product identifier and the individual logical parts in association with a particular document or portion thereof are indexed in an index, and the index is stored in step 1908. Also, if the product identifier comprises multiple logical parts separated by a space, punctuation mark, etc., at least some of the logical parts may be indexed as a single consecutive character string.
  • the product identifier may be indexed in a field for full terms, whereas the logical parts may be indexed in a field for partial terms. If the logical parts include an alphabetic part and a numeric part of the alphanumeric character string, the alphabetic part and the numeric part may be each indexed in a field for partial strings, and/or the alphabetic part may be indexed in a field for alphabetic strings.
  • a model number is split into parts and stored in three areas: full, partial, and alpha-only.
  • splitting logic is as follows: Loop 1:
  • Loop 1 Another example of splitting logic, where parameter minLen no longer plays a role, is as follows: Loop 1 :
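  • the loop listings referred to above are not reproduced in this text, and the following does not attempt to reconstruct them. As a separate, minimal sketch of the full, partial, and alpha-only areas, a model number may be split as shown below. The function name split_identifier, the regular expression, the lower-casing, and the choice to store the separator-free string in the full area are all assumptions; the minLen parameter is not modeled.

```python
import re


def split_identifier(identifier):
    """Split a product identifier into full, partial, and alpha-only fields."""
    # Logical parts: alternating runs of letters and digits.
    # Spaces and punctuation act as separators and are dropped.
    parts = [p.lower() for p in re.findall(r"[A-Za-z]+|[0-9]+", identifier)]
    joined = "".join(parts)  # parts as a single consecutive character string

    full = [identifier.lower()]
    if joined != full[0]:
        full.append(joined)

    return {
        "full": full,                                 # field for full terms
        "partial": parts,                             # field for partial terms
        "alpha": [p for p in parts if p.isalpha()],   # alphabetic strings only
    }
```

  • for example, under these assumptions the identifier "DSC-W80" yields the full terms "dsc-w80" and "dscw80", the partial terms "dsc", "w" and "80", and the alphabetic strings "dsc" and "w".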
  • FIG. 20 is a flow diagram of a process 2000 for indexing a product identifier and variations thereof in accordance with one embodiment of the present invention. As an option, the process 2000 may be implemented in the context of the architecture and environment of FIGs. 1-19. Of course, however, the process 2000 may be carried out in any desired environment.
  • a product identifier is received in step 2002.
  • the product identifier is split into logical parts in step 2004. Further, in step 2006 the product identifier and alternate combinations of the logical parts in association with a particular document or portion thereof are indexed in an index, and the index is stored in step 2008. Also, the product identifier may be indexed in a field for full terms, whereas the alternate combinations may be indexed in a field for partial terms.
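  • as an illustration only, alternate combinations of the logical parts may be generated as sketched below. The function name identifier_variations is hypothetical, and enumerating every ordered selection of parts, joined both with and without a space, is an assumption; a practical implementation would likely restrict the combinations, since their number grows factorially with the number of parts.

```python
from itertools import permutations


def identifier_variations(parts):
    """Return alternate combinations of logical parts for the partial-term field."""
    variations = set()
    for size in range(1, len(parts) + 1):
        # Every ordered selection of `size` distinct parts.
        for combo in permutations(parts, size):
            variations.add("".join(combo))   # consecutive character string
            variations.add(" ".join(combo))  # parts separated by a space
    return variations
```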
  • a method for processing a search query is presented. In use, a search query containing one or more terms is received. Further, a search index containing complete product identifiers and variations thereof is searched for attempting to match the one or more terms to the product identifiers or the variations thereof.
  • the variations may include a partial product identifier, a reordered product identifier, a modified product identifier, etc. Additionally, if one or more of the terms matches a complete product identifier or variation thereof, an indicator of the document or a portion thereof associated with the matching product identifier is selected and output. If one or more of the terms does not match a complete product identifier or variation thereof, an attempt may be made to make a best match between the one or more of the terms and the product identifiers and variations thereof, and possible matches may be output for user selection.
  • the variations of the product identifiers may include at least one of: parts of the product identifiers, continuous character strings, reordered logical parts of the product identifiers, alphabetical characters only, and numerical characters only.
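  • as an illustration only, the matching of a query term against indexed product identifiers and variations, with a best-match fallback, may be sketched as follows. The function name match_identifier, the dictionary-based index, and the use of the Python standard library function difflib.get_close_matches with a cutoff of 0.6 are assumptions; the disclosure does not specify how the best match is computed.

```python
import difflib


def match_identifier(term, index):
    """Match a query term against product identifiers and their variations.

    `index` maps each identifier or variation (lower case) to a document id.
    Returns ("exact", [doc_id]) on a match, otherwise
    ("suggestions", [closest keys]) for presentation to the user.
    """
    key = term.lower()
    if key in index:
        return "exact", [index[key]]
    # No exact match: offer up to three near matches for user selection.
    candidates = difflib.get_close_matches(key, list(index), n=3, cutoff=0.6)
    return "suggestions", candidates
```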
  • code as used herein, or “module”, as used herein, may be any plurality of binary values or any executable, interpreted or compiled code which can be used by a computer or execution device to perform a task.
  • This code or module can be written in any one of several known computer languages.
  • a “module,” as used herein, can also mean any device which stores, processes, routes, manipulates, or performs like operation on data.
  • An “incoming communication device” and “outgoing communication device” may be any communication devices which can be used for taking fax information and inputting the fax information into a module.
  • a "text file” or “textual format”, as used herein, may be any data format for efficiently storing alphanumerical data.
  • a text file or text format is any data structure which identifies individual alphanumeric characters, letters, or language characters from any faxed transmission.
  • a "string”, as used herein, is one or more alpha numeric or textual characters which are identified as being part of a group (such as a human name). It is to be understood, therefore, that the various embodiments of this invention are not limited to the particular forms illustrated and that it is intended in the appended claims to cover all possible modifications of the teachings herein.
  • various embodiments discussed herein are implemented using the Internet as a means of communicating among a plurality of computer systems.
  • One skilled in the art will recognize that the present invention is not limited to the use of the Internet as a communication medium and that alternative methods of the invention may accommodate the use of a private intranet, a LAN, a WAN, a PSTN or other means of communication.
  • various combinations of wired, wireless (e.g., radio frequency) and optical communication links may be utilized.
  • the program environment in which a present embodiment of the invention may be executed illustratively incorporates one or more general-purpose computers or special-purpose devices such as facsimile machines and hand-held computers. Details of such devices (e.g., processor, memory, data storage, input and output devices) are well known and are omitted for the sake of clarity.
  • the techniques presented herein might be implemented using a variety of technologies.
  • the methods described herein may be implemented in software running on a computer system, or implemented in hardware utilizing either a combination of microprocessors or other specially designed application specific integrated circuits, programmable logic devices, or various combinations thereof.
  • methods described herein may be implemented by a series of computer-executable instructions residing on a storage medium such as a carrier wave, disk drive, or computer-readable medium.
  • Exemplary forms of carrier waves may be electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.
  • although specific embodiments of the invention may employ object-oriented software programming concepts, the invention is not so limited and is easily adapted to employ other forms of directing the operation of a computer.
  • Various embodiments can also be provided in the form of a computer program product comprising a computer readable medium having computer code thereon.
  • a computer readable medium can include any medium capable of storing computer code thereon for use by a computer, including optical media such as read only and writeable CD and DVD, magnetic memory, semiconductor memory (e.g., FLASH memory and other portable memory cards, etc.), etc.
  • software can be downloadable or otherwise transferable from one computing device to another via network, wireless link, nonvolatile memory device, etc.
  • FIG. 21 illustrates a network architecture 2100, in accordance with one embodiment.
  • a plurality of remote networks 2102 are provided including a first remote network 2104 and a second remote network 2106.
  • a gateway 2107 may be coupled between the remote networks 2102 and a proximate network 2108.
  • the networks 2104, 2106 may each take any form including, but not limited to a LAN, a WAN such as the Internet, PSTN, internal telephone network, etc.
  • the gateway 2107 serves as an entrance point from the remote networks 2102 to the proximate network 2108.
  • the gateway 2107 may function as a router, which is capable of directing a given packet of data that arrives at the gateway
  • At least one data server 2114 is coupled to the proximate network 2108 and is accessible from the remote networks 2102 via the gateway
  • the data server(s) 2114 may include any type of computing device/groupware. Coupled to each data server 2114 is a plurality of user devices 2116.
  • Such user devices 2116 may include a desktop computer, lap-top computer, hand-held computer, printer or any other type of logic. It should be noted that a user device 2117 may also be directly coupled to any of the networks, in one embodiment.
  • a facsimile machine 2120 or series of facsimile machines 2120 may be coupled to one or more of the networks 2104, 2106, 2108.
  • a network element may refer to any component of a network.
  • FIG. 22 shows a representative hardware environment associated with a user device 2116 of FIG. 21, in accordance with one embodiment.
  • a user device 2116 of FIG. 21 illustrates a typical hardware configuration of a workstation having a central processing unit 2210, such as a microprocessor, and a number of other units interconnected via a system bus 2212.
  • the workstation shown in FIG. 22 includes a Random Access Memory (RAM), Read Only Memory (ROM), an I/O adapter 2218 for connecting peripheral devices such as disk storage units 2220 to the bus 2212, a user interface adapter 2222 for connecting a keyboard 2224, a mouse 2226, a speaker 2228, a microphone 2232, and/or other user interface devices such as a touch screen and a digital camera (not shown) to the bus 2212, a communication adapter 2234 for connecting the workstation to a communication network 2235 (e.g., a data processing network), and a display adapter 2236 for connecting the bus 2212 to a display device 2238.
  • the workstation may have resident thereon an operating system such as the Microsoft Windows® Operating System (OS), a MAC OS, or UNIX operating system.
  • a preferred embodiment may also be implemented on platforms and operating systems other than those mentioned.
  • a preferred embodiment may be written using JAVA, XML, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology.
  • Object oriented programming (OOP), which has become increasingly used to develop complex applications, may be used.
  • FIG. 23 illustrates a network architecture 2300, in accordance with one embodiment.
  • a plurality of networks 2302 is provided.
  • the networks 2302 may each take any form including, but not limited to a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, peer-to-peer network, etc.
  • Coupled to the networks 2302 are servers 2304 which are capable of communicating over the networks 2302. Also coupled to the networks 2302 and the servers 2304 is a plurality of clients 2306. Such servers 2304 and/or clients 2306 may each include a desktop computer, lap-top computer, hand-held computer, mobile phone, smart phone and other types of mobile media devices (with or without telephone capability), personal digital assistant (PDA), peripheral (e.g. printer, etc.), any component of a computer, and/or any other type of logic. In order to facilitate communication among the networks 2302, at least one gateway 2308 is optionally coupled therebetween.
  • Figure 24 shows a representative hardware environment that may be associated with the servers 2304 and/or clients 2306 of Figure 23, in accordance with one embodiment.
  • Such figure illustrates a typical hardware configuration of a workstation in accordance with one embodiment having a central processing unit 2410, such as a microprocessor, and a number of other units interconnected via a system bus 2412.
  • The workstation shown in Figure 24 includes a Random Access Memory (RAM), Read Only Memory (ROM), an I/O adapter 2418 for connecting peripheral devices such as disk storage units 2420 to the bus 2412, a user interface adapter 2422 for connecting a keyboard 2424, a mouse 2426, a speaker 2428, a microphone 2432, and/or other user interface devices such as a touch screen (not shown) to the bus 2412, a communication adapter 2434 for connecting the workstation to a communication network 2435 (e.g., a data processing network), and a display adapter 2436 for connecting the bus 2412 to a display device 2438.
  • the workstation may have resident thereon any desired operating system.
  • an embodiment may also be implemented on platforms and operating systems other than those mentioned.
  • One embodiment may be written using JAVA, C, and/or C++ language, or other programming languages, along with an object oriented programming methodology.
  • Object oriented programming (OOP) has become increasingly used to develop complex applications.
  • Figure 25 shows a method 2500 for generating value-based information, in accordance with one embodiment.
  • the method 2500 may be carried out in the context of the architecture and environment of Figures 23 and/or 24. Of course, however, the method 2500 may be carried out in any desired environment.
  • in operation 2502, under control of a computer and/or manually, statistical data is generated for particular features of a plurality of products based on prices of the products. Additionally, in operation 2504 a base score for each of the features is generated based on the statistical data.
  • a product feature score is computed for the product based on the base scores of the features that the product has.
  • a representation of a value of each of the at least some of the products in relation to each other is output, where the representation of the value is based on the product feature score and the price for each of the products.
  • Figure 26 illustrates a method 2600 for displaying product information, in accordance with one embodiment.
  • the method 2600 may be implemented in the context of the architecture and environment of Figures 23-25.
  • the method 2600 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • a feature to price distribution is approximated for each of a plurality of features of a plurality of products. Additionally, in operation
  • a product feature score is computed for each of at least a subset of the products.
  • data corresponding to a visual representation of the at least a subset of the products in relation to each other is output based on the product feature scores and prices of each of the at least a subset of the products.
  • Figure 27 illustrates a method 2700 for displaying product information, in accordance with another embodiment.
  • the method 2700 may be implemented in the context of the architecture and environment of Figures 23-26.
  • the method 2700 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • a value is assigned to each of a plurality of features of a plurality of products. Additionally, in operation 2704 a product feature score is computed for each of at least a subset of the products.
  • data corresponding to a visual representation of the at least a subset of the products in relation to each other is output based on the product feature scores and prices of each of the at least a subset of the products.
  • Figure 28 illustrates a method 2800 for displaying product information, in accordance with yet another embodiment.
  • the method 2800 may be implemented in the context of the architecture and environment of Figures 23-27.
  • the method 2800 may be implemented in any desired environment.
  • the aforementioned definitions may apply during the present description.
  • a value of each of a plurality of products relative to the other products is determined, where the values are based on features and prices of the products. Further, in operation 2804 data corresponding to a visual representation of the products in relation to each other is output based on the value of the products in relation to each other.
  • Figure 29 illustrates a method 2900 for displaying product information, in accordance with still yet another embodiment.
  • the method 2900 may be implemented in the context of the architecture and environment of Figures 23-28. Of course, however, the method 2900 may be implemented in any desired environment. Yet again, it should be noted that the aforementioned definitions may apply during the present description.
  • a value of each of a plurality of products relative to the other products is determined, where the values are based on features and prices of the products.
  • data corresponding to a visual representation of the products in relation to each other is output on a plot of features vs. product price based on the value of the products in relation to each other.
  • user input specifying a subset of the products is received, and data corresponding to a visual representation of the subset of products is output.
  • statistical data may include any data that is statistical in nature or based on statistical data of any type.
  • statistical data may include value data.
  • statistical data may be plotted on a graph.
  • the statistical data may be represented as a function of the number of products containing the feature vs. the price of the product containing the feature.
  • the plurality of products may include any product available for purchase by a customer.
  • the products may include automobiles, televisions, insurance, etc.
  • the features of the plurality of products may include any features of the product.
  • the features of the product may include screen size, screen resolution, weight, etc.
  • the features of the product may include size, efficiency, color, etc.
  • features may include not only physical or operational features of the products, but also intangibles such as manufacturer, market buzz, e.g. as reflected in commercial publications/web pages, prestige, estimated reliability, etc.
  • the price of the products may include any monetary value for which the product may be sold.
  • generating the statistical data may include, for a particular product feature, associating each of the products with at least one of a plurality of price bins based on an actual price of the product; and, for each price bin, determining a number of products having the particular product feature.
  • generating the statistical data and/or approximating the feature to price distribution may include, for a particular product feature, selecting a plurality of price bins; and, for each price bin, determining a number of products in each price bin having the particular product feature.
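  • as an illustration only, the association of products with price bins and the per-bin count of products having a particular feature may be sketched as follows. The function name feature_price_histogram, the fixed bin width of 100, and the representation of each product as a (price, set of features) pair are assumptions.

```python
from collections import Counter


def feature_price_histogram(products, feature, bin_width=100):
    """Count, for each price bin, the products that have `feature`.

    `products` is an iterable of (price, set_of_features) pairs.
    Returns {bin_lower_bound: count}, ordered by increasing price.
    """
    counts = Counter()
    for price, features in products:
        if feature in features:
            # Associate the product with a bin based on its actual price.
            bin_start = int(price // bin_width) * bin_width
            counts[bin_start] += 1
    return dict(sorted(counts.items()))
```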
  • generating the base score for each of the features based on the statistical data may include using the statistical data itself.
  • generating the base score may include determining a mean of the statistical data.
  • generating the base score may include determining a standard deviation of the statistical data.
  • generating the base score for each of the features based on the statistical data may include using the mean and the standard deviation of the statistical data.
  • the base score may include a monetary value.
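  • as an illustration only, a base score expressed as a monetary value may be derived from the mean and standard deviation of the prices of the products having a feature, as sketched below. The function name base_score and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def base_score(prices_with_feature):
    """Return (mean, standard deviation) of prices of products having a feature.

    The mean serves as a monetary base score for the feature; the standard
    deviation indicates how tightly the feature is tied to a price level.
    """
    return mean(prices_with_feature), pstdev(prices_with_feature)
```

  • a feature found only in a narrow price range yields a small standard deviation, which may be used when weighting the feature.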
  • computing the product feature score may include summing the base scores of the features that the particular product has. Additionally, in one embodiment, each of the base scores may be given a weighting prior to the summing. Further, in another embodiment, the weighting may be based on at least one of a standard deviation of a feature to price distribution for each of the features of the products, a manually-defined value, and a statistically computed value based at least in part on prices of the products. Further, in one embodiment the product feature score may include a final feature for the product.
  • computing the product feature score for a particular one of the products may include summing statistical derivatives of the feature to price distributions of the features of the particular product. Additionally, in another embodiment, each of the statistical derivatives may be given a weighting. In yet another embodiment, the weighting may be based on at least one of a standard deviation of the feature to price distribution, a manually-defined value, and a statistically computed value.
  • the representation of the value of each of the at least some of the products in relation to each other may include display data. In another example, the representation of the value of each of the at least some of the products in relation to each other may include data for use by another process which may ultimately output something based on the data. In yet another example, the representations of the values of the at least some of the products in relation to each other may be plotted on a chart of price vs. features.
  • determining a value of each of the plurality of products relative to the other products may include computing the values as set forth herein.
  • the value can be simply retrieved or received from a database or third party.
  • any portions and/or combinations of the above techniques may be used in obtaining the values.
  • the data corresponding to a visual representation of the products in relation to each other based on the value of the products in relation to each other may be raw display data, data for transmission to a remote computer (e.g., HTML, XML, etc.), or any other type of data that can be manipulated or converted for display.
  • various embodiments of the present invention may be referred to individually and collectively as "product snapshots", which relate to the visual presentation of product information.
  • One goal of the product snapshot is to quickly give the user a high-level understanding of a product to which the snapshot relates.
  • product snapshots include visual methods of presenting key facts surrounding a product. This information may be objective.
  • a product snapshot may be useful in many cases. For example, a user may want to know what kind of product they are searching for before spending more time researching it. In another example, a user may encounter a deal for a product, and may want to obtain a quick understanding of that product to better evaluate the deal. In still another example, a summary may be syndicated to a partner's product page in order to complement it.
  • a user may want to learn about most or all of a particular group of products in a quick and efficient manner.
  • it may be desirable to utilize the product snapshot as a method of navigating from a category to products of interest.
  • the product snapshot may be easy to understand.
  • the product snapshot may be presented in a simple manner.
  • the product snapshot may enable a user to view the product snapshot and immediately view the price vs. features of a product, enabling the user to determine whether the product is high value.
  • the product snapshot may provide the user with minimal text relative to the known total amount of product information. In this way, users of the product snapshot are not overwhelmed, as more information may be available to the user that is hidden at the primary viewing level but that can be viewed at another level.
  • the product snapshot may be standardized across all products and product categories. In this way, a consistent look and presentation may be maintained.
  • the product snapshot may be versatile in that the same methodology and presentation may work for any subset of products. For example, a product snapshot may be presented for a category, a subset of categories, different categories, etc.
Design
  • One exemplary embodiment is illustrated in Figure 30.
  • one or more product facts, product features, etc. are chosen to be presented using a box diagram.
  • Box diagram 3000 displays four dimensions of information.
  • the first dimension (y-axis) displayed by the box diagram 3000 includes the relative feature level of the product 3002. Additionally, the second dimension (x-axis) displayed by the box diagram 3000 includes the relative price level of the product 3002.
  • the third dimension displayed by the box diagram 3000 includes the popularity level of the product 3002, which may be illustrated by the size of an icon representing the product 3002. In one embodiment, the popularity level of the product 3002 may be determined from other sites.
  • the popularity level of the product 3002 may be illustrated by an element other than the size of the icon representing the product 3002.
  • a subset of the visual representations may be highlighted based on defined criteria.
  • the highlighting may include using a different color text, a different icon type, a different icon or text size, etc.
  • the criteria may include such criteria as most popular item, items currently being co-displayed on the user interface, an item selected by a user, best value, etc.
  • the popularity level of the product 3002 may be illustrated by the color of the icon, the shape of the icon, whether the icon is flashing or not, etc.
  • one or more additional elements may be incorporated into the appearance of the icon representing the product 3002 in the box diagram 3000.
  • the icon may be sponsored by a third party, and may include a logo or advertisement provided by the third party.
  • the icon may visually indicate whether the product 3002 is currently on sale.
  • the information used to determine whether to visually indicate that the product 3002 is on sale may be determined by researching one or more online resources utilizing a web crawler or other means.
  • any variety of visual elements may be incorporated into the appearance of the icon representing the product 3002.
  • the fourth dimension displayed by the box diagram 3000 includes the feature/price of the product 3002 relative to other popular products in this category. For example, this may be shown by the location and/or coordinates of the icon representing the product 3002 in the box diagram 3000 relative to icons of other popular products in this category. In one embodiment, the position of the icon may be continuously updated. In another embodiment, the position of the icon may be updated at regularly scheduled intervals. Of course, however, the position of the icon may be updated in any manner. In this way, the position of the icon may always be relative to current statistical information regarding the product 3002.
  • the icon may be moved to a different location on the box diagram 3000 and the icon may additionally be highlighted. This may provide superior visual indicators with respect to the deal over a static product listing.
  • box diagram 3000 may be accompanied by a second diagram containing detailed information about the product 3002.
  • the second diagram may include a summary of the features of the product 3002, a price of the product 3002, other product facts for the product 3002, etc.
  • additional information may be incorporated into the box diagram 3000.
  • a plurality of icons representing additional products may be placed in the box diagram 3000 to illustrate where all products are for a category.
  • a plurality of icons representing one or more manufacturers may be placed in the box diagram 3000 to show who makes what type of product.
  • a plurality of icons representing one or more stores may be placed in the box diagram 3000 to show who carries high-end vs. low-end products.
  • one or more portions of the box diagram 3000 may be sponsored.
  • one or more icons may be added to the box diagram 3000 that indicate deals on related products available that day (e.g. "daily deals").
  • the filtering may display only icons for products manufactured by a particular manufacturer.
  • the filtering may display only icons for products that are sponsored.
  • one or more visual indicators may appear when a user interacts with the box diagram 3000.
  • one or more pop-ups may appear when the user hovers over the icon representing the product 3002.
  • one or more pop-ups may appear when the user clicks on the icon representing the product 3002.
  • the visual indicators may appear when a user interacts in any manner with any element of the box diagram 3000.
  • the price and feature level criteria used in the box diagram 3000 may be used as anchoring dimensions for the incorporation of additional information in the aforementioned embodiments.
  • the box diagram 3000 in Figure 30 is a summarized piece of information regarding a particular feature for a variety of products within a category.
  • price information, popularity information, or other feature information may be obtained and/or extracted from one or more sources.
  • the price and popularity information may be provided by a third party source, one or more partners, one or more web crawlers, manual data entry, etc.
  • the features may include Boolean data (e.g., whether the product has a particular feature), range data (e.g. megapixel size of a digital camera, screen size of a television, etc.).
  • weight may be given to each individual feature element, based on a comparison with a global universe of products in the category in which the feature is located, and the prices of the products that have the feature.
  • the probability distribution of a given feature with respect to price may be approximated by dividing the price range into a large number of intervals. These intervals may be selected uniformly, non-uniformly, based on some statistical distribution, etc.
  • the products may be arranged by price, and an interval may be selected at every fifty dollar price increase.
  • the products may be arranged by price, and an interval may be selected after every ten products.
  • the actual algorithm used to define these intervals may be determined in any manner.
  • the products, having been organized by price, have further been divided into n intervals separated by the following n+1 points: $0, P_1, P_2, \ldots, P_n$.
  • the next step may involve counting the number of occurrences of the feature within each price interval.
  • a resulting histogram may define the distribution of the feature in terms of price. For example, a price range graph may be created for the feature.
  • the mean and standard deviation $(f_{avg}, f_{std})$ of the feature are then computed based on the distribution. For example, the mean may be calculated by multiplying the frequency of the feature by the value of the product containing that feature in terms of price. This may be utilized to create a weighted value for the feature.
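  • One way to compute the mean and standard deviation from the histogram is sketched below. It assumes each interval is represented by its midpoint price, which is a simplification not stated in the text; the function name `feature_price_stats` and the sample counts are hypothetical.

```python
import math

def feature_price_stats(boundaries, counts):
    """Mean and standard deviation of price for one feature.

    `boundaries` holds the n+1 interval end points (0, P_1, ..., P_n) and
    `counts[k]` is the number of products with the feature whose price
    falls between boundaries[k] and boundaries[k + 1].
    """
    midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    total = sum(counts)
    if total == 0:
        raise ValueError("feature does not occur in any interval")
    mean = sum(m * c for m, c in zip(midpoints, counts)) / total
    variance = sum(c * (m - mean) ** 2 for m, c in zip(midpoints, counts)) / total
    return mean, math.sqrt(variance)

# Hypothetical histogram: six products spread over two middle intervals
f_avg, f_std = feature_price_stats([0, 500, 1000, 1500, 2000], [0, 3, 3, 0])
```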
  • the price range graph for the 40 inch screen feature may be analyzed. If the mean price of products with a 40 inch screen is 1000 dollars, but the standard deviation is large, then the feature varies greatly between products. Therefore, the weight of the value given to the 40 inch screen feature may be reduced.
  • An example of a graph 3300 of a feature with a large standard deviation is shown in Figure 33. As shown, products in all price ranges have the feature.
  • In another example, if the product is a television with a 50 inch screen, and the feature value to be calculated is for the screen size of the television, the price range distribution for the 50 inch screen feature may be analyzed. If the mean price of products with a 50 inch screen is 4000 dollars, but the standard deviation is small, then it is more likely that the value of the feature is consistent between products. Therefore, the weight of the value given to the 50 inch screen feature may be increased. An example of a graph 3400 for a feature with a small standard deviation is shown in Figure 34.
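  • The text says only that a large standard deviation may reduce a feature's weight and a small one may increase it; it gives no formula. The sketch below uses one possible choice, the reciprocal of one plus the coefficient of variation, purely as an assumption. The name `spread_weight` and the standard deviations of 800 and 200 dollars are hypothetical.

```python
def spread_weight(mean_price, std_price):
    """Weight in (0, 1] that shrinks as the price spread grows.

    Uses 1 / (1 + std / mean); this formula is an illustrative
    assumption, not one given in the text.
    """
    if mean_price <= 0:
        raise ValueError("mean price must be positive")
    return 1.0 / (1.0 + std_price / mean_price)

wide = spread_weight(1000.0, 800.0)    # 40 inch screen, large spread
narrow = spread_weight(4000.0, 200.0)  # 50 inch screen, small spread
```

  • With these inputs the 50 inch feature receives the larger weight, matching the direction of adjustment described in the two examples.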
  • the value of the feature may be weighted based on the type of feature that is analyzed. For example, if a television contains a plasma flat panel display, and plasma displays are a known high quality component, then the value of the feature may be increased.
  • a score may be computed for the value of a feature, and may be used in the computation of the final feature value for a product.
Identifying features from numeric attributes
  • The previous sub-section defines steps that may be used to compute the mean and standard deviation for the feature. In case of nominal attributes, ordinal attributes, and/or any other attributes having a finite set of fixed values, each different feature for the attribute may become an independent feature. For example, the maximum display format supported (1080p, 720p, etc.) for the product is a nominal attribute. In this case, 1080p, 720p, etc., may become individual features for the "Display Format Supported" attribute.
  • the attribute may include the screen size of the product (e.g., 20 inches, 25 inches, 30 inches, etc.).
  • the attribute may include a Boolean value.
  • the attribute may indicate whether the product has an LCD display.
  • it may be assumed that, in case of a nominal attribute, almost all products in a category will have one of the values already seen for the training products.
  • a mechanism is needed to convert the real values into a finite set of values.
  • the real values may need to be converted into ordinal (finite, but ranked set) or nominal values.
  • a set of rules may convert the real values into a small set of values. Examples of real attributes may include "Dimensions" (such as height, weight, width, length, etc.), "Resolutions,” "Focal Length,” etc.
  • a range of values may be grouped together to form a finite set of values.
  • the weight of a product may be organized as a finite set of values including the range of values of 10-12 pounds, 12-14 pounds, 14-17 pounds, 17-20 pounds, 21 or more pounds, etc.
  • the screen size of a product may be organized as a finite set of values including the range of screen size values.
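  • Converting a real-valued attribute into a finite, ranked set of values can be sketched as below. The cut points, the labels, and the function name `to_ordinal` are hypothetical, and the choice that a value equal to a cut point falls into the higher range is an assumption.

```python
import bisect

def to_ordinal(value, cut_points, labels):
    """Map a real value to one of len(cut_points) + 1 ordered labels.

    A value equal to a cut point falls into the higher range.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect.bisect_right(cut_points, value)]

# Hypothetical screen-size ranges, in inches
cuts = [32, 42, 52]
names = ["under 32", "32-41", "42-51", "52 or more"]
size_class = to_ordinal(40, cuts, names)
```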
  • a final feature value may be calculated for the product by summing all the individual feature values for the product. This sum may be weighted based on the standard deviation, mean, etc. for the individual feature values. As a result, a "low,” “mid,” or “high” rating for the product based on the feature values rates the product not just with respect to price, but also with respect to feature value.
  • the final feature value for a product may be defined as a weighted sum of the mean (or other measures such as median, min, max, an arbitrary percentile, etc.) for the value of each individual feature that a product has. Note that entries for any real attributes may need to be converted into respective ordinal or nominal attributes.
  • the final feature may be calculated, for example, using the equation shown in Table 18.
  • $w_i$ represents the weight for feature $i$ and $f_{avg,i}$ represents the mean for feature $i$.
  • the weights may be based on the standard deviations of the features, may be manually defined, or may be statistically computed based on the ability of the feature to discriminate products (which may be determined utilizing a combination of the standard deviation and a spread of distribution).
  • the final feature value, F, may define the price interval that a particular product belongs to based on all its features (e.g., where the product falls on a product snapshot in comparison to other products). This can be thought of as the facts-based value of a product as compared to a list of currently available products.
  • a television with a resolution of 1080i has a mean price of 1500 dollars
  • a television with a screen size of 40 inches has a mean price of 1000 dollars
  • the feature values may be weighted for more accuracy.
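  • The weighted sum described above can be sketched as follows, using the two mean prices from the television example. The equal weights of 0.5, the feature keys, and the function name `final_feature_value` are assumptions made for illustration.

```python
def final_feature_value(feature_means, weights):
    """Weighted sum of the per-feature means over a product's features."""
    missing = set(feature_means) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * mean for name, mean in feature_means.items())

# Mean prices from the television example; the equal weights are assumed.
means = {"resolution=1080i": 1500.0, "screen=40in": 1000.0}
weights = {"resolution=1080i": 0.5, "screen=40in": 0.5}
F = final_feature_value(means, weights)
```

  • In practice the weights might instead be derived from each feature's standard deviation or set manually, as the surrounding text allows.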
  • each product may then be plotted in terms of its computed facts-based price and its actual price in order to get the product snapshot.
  • LO, MID, and HI ranges may be determined based on the distribution of the products in the snapshot. For example, the ranges may be determined based on one or more gaps in the distribution of the products. As a result, the ranges may be based on the final feature value for all products.
Further Tailoring of Snapshot Computation
Missing Features
  • products with missing attribute values may exist.
  • a given feature may pull a product towards a particular value (e.g., a price interval). If a feature was absent during training and is later seen while classifying a new product, then this feature may be added to the training at a later stage. In one embodiment, new features may be flagged and incorporated into training in the next classification iteration.
  • In another example, a feature that is missed during training may be noticed during classification. This feature may be marked or flagged as not having been looked into during training. As a result, during the next training session, the feature may be added to the training set. As a result, the final feature value may be more accurately calculated.
  • a new feature may be added to the product after training has occurred. This feature may be flagged and included in retraining. If the feature is only found in a few products, the weight of the feature may be lowered. However, as more products implement the feature, the weight of the feature may rise.
  • In another embodiment, a product may happen to have a missing attribute during classification. As a result, it may become difficult to compare the product to other products in the same category by using the snapshot. However, various embodiments of the present invention include a method to handle the occurrence of missing attributes during classification.
  • the screen size of a flat panel television may not be available in the feature specifications retrieved from data from a partner source.
  • this and other values may be automatically calculated and manually entered during classification.
  • an unavailable value may be estimated and manually entered during classification.
  • similar products to the particular product within the category may be determined by searching for features that the particular product is known to have. These similar products may be examined in order to estimate the values of any unavailable feature specifications.
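  • Estimating an unavailable value from similar products might look like the sketch below. The similarity test (sharing at least `min_shared` features), the use of the median, the data layout, and the name `estimate_missing` are all assumptions; the text does not specify how similarity or the estimate is computed.

```python
from statistics import median

def estimate_missing(target, others, attribute, min_shared=1):
    """Estimate `attribute` for `target` from products sharing its features.

    Each product is a dict with a "features" set and an "attributes" dict.
    Returns the median value among similar products, or None if there are none.
    """
    values = [
        p["attributes"][attribute]
        for p in others
        if attribute in p["attributes"]
        and len(p["features"] & target["features"]) >= min_shared
    ]
    return median(values) if values else None

# Hypothetical flat panel television with no screen size on record
tv = {"features": {"plasma", "1080p"}, "attributes": {}}
peers = [
    {"features": {"plasma", "1080p"}, "attributes": {"screen_in": 50}},
    {"features": {"plasma"}, "attributes": {"screen_in": 42}},
    {"features": {"lcd"}, "attributes": {"screen_in": 32}},
]
estimate = estimate_missing(tv, peers, "screen_in")
```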
  • the training set may have very few entries for a given feature.
  • the feature may disproportionately affect the final feature score.
  • the computed feature distribution may have a lower accuracy when very few entries exist.
  • these attributes may need special handling.
  • This issue can be identified for ordinal attributes (e.g., attributes whose values are ranked in an order) when a lower ranked attribute value is determined to have a higher mean feature value than a higher ranked attribute value.
  • an available training data set may yield a higher value for "Contrast” ratio value of "5000:1" than for "10000:1" due to only one high priced product having the value "5000:1”, whereas a number of lower priced products may have the "10000:1" value.
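  • The check for this kind of inversion can be sketched as follows. The mean values are invented to mirror the contrast ratio example, and the function name `find_rank_inversions` is hypothetical; only adjacent ranks are compared, which is a simplifying assumption.

```python
def find_rank_inversions(ranked_values, mean_price):
    """Return adjacent pairs where a lower-ranked value has the higher mean.

    `ranked_values` lists attribute values from lowest to highest rank;
    `mean_price` maps each value to its computed mean feature value.
    """
    return [
        (low, high)
        for low, high in zip(ranked_values, ranked_values[1:])
        if mean_price[low] > mean_price[high]
    ]

# Hypothetical means: one expensive "5000:1" product skews its value upward
contrast = ["1000:1", "5000:1", "10000:1"]
contrast_means = {"1000:1": 600.0, "5000:1": 2400.0, "10000:1": 1500.0}
inversions = find_rank_inversions(contrast, contrast_means)
```

  • Any pair returned would be a candidate for the manual override or computer generated estimate discussed next.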
  • manual overriding and/or computer generated values/estimates may be used to correct these features.
  • an available training data set may include a single product with a "10000:1" contrast ratio value. If it can be determined that a higher contrast ratio value is a more desirable feature, a weight can be manually assigned to the feature, despite the fact that a single product has the feature. This manual assignment may be automatically recognized. A feature may be determined to be more desirable in a variety of ways.
  • an inherent ordering may exist.
  • the ordering of a maximum resolution of products within a category may be inherent (e.g., "1080p," "720p," "480p," etc.).
  • an order of features may be manually assigned.
  • a number of other well-known mathematical techniques may be applied to approximate or optimally determine a feature's "inherent value.”
  • the values may be manually set to automatic estimation.
  • a feature's inherent value may be computed in terms of the product's price.
  • some other attribute may be selected instead of the price to compute the inherent value for a feature.
  • Each product, $prod_i$, consists of a list of features, $f_{(i,j)}$. Let the inherent value of a feature $f$ be represented by $I(f)$. Let $p_i$ represent the price for the product $prod_i$. Assuming that the features define a product's price, we can represent this by the equation illustrated in Table 19.
  • the system of equations may be solved for each $I(f_{(i,j)})$.
  • There are a number of optimization and approximation techniques that have been developed for solving such a system of linear equations.
  • An example would be Least Squares Approximation.
  • Nonlinear (polygonal, Gaussian, quadratic) approaches may also be used to represent and solve such a system.
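  • A least squares solution of such a system can be sketched as below, assuming the linear model in which a product's price is approximated by the sum of the inherent values of its features. The function name `solve_inherent_values` and the sample prices are hypothetical. The sketch solves the normal equations by Gaussian elimination and requires the features to be linearly independent across the products.

```python
def solve_inherent_values(products, features):
    """Least-squares estimate of I(f), with price ~ sum of I(f) per product.

    `products` is a list of (price, set_of_features). Solves the normal
    equations (A^T A) x = A^T p by Gauss-Jordan elimination with pivoting.
    """
    n = len(features)
    rows = [[1.0 if f in feats else 0.0 for f in features] for _, feats in products]
    prices = [price for price, _ in products]
    # Build the augmented matrix [A^T A | A^T p].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * p for r, p in zip(rows, prices))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return {f: aug[i][n] / aug[i][i] for i, f in enumerate(features)}

# Hypothetical products: (price, features)
feature_names = ["1080p", "40in", "plasma"]
catalog = [
    (1500.0, {"1080p", "40in"}),
    (1300.0, {"40in", "plasma"}),
    (1200.0, {"1080p", "plasma"}),
    (2000.0, {"1080p", "40in", "plasma"}),
]
inherent = solve_inherent_values(catalog, feature_names)
```

  • For larger catalogs a numerical library would normally be preferred over hand-written elimination; the pure-Python form is used here only to keep the sketch self-contained.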
  • a "baseline" feature level may be determined for one or more products. See graph 3100 in Figure 31. For example, all products that are predetermined to fall within a certain classification, e.g., of a certain type, having a specific feature, etc., may be determined, and a histogram may be plotted according to the prices of the products. For example, the graph 3100 of Figure 31 depicts a number of products having a specific feature vs. the price of the products. Additionally, the minimum, maximum, median, and standard deviation of the prices of the products may be calculated, and based on these values, the products may be divided into three sections: a low section 3102, a mid section 3104, and a high section 3106.
  • Pricing is utilized in the current example to make the initial division because pricing may roughly determine the type of the product. For example, in the consumer's mind, "high-end” may be determined by the product's feature set, manufacturer brand, price, quality, buzz, and other factors. Market pricing may capture these factors. Therefore, using the pricing alone, the initial "training" set may be created for high, mid, and low end products.
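  • The text does not say how the median and standard deviation are turned into section boundaries. The sketch below assumes cut points half a standard deviation on either side of the median; that rule, the `spread` parameter, the name `baseline_sections`, and the prices are all hypothetical.

```python
from statistics import median, pstdev

def baseline_sections(prices, spread=0.5):
    """Label each price "low", "mid", or "high".

    Cut points sit `spread` standard deviations on either side of the
    median; this rule is an assumed example, not one given in the text.
    """
    mid_price = median(prices)
    band = spread * pstdev(prices)
    labels = []
    for price in prices:
        if price < mid_price - band:
            labels.append("low")
        elif price > mid_price + band:
            labels.append("high")
        else:
            labels.append("mid")
    return labels

sections = baseline_sections([200, 250, 900, 1000, 1100, 2400, 2600])
```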
  • classification may be performed using each product's attribute values to create "baseline” feature vectors that differentiate the three sections. This creates a vector of product attributes and the probability of the attribute occurring in one of the low, mid, and high sections.
  • each product may be classified into its "baseline” low, mid, high section. During this classification, some products that were inside one section based on the product price alone can migrate into another section based on a combination of price and features.
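  • The "baseline" feature vectors can be sketched as per-section attribute frequencies. The sketch assumes the probability in question is the fraction of products in a section that have the attribute; the text could also be read the other way around (the probability of a section given an attribute). The name `baseline_feature_vectors` and the training data are hypothetical.

```python
from collections import Counter, defaultdict

def baseline_feature_vectors(products):
    """Estimate P(attribute | section) for each section.

    `products` is a list of (section, set_of_attributes), where section is
    "low", "mid", or "high". Returns {section: {attribute: probability}}.
    """
    totals = Counter()
    hits = defaultdict(Counter)
    for section, attributes in products:
        totals[section] += 1
        hits[section].update(attributes)
    return {
        section: {attr: n / totals[section] for attr, n in hits[section].items()}
        for section in totals
    }

# Hypothetical price-based training set for digital cameras
training = [
    ("low", {"5MP"}),
    ("low", {"5MP", "zoom"}),
    ("high", {"8MP", "zoom"}),
    ("high", {"8MP"}),
]
vectors = baseline_feature_vectors(training)
```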
  • the prices within each type may be analyzed to produce average, median, high, and low prices for each product type. In one embodiment, some of the boundaries may overlap.
  • An example of a "baseline" feature level graph with overlap is shown in Figure 32.
  • Additionally, a set of product attributes may be established. Additionally, each attribute's affinity with a high, mid, and low product type may be determined. For example, it may be determined that "8MP" is a common feature for a high-end product, but not for a low-end product.
  • the product's feature level may be adjusted in order to determine its "real” feature level.
  • the "real” feature level may be somewhere in a contiguous range from 0 to 1. This "real" feature level may then be used to characterize the product.
  • the adjustment may be performed by giving each product an initial feature value according to which "baseline" feature level it is in. For example, if the "baseline" feature level of the product is "low", then the starting feature value may be 0.15. In another example, if the "baseline" feature level of the product is "mid", then the starting feature value may be 0.5. In still another example, if the "baseline" feature level of the product is "high", then the starting feature value may be 0.85.
  • Furthermore, the feature value of the product may be increased if the product has features that are found in a higher "baseline" feature level. For example, a low-end product with a high-end feature will receive an increase in the feature value for its feature level.
  • the feature value of the product may be decreased if the product is missing a feature that is common in the feature level in which it is located.
  • the feature value obtained after making the aforementioned adjustments may be the "real" feature level of a product. This "real" feature level may be higher or lower than the product's "baseline” feature level.
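  • The adjustment from "baseline" to "real" feature level can be sketched as below. The starting values 0.15, 0.5, and 0.85 come from the text; the fixed adjustment `step` of 0.05 per feature and the name `real_feature_level` are assumptions, since the size of each increase or decrease is not specified.

```python
BASELINE_START = {"low": 0.15, "mid": 0.5, "high": 0.85}

def real_feature_level(baseline, higher_level_features, missing_common_features, step=0.05):
    """Adjust the starting value for a baseline level and clamp to [0, 1].

    `higher_level_features` counts features typical of a higher level that
    the product has; `missing_common_features` counts features common at
    its own level that it lacks. The fixed `step` is an assumption.
    """
    value = BASELINE_START[baseline]
    value += step * higher_level_features
    value -= step * missing_common_features
    return min(1.0, max(0.0, value))

# A low-end product with two high-end features moves up from 0.15
upgraded = real_feature_level("low", 2, 0)
```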
  • an optimal feature level computation may be used that is independent of the price ranges.
  • the basic steps in this process may include approximating the feature to price distribution for each individual feature, and then computing a single final feature value for the product based on its specified features. In this way, a better approximation to the actual feature to price distribution of all products in the category is relied on.
  • the products of the category may be displayed on a feature-price chart 3500, as shown in Figure 35.
  • the feature-price chart 3500 includes a feature level indicator
  • the feature-price chart 3500 includes a price indicator 3504 which indicates whether a particular product is low-priced, mid-priced, or high-priced.
  • 3500 may be centered along the diagonal axis from (low feature, low price) to (high feature, high price).
  • the diagonal axis may represent the probability of total feature value for a particular price point.
  • Products within this area are the average low-end, mid-end, and high-end products. This is natural because the feature-level and the price-level are created based on the price. However, since each product is further adjusted using its individual features against the likely features of various feature levels, the products may appear scattered when plotted on the feature-price chart 3500.
  • one or more products may occur outside of the diagonal axis.
  • if the product is located close to the axis, it may likely be a fair value. If the product is located at a higher point above the axis for a particular price range, it may be a better value within that price range.
  • One analysis of the significance of product location on a feature-price chart is shown in Figure 36.
  • the product may include the latest technologies, may be a well known name brand, may be a professional consumer product, and/or may include any other characteristic considered to be "high-end.”
  • the product may include an average brand name, popularity, quality, and/or may include any other characteristic considered to be "average." Further still, if the product is located near location 3606, the product may include a prominent brand name, high popularity and/or fashion, high quality, and/or may include any other characteristic considered to accompany a high priced product with fewer features when compared to the competition.
  • In addition, if the product is located near location 3608, the product may serve a purpose as a secondary product or a product for children, may be larger and/or heavier than the competition, may serve as a gift item, and/or may include any other characteristic considered to accompany a low priced product with a small amount of features when compared to the competition.
  • the product may be on sale, may be from a previous generation of products, may come from an unknown or second tier manufacturer, and/or may include any other characteristic considered to accompany a lower priced product with more features when compared to the competition.
  • a second display may accompany a feature-price chart.
  • An example of such a display is found in display 3700 in Figure 37.
  • the display 3700 may include a summary of the average features for one or more predetermined categories.
  • the display 3700 may include a summary of the average features for the low, mid, and high-end products shown on the feature-price chart
  • the display 3700 may include one or more links to additional features available to the user.
  • the display 3700 may include a link to select further preferences in order to narrow search criteria and reduce the amount of products displayed on the feature-price chart 3500.
  • the display 3700 may include a summary of one or more products shown on the feature-price chart 3500.
  • the display 3700 may include a list of images representing products displayed on the feature-price chart 3500. In one embodiment, these products may be shown in more detail in a separate display.
  • the probability of the occurrence of various individual features for a particular product at a particular price point may be displayed. In this way, a standard may be set for what to expect for a particular product in the market today.
  • the visual representation of the products in relation to each other based on the value of the products in relation to each other may be output in any manner.
  • the visual representations may be presented on a plot of features vs. product price.
  • data may be output which corresponds to an additional visual representation indicating whether the product has at least one of a larger feature set, a smaller feature set and a comparable feature set relative to the other products.
  • a simple description may be used in addition to a box diagram (for a box diagram example, see box diagram 3000 of Figure 30).
  • the box diagram may serve to interest the user in looking at one or more product facts.
  • the description may explain the meaning of the box in simple terms.
  • An example of a description including a simple list 3800 is shown in Figure 38. As shown, the simple list 3800 contains information about 2 products.
  • the explanations 3802A-B for the illustrated products may be produced using one or more factors, including, but not limited to popularity, brand, quality, etc.
  • for popularity, the number of expert/user reviews may be counted.
  • brand can be editorially created.
  • quality may be obtained from other sources that performed a survey of product quality.
  • the explanations 3802A-B may be automatically generated for every product based on an algorithm. This may be done by establishing a mapping between a characteristic of the product and elements of that characteristic. For example, the "high price” characteristic may be mapped to elements such as "name brand,” “most popular,” “high fashion,” etc.
  • each product may have its own table of mapped characteristics, as described above.
  • An algorithm may then generate a text description for each characteristic using the product feature level and the description mapping.
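  • A very small version of such a generator is sketched below. The "high price" mapping reuses the elements named in the text; the sentence template, the `applicable` argument, and the function name `describe` are hypothetical.

```python
DESCRIPTION_MAP = {
    "high price": ["name brand", "most popular", "high fashion"],
}

def describe(product_name, characteristic, applicable, mapping=DESCRIPTION_MAP):
    """Build one sentence from the mapped elements that apply to a product.

    `applicable` is the set of elements known to hold for this product.
    """
    elements = [e for e in mapping.get(characteristic, []) if e in applicable]
    if not elements:
        return f"{product_name}: {characteristic}."
    return f"{product_name}: {characteristic} ({', '.join(elements)})."

text = describe("TV-1", "high price", {"name brand", "most popular"})
```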
  • data may be output corresponding to an additional visual representation that indicates whether the product is at least one of a good value, a bad value and a comparable value relative to the other products.
  • the explanations 3802A-B may include one or more symbols to indicate the nature of one or more characteristics of the product or the overall product itself.
  • the explanations 3802A-B may include a "thumbs up" to indicate a good value, a "thumbs down" to indicate a high price, a "sideways thumb" to indicate a fair price, etc.
  • the explanations 3802A-B may include one or more links to additional information.
  • the explanations 3802A-B may include a link to more information regarding the products displayed, a link to check available prices from one or more sellers of the products, etc.
  • a ranking and/or a listing may be displayed for a particular category of products.
  • products within the category may be listed based on a characteristic. For example, the top five televisions with 40 inch screens may be displayed in order of popularity. In another example, the top five televisions sold by a particular manufacturer may be displayed in order of value. Of course, however, any type of ranking and/or listing may be used.
  • the ranking and/or listing may be accomplished by selecting a subset of a box diagram and ordering the products by particular criteria. For example, the value of the products may be organized by ranking the products by their distance from the diagonal axis of the box diagram.
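One way to read "distance from the diagonal axis" is sketched below. It assumes that price and feature level are both normalized to the range 0 to 1, that the diagonal is the line where the two are equal, and that products above the diagonal (more features than the price suggests) are better values. The function names and the `tolerance` threshold are hypothetical.

```python
import math


def value_score(price, features):
    """Signed distance of a product from the diagonal price == features.

    Both inputs are assumed to be normalized to [0, 1]. Positive scores lie
    on the "more features for the price" side of the diagonal.
    """
    return (features - price) / math.sqrt(2)


def rank_by_value(products):
    """Order products from best value to worst value."""
    return sorted(
        products,
        key=lambda p: value_score(p["price"], p["features"]),
        reverse=True,
    )


def value_symbol(score, tolerance=0.05):
    """Map a score to one of the symbols mentioned for the explanations."""
    if score > tolerance:
        return "thumbs up"       # good value
    if score < -tolerance:
        return "thumbs down"     # high price for the feature set
    return "sideways thumb"      # fair price


products = [
    {"name": "A", "price": 0.8, "features": 0.4},
    {"name": "B", "price": 0.3, "features": 0.7},
    {"name": "C", "price": 0.5, "features": 0.5},
]
for p in rank_by_value(products):
    print(p["name"], value_symbol(value_score(p["price"], p["features"])))
```

Selecting a subset of the box diagram before ranking amounts to filtering `products` first and then calling `rank_by_value` on the result.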
  • each of the products may be assigned to at least one group based on the price and feature set of the products, and data corresponding to a visual representation indicative of the grouping may be output.
  • grouping may include separation of the products into such things as: high end, midrange, low end; best values overall, worst values overall; and may also take into account other factors such as market buzz (e.g., "what's hot"), etc. It should also be noted that products may fall into more than one grouping in some embodiments.
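The grouping could be sketched as below. All thresholds, the `popularity` input, and the way tiers are derived from the average of price and feature level are assumptions made for illustration; the text only states that such groups exist and that a product may belong to more than one.

```python
def assign_groups(price, features, popularity=0.0):
    """Return the set of groups a product falls into.

    `price`, `features`, and `popularity` are assumed to be normalized to
    [0, 1]. A product may land in several groups at once.
    """
    groups = set()

    # Tier by overall position in the price/feature space (hypothetical rule).
    level = (price + features) / 2
    if level >= 2 / 3:
        groups.add("high end")
    elif level >= 1 / 3:
        groups.add("midrange")
    else:
        groups.add("low end")

    # Value groups based on the gap between feature level and price.
    if features - price > 0.2:
        groups.add("best value")
    elif price - features > 0.2:
        groups.add("worst value")

    # Market buzz as a stand-in for "what's hot".
    if popularity >= 0.8:
        groups.add("what's hot")

    return groups


print(sorted(assign_groups(0.3, 0.7, popularity=0.9)))
```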
  • the product snapshot may be utilized for product research. In one embodiment, after one or more products are classified into locations on the snapshot box, one or more of the following views may be produced. Of course, however, any other views that can be created based on the products may be produced.
  • Show most popular products
  • the products for which data is output may be determined to be the most popular products in a larger set of products.
  • most popular products across several product classes may be highlighted.
  • a user interface may provide mouse over functionality which pops up product details when a mouse icon hovers over a particular product. In this way, a user may be given a quick comparison of where the most popular products are, or may be given a sense of the price feature differences between the most popular products. An example of this functionality is shown in a product snapshot 3900 in Figure 39.
  • Manufacturer type
  • the product snapshot may be utilized in order to show what kind of product a manufacturer makes.
  • the kind of product may be organized from high end to low end.
  • An example of this functionality is shown in a product snapshot manufacturer grouping 4000 in Figure 40.
  • This product snapshot may help a consumer to choose a product by manufacturer by illustrating the kind of product the particular manufacturer makes, thereby saving the consumer independent research time.
  • the product snapshot illustrating manufacturer type may be utilized for marketing.
  • the product snapshot may be used to assist in analyzing competitors.
  • the result of the filtering may be shown visually.
  • a graph 4102 illustrates a set of products matching particular criteria. If another condition is added (for example, the condition that the product contain "2 HDMI ports"), the graph 4102 is updated to a graph 4104, where only 4 products satisfy the new condition. In this way, a user searching for a product with one or more particular features, a particular price, etc. may narrow down the number of products available by those criteria.
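The filtering step can be sketched as a list of predicates that grows as the user adds conditions. The product records, field names (`price`, `hdmi_ports`), and the reading of "2 HDMI ports" as "at least two" are illustrative assumptions.

```python
def filter_products(products, conditions):
    """Return the products that satisfy every condition.

    Each condition is a function that takes a product and returns True/False.
    """
    return [p for p in products if all(cond(p) for cond in conditions)]


# Hypothetical catalog; the numbers are made up for the example.
tvs = [
    {"name": "A", "price": 900, "hdmi_ports": 2},
    {"name": "B", "price": 700, "hdmi_ports": 1},
    {"name": "C", "price": 1500, "hdmi_ports": 2},
    {"name": "D", "price": 950, "hdmi_ports": 3},
]

conditions = [lambda p: p["price"] <= 1000]
first_graph = filter_products(tvs, conditions)

# Adding one more condition narrows the set that would be redrawn.
conditions.append(lambda p: p["hdmi_ports"] >= 2)
second_graph = filter_products(tvs, conditions)

print([p["name"] for p in first_graph], [p["name"] for p in second_graph])
```

Each call returns the set that the updated graph would plot, so the display can be redrawn after every added condition.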
  • Drill down or focus search
  • user input specifying a subset of the products may be received, and data corresponding to a visual representation of the subset of products may be output.
  • the subset of products may all contain a particular product attribute.
  • the user input may include selection of at least one of a price range, a feature set, and a manufacturer of the products.
  • the user may be able to focus on a particular region, product, etc. on the snapshot diagram.
  • additional information may be made available from within the snapshot diagram. For example, a link to a manufacturer's product page may be made available when a particular product is chosen.
  • user input requesting output of information about at least one additional product having some user-selected relationship to one of the products may be received. For example, products similar to a chosen product may be highlighted when the chosen product is selected (e.g., a square may form around all similar products on the snapshot diagram, etc.). Additionally, key attributes of the similar products may be displayed.
  • the user may select "show me better products", "show me comparable products", "show me products with a comparable feature set and lower price", etc.
  • products that are determined to be "better," "similar," "cheaper," etc. may be determined and displayed.
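A rough sketch of how such relationship queries might be answered is given below. The relation names, the `tolerance` used to decide what counts as comparable, and the normalized `price` and `features` fields are all assumptions; the text does not define how "better" or "similar" is computed.

```python
def related_products(chosen, products, relation, tolerance=0.1):
    """Return products that stand in the requested relation to `chosen`.

    `price` and `features` are assumed to be normalized to [0, 1].
    """
    others = [p for p in products if p is not chosen]

    def close(a, b):
        return abs(a - b) <= tolerance

    if relation == "better":
        return [p for p in others if p["features"] > chosen["features"]]
    if relation == "comparable":
        return [
            p for p in others
            if close(p["features"], chosen["features"])
            and close(p["price"], chosen["price"])
        ]
    if relation == "comparable features, lower price":
        return [
            p for p in others
            if close(p["features"], chosen["features"])
            and p["price"] < chosen["price"]
        ]
    raise ValueError(f"unknown relation: {relation}")


x = {"name": "X", "price": 0.5, "features": 0.5}
catalog = [
    x,
    {"name": "Y", "price": 0.3, "features": 0.55},
    {"name": "Z", "price": 0.9, "features": 0.9},
    {"name": "W", "price": 0.55, "features": 0.45},
]
print([p["name"] for p in related_products(x, catalog, "comparable")])
```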
  • the product snapshot diagram may be a way of navigating the product space.
  • One advantage of this kind of navigation is that it lets the user go from all products to a smaller set of products in order to perform further detailed price or feature research. This is a unique approach compared to the traditional directory hierarchy or attribute based search.
  • each stage of the navigation may create criteria which narrow the number of products to be shown in the snapshot diagram.
  • the snapshot diagram may provide an instant comparison of the products, which allows the user to select the next set of criteria.
  • snapshot diagram 4300 displays all products within a particular category, with the most popular products highlighted. If the user wants to view only products from a particular manufacturer, the display may be refined, as illustrated in snapshot diagram 4302. If the user then wants to view only mid-priced products from the manufacturer, the display may be further refined, as illustrated in snapshot diagram 4304. The remaining displayed products may then be considered as purchase candidates. For example, the user may perform more detailed comparisons amongst the products with respect to price, feature, etc.
  • a product snapshot may be computed for a specific subset of product features in order to cater to specific market segments.
  • a total cost of ownership (TCO) snapshot may be based on a small subset of attributes such as type and frequency of replacement of consumables, content, accessories, etc.
  • a GI (Green Index) snapshot may be computed from attributes such as energy efficiency, types of battery, recycling, rechargeability, wattage, etc.
  • the Green Index may be computed independent of the price and plotted against the price.
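As an illustration of an index computed independently of price and then plotted against it, the sketch below uses a weighted average of a few attributes. The attribute names, the weights, and the assumption that each attribute is already scored between 0 and 1 are hypothetical; the text does not specify a formula.

```python
# Hypothetical weights for attributes that feed the Green Index.
GREEN_WEIGHTS = {"energy_efficiency": 5, "rechargeable": 2, "recyclable": 3}


def green_index(attrs, weights=GREEN_WEIGHTS):
    """Weighted average of attribute scores; price is not an input.

    `attrs` maps attribute names to scores in [0, 1]; missing attributes
    count as 0.
    """
    total = sum(weights.values())
    return sum(w * attrs.get(name, 0.0) for name, w in weights.items()) / total


def green_snapshot(products):
    """Return (price, green index) points for plotting."""
    return [(p["price"], green_index(p["attrs"])) for p in products]


points = green_snapshot([
    {"price": 199, "attrs": {"energy_efficiency": 1.0, "rechargeable": 1.0, "recyclable": 1.0}},
    {"price": 99, "attrs": {"energy_efficiency": 0.5}},
])
print(points)
```

Swapping in a different weight table would give a snapshot for another market segment, such as total cost of ownership, without changing the plotting step.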
  • an energy value snapshot may be computed.
  • dynamic snapshots may also be provided, in which the user can select the set of attributes they are interested in. Various products can then be compared through this snapshot based only on the features selected by the user. For example, the products within a certain category may be ranked only based on the attributes selected by the user.
  • one or more product snapshots may be monitored over a predetermined or infinite period of time. As a result of this monitoring, a series of graphs may be collected based on the time series of the product snapshots.
  • This series of graphs may be analyzed in order to derive more information from the product snapshots.
  • product snapshots may be monitored in order to determine how long a particular product has remained a best value within its category. This determination may in turn be illustrated in a time based product snapshot.
  • a "bestseller list" may be determined for a particular category for a predetermined time period.
  • time based product snapshots may be updated in real time.
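The time-series analysis can be sketched with a simple streak count over stored snapshots. The representation of the history as `(date, best_value_product)` pairs and the function name `best_value_streak` are assumptions for illustration.

```python
from datetime import date


def best_value_streak(history, product):
    """Count the most recent consecutive snapshots in which `product` was
    the best value in its category.

    `history` is a chronologically ordered list of (date, product_name)
    pairs, one per monitored snapshot.
    """
    streak = 0
    for _, best in reversed(history):
        if best != product:
            break
        streak += 1
    return streak


history = [
    (date(2008, 1, 1), "A"),
    (date(2008, 1, 2), "B"),
    (date(2008, 1, 3), "B"),
]
print(best_value_streak(history, "B"))
```

Appending a new pair to `history` each time a snapshot is recomputed keeps the count current, which is one simple way to support updates in real time.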
  • various embodiments of the invention discussed herein are implemented using the Internet as a means of communicating among a plurality of computer systems.
  • One skilled in the art will recognize that the present invention is not limited to the use of the Internet as a communication medium and that alternative methods of the invention may accommodate the use of a private intranet, a Local Area Network (LAN), a Wide Area Network (WAN) or other means of communication.
  • various combinations of wired, wireless (e.g., radio frequency) and optical communication links may be utilized.

Abstract

According to one embodiment, the present invention relates to a method for parsing and indexing an unstructured or semi-structured document, comprising: receiving an unstructured or semi-structured document; converting the document into one or more text streams; analyzing the one or more text streams to identify textual contents of the document; analyzing the one or more text streams to identify logical sections of the document; associating the textual contents with the logical sections of the document; indexing the textual contents and their association with the logical sections; and storing a result of the indexing in a storage device. The invention also relates to other systems and methods.
PCT/US2008/004545 2007-04-16 2008-04-08 Unstructured and semi-structured document processing and searching, and generation of value-based information WO2008130501A1 (fr)

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US91210807P 2007-04-16 2007-04-16
US60/912,108 2007-04-16
US11/737,660 2007-04-19
US11/737,660 US8504553B2 (en) 2007-04-19 2007-04-19 Unstructured and semistructured document processing and searching
US11/737,684 2007-04-19
US11/737,668 US8290967B2 (en) 2007-04-19 2007-04-19 Indexing and search query processing
US11/737,668 2007-04-19
US11/737,684 US7917493B2 (en) 2007-04-19 2007-04-19 Indexing and searching product identifiers
US11/963,684 US20080255925A1 (en) 2007-04-16 2007-12-21 Systems and methods for generating value-based information
US11/963,684 2007-12-21

Publications (1)

Publication Number Publication Date
WO2008130501A1 (fr) 2008-10-30

Family

ID=39875788

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/004545 WO2008130501A1 (fr) Unstructured and semi-structured document processing and searching, and generation of value-based information 2007-04-16 2008-04-08

Country Status (1)

Country Link
WO (1) WO2008130501A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030220915A1 (en) * 2000-04-24 2003-11-27 Lawrence Fagan System and method for indexing electronic text
US20040247206A1 (en) * 2003-02-21 2004-12-09 Canon Kabushiki Kaisha Image processing method and image processing system
US20060089947A1 (en) * 2001-08-31 2006-04-27 Dan Gallivan System and method for dynamically evaluating latent concepts in unstructured documents

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017017678A1 (fr) * 2015-07-27 2017-02-02 Opisoft Care Ltd. System and method for searching for a phrase in a document section
WO2021260684A1 (fr) * 2020-06-21 2021-12-30 Avivi Eliahu Kadoori System and method for detecting and self-validating key data in any non-handwritten document
US11663215B2 (en) 2020-08-12 2023-05-30 International Business Machines Corporation Selectively targeting content section for cognitive analytics and search
US11030387B1 (en) * 2020-11-16 2021-06-08 Issuu, Inc. Device dependent rendering of PDF content including multiple articles and a table of contents
US11416671B2 (en) 2020-11-16 2022-08-16 Issuu, Inc. Device dependent rendering of PDF content
US11449663B2 (en) * 2020-11-16 2022-09-20 Issuu, Inc. Device dependent rendering of PDF content including multiple articles and a table of contents
US11775733B2 (en) 2020-11-16 2023-10-03 Issuu, Inc. Device dependent rendering of PDF content including multiple articles and a table of contents
US11842141B2 (en) 2020-11-16 2023-12-12 Issuu, Inc. Device dependent rendering of PDF content
WO2022166828A1 (fr) * 2021-02-03 2022-08-11 易保网络技术(上海)有限公司 Data indexing method and system, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08742659

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08742659

Country of ref document: EP

Kind code of ref document: A1