WO2006000660A2 - Dynamic method for automatically putting on-line extracts from paper document holdings - Google Patents
Dynamic method for automatically putting on-line extracts from paper document holdings
- Publication number
- WO2006000660A2 (PCT/FR2005/001092)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- documents
- dynamic method
- user
- image
- search
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
Definitions
- The invention relates to a dynamic method for automatically putting online, on networked user stations, a selection of extracts from a collection of documents available on paper.
- The invention concerns the field of putting large paper holdings online or, more generally, documents for which no source file is available.
- More particularly, the invention concerns the consultation over public networks of documents that are "heavy" in terms of file size, for example larger than one megabyte. Such documents (printed matter or press, catalogs, communication documents, photographs, plans, maps, etc.) are usually difficult to transfer over networks.
- Another object of the invention is to allow secure consultation of documents.
- State-of-the-art search engines essentially operate on information supplied as office documents or in "PDF Text" format generated from those same office tools.
- Character recognition products, for example optical character recognition (OCR), hide the text information behind the image. The text can then be indexed in an engine, and the entire image is produced when it meets the search criterion.
- The image is always presented as a whole: the document must be opened in its entirety to check whether it is actually relevant, which makes going through the search results very laborious.
- WAN response times are very slow because the image itself is manipulated, which involves large files.
- Only limited intranet applications on very fast networks have been able to emerge. It therefore appears difficult, if not impossible, to display the result of character recognition to the user over wide area networks with acceptable quality, especially for color documents.
- Search result lists are not really usable because the context of the searched word is not provided.
- The access time to documents is prohibitive on wide area networks.
- The invention proposes an operating sequence that overcomes these shortcomings, in particular to allow documents to be consulted under optimized conditions in terms of access time, selectivity and quality of access.
- The approach taken by the invention consists in building a search engine able to exploit correctly, that is to say without destructuring it, the text in the image of documents supplied in their final form, namely on paper, by direct extraction as a thumbnail.
- The subject of the invention is a dynamic method for automatically putting online, on networked user stations, a collection of documents available on paper, consisting of: (i) setting up an industrial production line that applies, to pages from digitized or digital documents, processing to improve document quality and to extract the text information and its geolocation, followed by indexing and compression of these pages; (ii) presenting to the user who has submitted at least one word to the search engine, which searches the full-text information and associated metadata, a result list in the form of dynamically generated thumbnails centered on the searched word(s), thus isolating a context of use within the page according to a given mode; and (iii) providing, through a plug-in, an accelerated presentation of the page to the user for reading, whatever its resolution.
- The plug-in is a viewing plug-in for documents of any kind, for example an image or a composite document.
- The operating sequence of the invention thus avoids having to systematically open every document proposed by the search engine, and respects the waiting time tolerated by a user, which statistically does not exceed 5 seconds, when accessing documents, especially large ones.
- The production line is fed with files from paper scanning and/or with PDF or office digital files; the search results are sorted according to the font of at least one searched keyword, providing a function equivalent to a search by title; filtering is carried out on types of descriptive metadata fields, such as dates, document titles, themes, headings, advertising messages, etc., defined and filled in beforehand; the image compression is of the progressive pyramidal type; the presentation mode of the thumbnails is selected from short thumbnails, long thumbnails, and a mixed mode in which the thumbnail is associated with a reduced representation of the whole page; an image encryption function is performed; interactivity functions in the plug-in make it possible to make areas of the image sensitive so that they refer to hyperlinks, or to graphically select an area of the image; documentary tools in the search engine help the user work with the retrieved document collection, such as "my documents" to build thematic files and "my alerts" to notify the user as soon as a new document matches the search query.
- The figures show: Figure 1, a document search by entering a keyword; Figures 2 to 4, different modes of presenting the search result; Figure 5, the display of the chosen page by the plug-in; and Figures 6 and 7, two documentary tools for working with the document collection.
- The search engine is similar to a "Google"-type engine in that it retains the same simplicity of use and, for each element of the result list, a contextual presentation of the word found.
- The fundamental difference lies, as indicated in the introduction above, in the principle of directly materializing this context by extracting thumbnails from the document.
- The engine is fed by an industrial production line that performs, on files from the scanning of paper documents: - qualitative processing to improve the image (deskewing, cropping, gamma correction, deblurring, pairing of right and left pages, etc.); - extraction of the text with an OCR tool; - geolocation of the text information, namely the position on the page of each character; - selective analysis of the information contained in the text pages of the document by character recognition in order to extract identification metadata, in the example the headings (date, title, theme, topic, advertising messages, etc.); - full-text indexing of the document and of the document metadata, performed by a known indexing engine; - compression and encryption of the documents, as detailed below.
- This line is fed in particular with files produced by scanning paper on a high-speed scanner.
- This background line, or "back office", is highly automated, which keeps the cost price very low.
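The geolocation and indexing steps of the production line can be illustrated with a minimal sketch. It assumes the OCR stage emits one record per word with a bounding box in page pixels (the description speaks of per-character positions; words are used here for brevity). The `WordBox` record, the `build_index` function and the sample data are hypothetical names and values introduced for illustration, not part of the described system, which relies on a known indexing engine.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCR word with its position on the page (hypothetical record layout)."""
    text: str
    doc_id: str
    page: int
    x: int       # left edge, in page pixels
    y: int       # top edge, in page pixels
    width: int
    height: int


def build_index(words):
    """Map each lower-cased word to every location where it occurs."""
    index = defaultdict(list)
    for word in words:
        index[word.text.lower()].append(word)
    return dict(index)


# Illustrative OCR output for two scanned documents.
ocr_output = [
    WordBox("Porsche", "catalog-1", 3, 1000, 2000, 200, 50),
    WordBox("engine", "catalog-1", 3, 1220, 2000, 160, 50),
    WordBox("PORSCHE", "press-7", 1, 300, 120, 900, 180),
]
index = build_index(ocr_output)
hits = index.get("porsche", [])
print([(hit.doc_id, hit.page) for hit in hits])
```

Keeping the bounding box next to each indexed occurrence is what later allows a result to be shown as a crop of the page rather than as a text snippet.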
- The full-text search is started by entering the searched keyword 20, for example the word "Porsche", as illustrated in Figure 1.
- The presentation is "dynamic" in that it is performed with a thumbnail display tool called "Image Context", as follows: - the search covers the full-text information extracted from the document by the production line; the search result is then presented to the user as successive thumbnails generated dynamically from the geolocation information of the word in each page containing it, taking into account the appropriate zoom factor, the thumbnails being centered on the most relevant word 20 of the page; - the user can then quickly work through a result list without opening each document, which saves significant time and effort. Moreover, since a glance is enough to reject responses that are clearly unrelated to the real subject of the search, the user does not need to be an expert in document searching.
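How a thumbnail might be centered on a found word can be sketched as follows. This is only an illustration under stated assumptions: the word position is a pixel bounding box `(x, y, width, height)`, the zoom factor is the ratio of thumbnail pixels to page pixels, and the crop is shifted back inside the page when the word lies near an edge. The function name `thumbnail_window` and all numeric values are hypothetical.

```python
def thumbnail_window(box, page_size, thumb_size, zoom=1.0):
    """Return the (left, top, right, bottom) page region to crop for a thumbnail.

    box        -- (x, y, width, height) of the found word, in page pixels
    page_size  -- (width, height) of the full page, in pixels
    thumb_size -- (width, height) of the thumbnail shown to the user
    zoom       -- thumbnail pixels per page pixel (below 1.0 shows more context)
    """
    x, y, width, height = box
    page_w, page_h = page_size
    thumb_w, thumb_h = thumb_size

    # Size of the page region that fills the thumbnail at this zoom.
    crop_w = min(page_w, int(thumb_w / zoom))
    crop_h = min(page_h, int(thumb_h / zoom))

    # Center the region on the middle of the word.
    center_x = x + width // 2
    center_y = y + height // 2
    left = center_x - crop_w // 2
    top = center_y - crop_h // 2

    # Shift the region back inside the page if the word is near an edge.
    left = max(0, min(left, page_w - crop_w))
    top = max(0, min(top, page_h - crop_h))
    return (left, top, left + crop_w, top + crop_h)


page = (2480, 3508)  # illustrative page size in pixels
print(thumbnail_window((1000, 2000, 200, 50), page, (400, 200)))
print(thumbnail_window((0, 0, 100, 40), page, (400, 200)))
```

Only the returned region needs to be extracted and scaled, so the result list can be built without transferring whole pages.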
- The dynamic search engine relies on various commercially available text search engines, whose capabilities it exploits: relevance, fuzzy search, handling of dates, etc.
- Two additional features are advantageously integrated into the dynamic search engine: - sorting of the search result according to the font of the keyword searched and found, which also makes it possible to search titles; - filtering on types of headings defined and filled in beforehand.
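The two additional features can be sketched in a few lines. The sketch assumes that the font of a found keyword is approximated by the height of its OCR bounding box, so that larger occurrences (typically titles) come first; the description does not say how the font is measured, so this proxy is an assumption. The field names (`height`, `theme`) and the sample hits are hypothetical.

```python
def sort_by_font_size(hits):
    """Order hits so that the largest occurrences of the keyword come first."""
    return sorted(hits, key=lambda hit: hit["height"], reverse=True)


def filter_by_field(hits, field, value):
    """Keep only the hits whose metadata field equals the requested value."""
    return [hit for hit in hits if hit.get(field) == value]


# Illustrative hits for one keyword, with pre-filled metadata.
hits = [
    {"doc": "catalog-1", "height": 50, "theme": "cars"},
    {"doc": "press-7", "height": 180, "theme": "cars"},
    {"doc": "ad-12", "height": 90, "theme": "advertising"},
]
ranked = sort_by_font_size(hits)
cars_only = filter_by_field(ranked, "theme", "cars")
print([hit["doc"] for hit in ranked])
print([hit["doc"] for hit in cars_only])
```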
- The invention uses an access activation program, or viewing "plug-in", based on progressive pyramidal image compression.
- The image document is made accessible from the user's computer through the plug-in (Figure 5).
- This consultation tool uses the images previously compressed by the production line. The compression software first cuts them into hierarchical tiles of different resolutions; the plug-in then manages the requests to the image server and displays only the portion of image 50 needed for the screen display. Concretely, the plug-in fetches from the server only the information needed for the display and does not wait until all of it has been retrieved before starting to display.
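The tile selection described above can be sketched as follows. The description does not specify the tile size or the ratio between pyramid levels; this illustration assumes square tiles of 256 pixels and a halving of the resolution at each level, which are common choices but assumptions here. The function names are hypothetical.

```python
def pyramid_levels(width, height, tile=256):
    """List the image size at each level, halving until one tile covers the image."""
    levels = [(width, height)]
    while max(levels[-1]) > tile:
        w, h = levels[-1]
        levels.append(((w + 1) // 2, (h + 1) // 2))  # round up when halving
    return levels


def tiles_for_viewport(viewport, tile=256):
    """Return the (column, row) tiles that intersect a viewport.

    viewport -- (left, top, right, bottom) in pixels of the chosen level,
                with right and bottom exclusive
    """
    left, top, right, bottom = viewport
    first_col, last_col = left // tile, (right - 1) // tile
    first_row, last_row = top // tile, (bottom - 1) // tile
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


print(pyramid_levels(2480, 3508))
print(tiles_for_viewport((300, 100, 700, 400)))
```

In this example a 400 by 300 pixel viewport touches only four tiles, so a viewer following this scheme would request those four tiles and could start drawing each one as it arrives.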
- The added value of the plug-in lies mainly: (i) in its activation at the network layer, which makes it possible to implement different server request strategies to adapt to the bandwidth of the network used (PSTN dial-up, ADSL, very high-speed broadband), and (ii) in the technical implementation of the compression mechanisms, which use only the CPU power of the user station, thus making it possible to serve a large number of user stations from the same server.
- The plug-in offers simplified ergonomics and works entirely in memory: no file, temporary or permanent, is written to the user's computer.
- Interactivity functions in the plug-in make it possible: - to emphasize zones 39 of the image 50 by highlighting; - to make areas 51 sensitive, on which the user can perform an action; the preprocessing by the production line thus makes it possible to generate hypertext links, for example a link to network addresses such as www.societe.com; - to graphically select an area of the image in order to correct the OCR or, more generally, to perform any type of action.
- An encryption function is applied to the header of the image using 128-bit polynomial algorithms. Images with encrypted headers provide better protection against hacking.
- Built-in mechanisms pair the documents with their server. Documents fraudulently downloaded from their operating server to another machine are therefore unusable.
- The engine offers the user documentary tools to make it easier to work with the retrieved document collection: - "My documents" 60 (Figure 6) allows the user to build thematic files, which can optionally be shared; - "My alerts" 70 (Figure 7) allows the user to be notified when a new document matches a search query previously defined with the engine.
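The "My alerts" tool can be illustrated with a minimal sketch, assuming that a saved query is a list of terms that must all appear in the text of a newly indexed document. The real engine would delegate matching to its text search engine; the function `alerts_to_notify`, the user names and the sample data are hypothetical.

```python
def alerts_to_notify(saved_queries, document_words):
    """Return the users whose saved query is fully matched by a new document.

    saved_queries  -- mapping of user name to the list of terms in the saved query
    document_words -- words extracted from the newly indexed document
    """
    words = {word.lower() for word in document_words}
    return [
        user
        for user, terms in saved_queries.items()
        if all(term.lower() in words for term in terms)
    ]


saved_queries = {
    "alice": ["Porsche"],
    "bob": ["Porsche", "cabriolet"],
    "carol": ["Ferrari"],
}
new_document = ["New", "Porsche", "model", "unveiled"]
print(alerts_to_notify(saved_queries, new_document))
```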
- the invention is not limited to the example described and claimed.
- The image compression plug-in may use a different compression technique, based on other algorithms such as CCITT4 or JBIG.
- The text search engine can integrate different functionalities, for example different linguistic techniques or fuzzy logic.
- The presentation of the thumbnails can be varied by generalizing the presentation mode function.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0405588A FR2870616B1 (en) | 2004-05-24 | 2004-05-24 | DYNAMIC METHOD FOR AUTOMATICALLY SETTING EXTRACTS OF PAPER DOCUMENT FUNDS |
FR0405588 | 2004-05-24 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2006000660A2 (en) | 2006-01-05 |
WO2006000660A3 (en) | 2006-05-18 |
Family
ID=34944869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FR2005/001092 WO2006000660A2 (en) | 2004-05-24 | 2005-05-02 | Dynamic method for automatically putting on-line extracts from paper document holdings |
Country Status (2)
Country | Link |
---|---|
FR (1) | FR2870616B1 (en) |
WO (1) | WO2006000660A2 (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0596247A2 (en) * | 1992-11-02 | 1994-05-11 | Motorola, Inc. | A full-text index creation, search, retrieval and display method |
WO1999018523A1 (en) * | 1997-10-08 | 1999-04-15 | Caere Corporation | Computer-based document management system |
-
2004
- 2004-05-24 FR FR0405588A patent/FR2870616B1/en not_active Expired - Fee Related
-
2005
- 2005-05-02 WO PCT/FR2005/001092 patent/WO2006000660A2/en active Application Filing
Non-Patent Citations (5)
Title |
---|
GOTTESMAN B ET AL: "Ending the Paper Chase" PC MAGAZINE, A PC COMMUNICATION CORP. NEW YORK, US, 24 October 1995 (1995-10-24), pages 129,131,134,13-,154, XP002091671 ISSN: 0888-8507 * |
LU Y ET AL: "Document retrieval from compressed images" PATTERN RECOGNITION, ELSEVIER, KIDLINGTON, GB, vol. 36, no. 4, April 2002 (2002-04), pages 987-996, XP004398637 ISSN: 0031-3203 * |
MARINAI S ET AL: "A general system for the retrieval of document images from digital libraries" DOCUMENT IMAGE ANALYSIS FOR LIBRARIES, 2004. PROCEEDINGS. FIRST INTERNATIONAL WORKSHOP ON PALO ALTO, CA, USA 23-24 JAN. 2004, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 23 January 2004 (2004-01-23), pages 150-173, XP010681126 ISBN: 0-7695-2088-X * |
SHENGJIN WANG ET AL: "Adaptive data transmission on browsing of scanned documents using JPEG2000" CONFERENCE PROCEEDINGS ARTICLE, 10 July 2002 (2002-07-10), pages 78-83, XP010620992 * |
YUE LU ET AL: "Retrieving imaged documents in digital libraries based on word image coding" DOCUMENT IMAGE ANALYSIS FOR LIBRARIES, 2004. PROCEEDINGS. FIRST INTERNATIONAL WORKSHOP ON PALO ALTO, CA, USA 23-24 JAN. 2004, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 23 January 2004 (2004-01-23), pages 174-187, XP010681127 ISBN: 0-7695-2088-X * |
Also Published As
Publication number | Publication date |
---|---|
FR2870616B1 (en) | 2006-09-15 |
FR2870616A1 (en) | 2005-11-25 |
WO2006000660A3 (en) | 2006-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100972241B1 (en) | Document retrieving apparatus and document retrieving method | |
CN101201840B (en) | Document indexing equipment and method | |
US9224004B2 (en) | Variable user interface based on document access privileges | |
JP5372369B2 (en) | Digital asset management, targeted search, and desktop search using digital watermark | |
US20080115057A1 (en) | High precision data extraction | |
US20100114991A1 (en) | Managing the content of shared slide presentations | |
US20090216734A1 (en) | Search based on document associations | |
US20110060739A1 (en) | System and method to research documents in online libraries | |
FR2681454A1 (en) | METHOD AND DEVICE FOR PROCESSING ALPHANUMERIC AND GRAPHICAL INFORMATION FOR THE CONSTITUTION OF A DATA BANK. | |
FR2973134A1 (en) | METHOD FOR REFINING THE RESULTS OF A SEARCH IN A DATABASE | |
FR2845236A1 (en) | SYSTEMS AND METHODS FOR INSERTING A METADATA LABEL INTO A DOCUMENT | |
WO2000049526A1 (en) | Similarity searching by combination of different data-types | |
US20070150163A1 (en) | Web-based method of rendering indecipherable selected parts of a document and creating a searchable database from the text | |
KR20060101803A (en) | Creating and active viewing method for an electronic document | |
WO2001088749A1 (en) | Method for constituting a database concerning data contained in a document | |
US20110255113A1 (en) | Document Tag Based Destination Prompting and Auto Routing for Document Management System Connectors | |
EP3005171A1 (en) | Method for searching a database | |
Hoffman et al. | The RightPages™ Service: An image‐based electronic library | |
US11295124B2 (en) | Methods and systems for automatically detecting the source of the content of a scanned document | |
US8131752B2 (en) | Breaking documents | |
WO2006000660A2 (en) | Dynamic method for automatically putting on-line extracts from paper document holdings | |
JP5318233B2 (en) | Document search apparatus, document search method, program, and storage medium | |
Ruocco et al. | Event clusters detection on flickr images using a suffix-tree structure | |
Jones et al. | Abstract images have different levels of retrievability per reverse image search engine | |
FR2790846A1 (en) | DOCUMENT IDENTIFICATION PROCESS |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWW | Wipo information: withdrawn in national office |
Country of ref document: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 05763711 Country of ref document: EP Kind code of ref document: A2 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 05763711 Country of ref document: EP Kind code of ref document: A2 |