WO2006000660A2

WO2006000660A2 - Dynamic method for automatically putting on-line extracts from paper document holdings

Info

Publication number: WO2006000660A2
Application number: PCT/FR2005/001092
Authority: WO
Inventors: Philippe Belin
Original assignee: Immanens Sas
Priority date: 2004-05-24
Filing date: 2005-05-02
Publication date: 2006-01-05
Also published as: FR2870616B1; FR2870616A1; WO2006000660A3

Abstract

The invention relates to the consultation, on public networks, of heavy documents and digitised paper documents, using an operating sequence that provides conditions optimised in terms of access time, selectivity and quality of access. To this end, the invention relates to a dynamic method for automatically putting on-line, on wired networked user stations, holdings of documents available on paper. Said method consists of (i) setting up an industrial production line which carries out, on pages from digitised or digital documents, processing to improve document quality and extract text information, the geolocalisation and indexing of that information, and the compression of the pages, (ii) presenting to the user who has submitted to the search engine a query of at least one word (20), bearing on the full-text information and the associated metadata, a results list in the form of thumbnails (31) which are dynamically generated and centred on the searched word(s), thus isolating a context of use in the page according to a given mode, and (iii) providing an accelerated presentation of the page to the user for reading, by means of a plug-in, whatever its resolution.

Description

DYNAMIC METHOD FOR AUTOMATICALLY PUTTING ON-LINE EXTRACTS FROM PAPER DOCUMENT HOLDINGS

The invention relates to a dynamic method for automatically putting on line, on wired networked user stations, a selection of extracts from a collection of documents available on paper.
The invention relates to the field of putting large paper holdings online or, more generally, documents for which no source file is available. It concerns more particularly the consultation, on public networks, of documents that are "heavy" in terms of file size, for example larger than one megabyte. Such documents (printed or press documents, catalogs, communication documents, photographs, plans, maps, etc.) are usually difficult to transfer over networks. Another object of the invention is to allow secure consultation of documents.
State-of-the-art search engines essentially operate on information supplied as office documents or in "PDF Text" format generated from those same office tools. For paper documents, character recognition products, for example optical character recognition ("OCR"), hide the text information behind the image. The text can then be indexed in an engine, and the entire image is produced when it matches the search criterion.
With these tools, however, the image is always presented as a whole: the document must be opened in its entirety to check whether it is actually of interest, which makes going through the search results very laborious. In addition, response times over a WAN are very slow because the image itself is handled, which requires heavy files. As a result, only limited intranet applications on very fast networks have been able to emerge.
It therefore appears difficult, if not impossible, to display the result of character recognition to the user over wide area networks, for quality reasons, especially with color documents. The search result lists are not really usable because the context of the searched word is not provided, and the access time to documents is prohibitive on WANs.
The invention proposes an operating sequence that overcomes these shortcomings, in particular to allow documents to be consulted under optimized conditions in terms of access time, selectivity and quality of access. The approach taken by the invention consists in building a search engine capable of correctly exploiting, that is to say without destructuring it, the text in the image of documents supplied in their final form, namely on paper, by direct extraction as a thumbnail.
More specifically, the subject of the invention is a dynamic method for automatically putting on line, on wired networked user stations, a collection of documents available on paper, consisting in (i) setting up an industrial production line that carries out, on pages from digitized or digital documents, processing to improve document quality and to extract text information, the geolocation and then the indexing of that information, and the compression of these pages, (ii) presenting to the user who has submitted to the search engine a query of at least one word, bearing on the full-text information and the associated metadata, a result list in the form of dynamically generated thumbnails centered on the searched word(s), thus isolating a context of use in the page according to a given mode, and (iii) providing through a plug-in an accelerated presentation of the page to the user for reading, whatever its resolution. The plug-in is a viewing plug-in for documents of any kind, for example an image or a composite document.
The operating sequence of the invention thus avoids the need to systematically open every document proposed by the search engine, and respects the waiting time tolerated by a user, which statistically does not exceed 5 seconds, when accessing documents, especially large ones.
According to particular modes of implementation:
- the line is fed from files obtained by scanning paper and/or from PDF or office digital files;
- the search results are sorted according to the font of at least one searched keyword, providing a function equivalent to a search by title;
- filtering is carried out on types of descriptive metadata fields, such as dates, document titles, themes, headings, advertising messages, etc., defined and filled in beforehand;
- the image compression is of the progressive pyramidal type;
- the presentation mode of the thumbnails is selected from short thumbnails, long thumbnails, and a mixed mode in which the thumbnail is associated with a reduced representation of the page in its entirety;
- an image encryption function is performed;
- interactivity functions in the plug-in make it possible to make areas of the image sensitive so that they refer to hyperlinks, or to graphically select an area of the image;
- documentary tools of the search engine allow a better appropriation of the detected documentary collection, such as "my documents" to build thematic files and "my alerts" to notify the user as soon as a new document is recognized by the search query.
Other advantages and characteristics of the invention will appear on reading the following detailed example of embodiment, with reference to the appended figures, which represent respectively:
- Figure 1, a document search by entering a keyword;
- Figures 2 to 4, different presentation modes for the result of this search;
- Figure 5, the display of the chosen page by the plug-in; and
- Figures 6 and 7, two documentary tools for appropriating the document holdings.
In the detailed example below, the search engine is similar to a "Google"-type engine, in that it adopts the same simplicity of use and, for each item of the result list, a contextual presentation of the word found.
The fundamental difference lies, as indicated in the introduction above, in the principle of directly materializing this context by extracting it from the document as thumbnails. The engine is fed by an industrial production line which carries out, on files obtained by scanning paper documents:
- qualitative processing to improve the image (deskewing, cropping, gamma correction, deblurring, pairing of right and left pages, etc.);
- extraction of the text with an OCR tool;
- geolocation of the text information, namely the geographical position in the page of each character;
- selective analysis, by character recognition, of the information contained in the text pages of the document, in order to extract identification metadata, in the example the headings (date, title, theme, topic, advertising messages, etc.);
- full-text indexing of the document and of the document metadata, performed by a known indexing engine;
- compression and encryption of the documents, as detailed below.
This line is fed in particular with files obtained by scanning paper with a high-speed scanner. This "back office" line is highly automated, which results in a very low cost price.
With this search engine, the full-text search is performed by entering the searched keyword 20, for example the word "Porsche", as illustrated in Figure 1.
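As an illustration of the "geolocation" step above, the following minimal sketch shows one way the position of each recognized word could be stored alongside a full-text index. It is not the implementation described in the patent: the `WordBox` record, the `build_position_index` and `lookup` functions, and the assumption that the OCR tool yields (text, bounding box) pairs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """Hypothetical record: where one recognized word sits on a scanned page."""
    page: int    # page number within the document
    x: int       # left edge, in page pixels
    y: int       # top edge, in page pixels
    width: int
    height: int


def build_position_index(ocr_words):
    """Map each normalized word to the list of boxes where it occurs.

    `ocr_words` is assumed to be an iterable of (text, WordBox) pairs,
    as an OCR tool with word-level coordinates might produce.
    """
    index = defaultdict(list)
    for text, box in ocr_words:
        # Normalize: drop surrounding punctuation and ignore case.
        key = text.strip(".,;:!?\"'()").lower()
        if key:
            index[key].append(box)
    return dict(index)


def lookup(index, word):
    """Return every recorded position of `word` (empty list if absent)."""
    return index.get(word.lower(), [])
```

With such an index, a query for the keyword "Porsche" returns not only the pages that contain the word but also the rectangle it occupies on each page, which is the information the thumbnail display needs.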
The presentation is "dynamic" in that it is carried out, using a thumbnail display tool called "Image Context", as follows:
- the search bears on the full-text information as extracted from the document by the production line;
- the search result is then presented to the user as successive thumbnails generated dynamically from the geolocation information of the word in each page containing it, taking into account the appropriate zoom factor to be applied, the thumbnails being centered on the most relevant word 20 of the page;
- the user can then quickly go through a result list without needing to open each document, which represents a significant saving of time and gain in comfort.
Moreover, since a glance is enough to reject instantly any responses that are clearly unrelated to the real subject of the search, the user does not need to be an expert in documentary research.
Several presentation modes are proposed:
- short thumbnails 31 (Figure 2);
- long thumbnails 32 (Figure 3);
- mixed presentation of the thumbnail 30 associated with a reduced representation of the entire page 40 (Figure 4).
The dynamic search engine relies on various text search engines available on the market, whose capabilities it exploits: relevance, fuzzy search, handling of dates, etc. Optionally, two additional features are advantageously integrated into the dynamic search engine:
- sorting of the search results according to the font of the keyword searched for and found; this feature also makes it possible to search titles;
- filtering on types of headings defined and filled in beforehand.
In order to guarantee response times of less than 5 seconds for remote consultation, the invention uses an access activation program, or viewing "plug-in", relying on progressive pyramidal image compression. The image document is made accessible from the user's computer through the plug-in (Figure 5). This consultation tool exploits the images previously compressed by the production line.
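The dynamic thumbnail generation described above (a window centered on the found word, sized according to a zoom factor) can be sketched as a simple rectangle computation. This is an illustrative assumption rather than the method of the "Image Context" tool: the function name `thumbnail_window`, the tuple layouts, and the choice to shift the window back inside the page when the word lies near an edge are all hypothetical.

```python
def thumbnail_window(word_box, page_size, thumb_size, zoom=1.0):
    """Compute the page region to show in a thumbnail centered on a word.

    word_box   -- (x, y, width, height) of the found word, in page pixels
    page_size  -- (width, height) of the page, in page pixels
    thumb_size -- (width, height) of the thumbnail, in screen pixels
    zoom       -- screen pixels per page pixel (assumed convention)

    Returns (left, top, right, bottom) in page pixels.
    """
    x, y, w, h = word_box
    page_w, page_h = page_size

    # Size of the source region that fills the thumbnail at this zoom,
    # never larger than the page itself.
    src_w = min(page_w, round(thumb_size[0] / zoom))
    src_h = min(page_h, round(thumb_size[1] / zoom))

    # Center the region on the middle of the word.
    left = round(x + w / 2 - src_w / 2)
    top = round(y + h / 2 - src_h / 2)

    # Shift the region back inside the page if the word is near an edge.
    left = max(0, min(left, page_w - src_w))
    top = max(0, min(top, page_h - src_h))

    return (left, top, left + src_w, top + src_h)
```

Under these assumptions, the short and long thumbnail modes would differ only in the `thumb_size` passed in, and the returned rectangle could be handed to any image library's crop function.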
The compressed images are first cut into hierarchical tiles of different definitions by the compression software; the plug-in then manages the requests to the image server and displays only the portion of image 50 needed for the screen display. In concrete terms, the plug-in fetches from the server only the information needed for the display, and does not wait until it has retrieved all of it before starting to display.
The added value of the plug-in lies mainly (i) in its operation at the network layer, which makes it possible to implement different server request strategies to adapt to the bandwidth of the network used (dial-up PSTN, ADSL, very high speed broadband), and (ii) in the technical implementation of the compression mechanisms, which use only the CPU power of the user station, thus making it possible to serve a large number of user stations from the same server. The plug-in offers simplified ergonomics and works entirely in memory: no file, temporary or permanent, is written to the user's computer.
Interactivity functions in the plug-in make it possible:
- to bring out zones 39 of the image 50 by highlighting;
- to make areas 51 sensitive, on which the user can perform an action; the preprocessing of the production line thus makes it possible to generate hypertext links, for example a link to network addresses such as www.societe.com;
- to graphically select an area of the image in order to correct the OCR, or more generally to perform any type of action.
Advantageously, an encryption function is applied to the header of the image using 128-bit polynomial algorithms. Images with an encrypted header provide better protection against hacking. Finally, built-in mechanisms tie the documents to their server, so that documents fraudulently downloaded from their operating server to another machine are unusable.
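The tiling principle described above, in which only the tiles needed for the current screen display are requested, can be illustrated with a short sketch. The 256-pixel tile edge, the halving between levels, and the function names `pyramid_levels` and `tiles_for_viewport` are assumptions made for the example; the patent does not specify the tile layout or the compression format.

```python
import math

TILE = 256  # assumed tile edge, in pixels


def pyramid_levels(width, height, tile=TILE):
    """List the (width, height) of each pyramid level, finest first.

    Each level halves the previous one until the image fits in one tile.
    """
    levels = [(width, height)]
    while levels[-1][0] > tile or levels[-1][1] > tile:
        w, h = levels[-1]
        levels.append((max(1, math.ceil(w / 2)), max(1, math.ceil(h / 2))))
    return levels


def tiles_for_viewport(viewport, level_size, tile=TILE):
    """Return the (column, row) indices of the tiles a viewport touches.

    viewport   -- (left, top, right, bottom) in pixels of the chosen level
    level_size -- (width, height) of that level
    """
    left, top, right, bottom = viewport
    w, h = level_size

    # Clip the viewport to the image; nothing to fetch if it falls outside.
    left, top = max(0, left), max(0, top)
    right, bottom = min(w, right), min(h, bottom)
    if left >= right or top >= bottom:
        return []

    first_col, last_col = left // tile, (right - 1) // tile
    first_row, last_row = top // tile, (bottom - 1) // tile
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]
```

For a 2000 x 3000 pixel page, the full-resolution level holds 96 tiles under these assumptions, while an 800 x 200 pixel viewport touches only 8 of them. A viewer could also request the coarsest level first and refine progressively, which is one way to start displaying before all the data has arrived.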
The engine offers the user documentary tools to facilitate the appropriation of the detected documentary collection:
- "My documents" 60 (Figure 6): allows the user to build thematic files, which he can optionally share;
- "My Alerts" 70 (Figure 7): allows the user to be notified when a new document is matched by a search query he has previously defined with the engine.
The invention is not limited to the example described and claimed. For example, the image compression plug-in may use a different compression technique, through the use of other algorithms such as CCITT4 or JBIG. In addition, the text search engine can integrate different functionalities, for example different linguistic techniques or fuzzy logic. The presentation of the thumbnails can vary by generalizing the mode function. It is also possible to limit the operation to accessing files in PDF or another format, or to unify the ergonomics of access to mixed media, PDF and scanned images. Finally, the recognition languages can be generalized so that detection is localized not only for languages using the Latin alphabet (French, English, Italian, etc.) for texts written in these languages, but also for languages with special characters (Russian, Greek, etc.) or ideograms (Japanese, Chinese).

Claims

1. A dynamic method for automatically putting on line, on wired networked user stations, a collection of documents available on paper, characterized in that it consists in (i) setting up an industrial production line carrying out, on pages (40) from digitized or digital documents, processing to improve document quality and to extract text information, the geolocation and indexing of that information, and the compression of these pages, to form a search engine, (ii) presenting to the user who has submitted to the search engine a query of at least one word (20), bearing on the full-text information and the associated metadata, a result list in the form of thumbnails (31, 32) dynamically generated and centered on the searched word(s) (20), thus isolating a context of use in the page according to a given presentation mode, and (iii) providing through a plug-in an accelerated presentation of the page (40) to the user for reading, whatever its resolution.

2. Dynamic method according to claim 1, wherein the line is fed from files obtained by scanning paper and/or from PDF or office digital files.

3. Dynamic method according to claim 2, wherein the search results are sorted according to the font of at least one keyword (20) searched for and found, in order to reconstruct the notion of title.

4. Dynamic method according to any one of the preceding claims, wherein an encryption function is applied to the image (50).

5. Dynamic method according to any one of the preceding claims, wherein filtering is performed on types of descriptive metadata fields defined and filled in beforehand.

6. Dynamic method according to claim 1, wherein the image compression is of the progressive pyramidal type.

7. Dynamic method according to claim 1, wherein the presentation mode of the thumbnails is selected from short thumbnails (31), long thumbnails (32) and a mixed presentation mode (30) in which the thumbnail is associated with a reduced representation of the document page as a whole.

8. Dynamic method according to any one of the preceding claims, wherein interactivity functions in the plug-in make it possible to make areas (51) of the image (50) sensitive so that they refer to hyperlinks, or to graphically select an area of the image.

9. Dynamic method according to any one of the preceding claims, wherein the search engine's documentary tools allow an appropriation of the detected documentary collection, namely "my documents" (60) to constitute thematic files and "my alerts" (70) to notify the user as soon as a new document is recognized by the search request.