EP1733324A1 - Method for finding data, search engine and microprocessor therefor

Method for finding data, search engine and microprocessor therefor

Info

Publication number
EP1733324A1
EP1733324A1 EP05742860A EP05742860A EP1733324A1 EP 1733324 A1 EP1733324 A1 EP 1733324A1 EP 05742860 A EP05742860 A EP 05742860A EP 05742860 A EP05742860 A EP 05742860A EP 1733324 A1 EP1733324 A1 EP 1733324A1
Authority
EP
European Patent Office
Prior art keywords
information
document
character string
documents
stored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP05742860A
Other languages
English (en)
French (fr)
Inventor
Alain Nicolas Piaton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from FR0402939A (FR2868178B1)
Priority claimed from FR0409271A (FR2874719B1)
Application filed by Individual
Publication of EP1733324A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Definitions

  • the present invention relates to a method of searching for information in documents stored in an electronic memory.
  • The invention also relates to a microprocessor for implementing this method and to a search engine. More precisely, the invention relates to a method for searching for information of the type comprising the following steps: - selection of at least one document from the stored documents, from a request comprising at least one predetermined character string, then - extraction of a result for display in the form of an overview of information relating to the selected document, and - prior to the selection and extraction steps, generation of a table representing the stored documents, comprising a character string comprising at least part of the information of the stored documents.
  • a method is known.
  • The invention aims to remedy these drawbacks by providing a method of searching for information that allows a user to view quickly and effectively the content of documents selected in response to a request he or she has made.
  • the subject of the invention is therefore a method of searching for information of the aforementioned type, characterized in that, during the extraction step, the result is generated using the representation table, from information contained in the character string of the representation table deemed relevant according to the request.
  • the predetermined character string of the request is compared with the character string of the representation table, in particular by sequential scanning of the representation table, in order to select at least one document from the stored documents.
  • the representation table is also used as an indexing table for stored documents. It is therefore used both for viewing the content of the documents stored and for searching for these documents from a request comprising at least one predetermined character string. Sequential scanning of the character string contained in the representation table makes it possible to significantly increase the efficiency of the search.
  • At least one stored document being of the electronic mail type and comprising several distinct headings chosen from the set of elements consisting of an address of a sender, an address of a recipient, a header, a message body, and at least one attachment,
  • the character string of the representation table comprises at least part of the text-type information of each heading of the document of the electronic mail type.
  • the character string of the representation table also includes, for each stored document, identification information for this document.
  • viewing and searching for information can take account of this identifying information.
  • at least part of the result of the search for information is stored in memory.
  • the part of the result of the search for information stored in memory is stored in a file capable of comprising several results of several searches.
  • the information search method comprises the following steps: - extracting the information contained in the character string of the representation table deemed to be relevant as a function of the request, - transmission of this information to a remote terminal via a data transmission network, and the display of the result is carried out by the remote terminal.
  • A conversion can be carried out so that any displayable character of a text-type area of the stored documents is coded: - either on one byte; - or using a tag inserted in the representation table and followed by a one-byte code.
  • the data set includes data for assistance in presenting the overview, used during the step of extracting the result.
  • the additional data are, for example, layout information making it possible to improve the visualization of the content of the selected documents, in particular to remain faithful to the layout of the content as it was presented in the document itself.
  • The data set can also include data to assist in the selection of at least one document. One can thus imagine additional data inserted using markers for accents, synonyms, phonetic writing, etc. This selection aid data makes it possible to select documents comprising at least one character string close to the predetermined character string defined in the request.
  • An information search method may also include one or more of the following characteristics: - each tag inserted in the character string of the representation table comprises at least one escape character coded on a byte not belonging to the displayable characters appearing in the first 128 positions of the ASCII coding table, - at least one numeric-type information zone, coded on a predetermined number of bytes and delimited by at least one tag indicating this numeric zone, is inserted into the character string of the representation table, - the tag indicating the numeric zone is also a tag indicating a convention for presenting this numeric zone, - the stored documents being distributed among different types of documents, a set of tags intended to be inserted in the character string of the representation table is defined for each type of document, each tag in this set having a specific meaning for this type of document, - at least one set of data expressed in phonetic writing, delimited by at least one tag indicating phonetic writing, is inserted into the character string of the representation table, - at least one indication tag is inserted into the character string of the representation table that a predetermined
  • An information search method may include the characteristic that: - each stored document comprising information distributed in several distinct predetermined headings common to all the stored documents, the result is displayed in the form of an overview comprising a preview zone for each common distinct heading and comprising a list of documents initially selected for the information they contain deemed relevant according to the search, - each preview zone can be deactivated, and - when at least one preview zone is deactivated, each document initially selected is kept in the displayed list only for the relevant information that this document includes in at least one heading corresponding to at least one preview zone which remains activated.
  • the information search process allows the user to make a quick choice from a set of selected documents provided in response to their request.
  • The invention also relates to a search engine for information in documents stored in an electronic memory, comprising: means for generating a table representing the stored documents, this table comprising a character string comprising at least part of the information of the stored documents, and means for selecting at least one document from the stored documents, from a request comprising at least one predetermined character string, characterized in that it comprises means for extracting a result using the representation table, from information contained in the character string of the representation table deemed relevant according to the request, with a view to displaying this result in the form of an overview of information relating to the selected document.
  • the invention also relates to a microprocessor comprising programmed instructions for the implementation of an information search method as defined above.
  • a microprocessor according to the invention may further comprise means for storing at least one dictionary table comprising a set of words in a predetermined language, each word being associated in this dictionary table with grammatical analysis data.
  • - Figure 1 schematically represents the successive steps implemented for generating a table representing stored documents, in an information search method according to the invention
  • - Figure 2 schematically shows an example of a character string contained in the representation table of Figure 1
  • - Figures 3 and 4 show viewing windows of a selection of documents, displayed during the implementation of a particular embodiment of the invention
  • - Figure 5 schematically shows a device comprising a master microprocessor and several coprocessors for the rapid execution of a method according to the invention.
  • A method according to the invention uses the following elements: - a set of documents on which searches are to be carried out, namely all types of documents comprising text, such as documents from word processors or spreadsheets (noted Doc), or e-mails (noted Mail) possibly with their attachments (noted Att, Zip), these documents being stored either on the computer from which the searches are executed, or on internal company networks, or externally and accessible via the Internet, - a set of tables, called index tables, to carry out searches, and - a set of tables representing the stored documents, called overview tables, to allow quick display of results.
  • index and overview tables are the same tables which are used both to perform the search and to display the overviews, that is to say, it is the index tables which are used as representation tables for stored documents to display overviews. Thereafter, these tables will be called index and overview tables (denoted TIA).
  • A search method according to the invention requires the following steps: - generation of an index and overview table (i.e., a table representing the stored documents) comprising at least part of the information of the stored documents, - search of documents by selecting at least one document from the stored documents, from a request comprising at least one predetermined character string, - display of a result in the form of an overview of information relating to the selected document(s).
  • The index and overview table should allow quick searching and quick viewing of previews. It contains for each document the following two types of information: - on the one hand, the full or partial content of the document in uncompressed text format, that is to say any element which can be displayed in text form (in the case of e-mails, the content of attached documents, whether in compressed form or not, is also stored in the index and overview table), - on the other hand, elements identifying the document such as the name of the document, its subject, a date, its length, keywords, a path to the document on the disk, etc.
  • Each document (noted Tia-doc) is represented by a header (noted Tia-ld) followed by all the fields in text format (noted Tia-txt) that can be selected during a search for information.
  • a system of separators is used between the different documents, and between the different elements inside each document in order to allow rapid scanning of the index and overview table.
  • The Tia-ld header gathers numeric data, as well as texts on which there is no search: - a 0xff separator character, or any other character which cannot appear in a text file, located at the start of the header, - the length of the header, - numeric data such as block lengths and various counters, - numeric data likely to be searched, subsequently called headings, such as the length or the date of the document, - alphabetical data which are not part of the search fields (machine name, customer, language, conversion tables, etc.).
  • The Tia-txt text part gathers the contents, keywords, and identification elements of the documents.
  • These different elements, hereinafter called headings, are stored one after the other in the form of text, and they are separated by separator characters.
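The record structure described above (a 0xff separator, a header carrying its own length, then text headings) can be walked without inspecting the numeric header bytes. The following is only a minimal sketch under stated assumptions: the one-byte header length placed right after the separator, and the names `tia_text_offset` and `tia_next_record`, are hypothetical and do not come from the source, which does not specify the exact byte layout.

```c
#include <assert.h>
#include <stddef.h>

#define TIA_SEP 0xFFu /* separator character named in the text */

/* Assumed (hypothetical) record layout:
 * [0xFF][hdr_len][hdr_len - 2 numeric header bytes][text headings ...]
 * hdr_len counts the whole header, separator included. */

/* Offset of the text part of the record starting at pos,
 * or len if the record is malformed. */
size_t tia_text_offset(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    if (pos + 2 > len || buf[pos] != TIA_SEP)
        return len;
    size_t hdr_len = buf[pos + 1];
    if (hdr_len < 2 || pos + hdr_len > len)
        return len;
    return pos + hdr_len;
}

/* Offset of the record following the one at pos, or len if there is none.
 * The header is skipped by its stored length, so numeric bytes equal to
 * 0xFF inside the header are not mistaken for a separator. */
size_t tia_next_record(const unsigned char *buf, size_t len, size_t pos)
{
    size_t i = tia_text_offset(buf, len, pos);
    while (i < len && buf[i] != TIA_SEP)
        i++;
    return i;
}
```

Storing the header length is what makes the fast scan safe here: only the text part, which by assumption never contains 0xff, is searched byte by byte for the next separator.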
  • the content of each of the attachments of the electronic mails is stored in a separate index and overview table (denoted TIA-Att) called the index table of attachments and a given document appears there only once, even if it belongs to several emails or to several compressed Zip files themselves attached as an attachment.
  • the index and overview tables are generated and then regularly updated using converters (marked Conv) which, from the starting documents (word processing, spreadsheets, presentations, e-mails, etc.) extract all the useful elements for consulting these tables when searching for information, and subsequently for displaying the results in the form of an overview.
  • Search software starts by scanning a file index table on the hard drive of the computer, commonly known as the FAT, or an equivalent table, to check whether the file name, file type, length or date meet the search criteria. If this is the case, and if the search must be carried out on words contained in the documents themselves, then the contents of each of the documents which correspond to these first search criteria are scanned sequentially, to check whether the words sought appear in this document.
  • One begins by scanning the index table of attachments TIA-Att and, each time an attachment contains the word or words sought, an identifier of this attachment is temporarily stored in a table, which makes it possible, subsequently, during a scan of the TIA-Mail electronic mail table, to identify the e-mails that have attachments containing the searched words.
  • the information relating to the documents selected at the end of the search is displayed in the form of a table known as the table of documents found, comprising one or more rows for each document found and several columns each corresponding to one or more of said headings.
  • When a row in the table is selected, for example an e-mail, the Tia-txt content of this e-mail is extracted from the TIA index and overview table and then displayed in a separate window called the preview window.
  • When you move to the next line of the table, the content of this new e-mail is displayed in the preview window.
  • This container file, like a mail folder, can be transmitted to another person either as a file via an internal company network, or as an attachment to an electronic mail.
  • the recipient will be able to see the content of this container file, displayed in the form of a table, in a similar way to the table of documents found, each line of the container file corresponding to one line of the table of documents found.
  • the container file can in turn be modified or enriched with other documents, then transmitted to other recipients.
  • When used as an attachment to an e-mail, it can, in turn, be scanned by the search engine, and the search results can be inserted into a new container file.
  • The information relating to the documents found at the end of the search is displayed in the form of a preview comprising a preview zone for each heading and comprising a list of documents initially selected for the information they contain deemed relevant according to the search. More precisely, they are displayed for example in the form of a table comprising one or more rows for each document selected and several columns each corresponding to one or more of said headings.
  • Figure 3 shows an example of a search result in e-mails in which the lines L1, L2, L3 and L4 contain a sequence of characters searched for "Paris".
  • Each column includes both the title of the corresponding heading and a check box, or an equivalent device, operating as follows: - if the box is checked, the column is activated and all the lines which contain the word or words sought in the heading corresponding to this column are displayed, - if not, the lines which contain the word or words sought only in the heading corresponding to this column are hidden.
  • the lines which contain the relevant information, namely the sequence "Paris"
  • Only the lines which contain the sequence sought in at least one of the activated columns are displayed, which is different from the classic tab device, which consists of displaying only the lines which contain a sequence sought in a given heading.
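The check-box rule above amounts to a bitmask test: a row stays visible when the set of headings in which it matched intersects the set of activated columns. The sketch below illustrates this under assumptions; the bit assignment per heading and the function names `row_is_displayed` and `count_displayed` are hypothetical and not taken from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit i of match_mask: the row contains the searched sequence in heading i.
 * Bit i of active_mask: the check box of column i is ticked. */
bool row_is_displayed(unsigned match_mask, unsigned active_mask)
{
    return (match_mask & active_mask) != 0u;
}

/* Number of rows that remain visible for a given set of active columns. */
size_t count_displayed(const unsigned *row_masks, size_t n, unsigned active_mask)
{
    assert(row_masks != NULL || n == 0);
    size_t shown = 0;
    for (size_t i = 0; i < n; i++) {
        if (row_is_displayed(row_masks[i], active_mask))
            shown++;
    }
    return shown;
}
```

A row that matched only in a deactivated column has an empty intersection and is hidden, while a row that matched in several headings survives as long as one of them is still activated.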
  • The display in the preview window shows only the plain text of a selected document, exactly like e-mails in plain format, that is to say without its layout elements, with neither color nor underlined or bold words, whereas it may be desirable to display these previews with an improved presentation, close to or equivalent to the initial presentation of the selected document. Furthermore, this process is not entirely satisfactory when searching on words with accents: indeed, if we search for the word "improved", documents containing only "improves" will not be detected. In some cases, we would also like to find documents from a synonym, or from an equivalent concept, for example 'finance' instead of
  • a tag includes at least one escape character, preferably outside the displayable characters appearing in the first 128 positions of the ASCII coding table, such as 0x1 (hexadecimal notation), 0x2, 0x80, ... (this character contains both a notion of tag type and a notion of tag length).
  • It can also include one or more further characters, preferably different from the zero byte 0x0, which is traditionally reserved for the end of a character string.
  • Four types of tags are used, called respectively: - layout tags, - advanced search tags, - process launch tags, - formatting or alert tags.
  • Tags are used to insert layout information. For example, to display the word "horizontal" we will use the sequence: "ho-0x8-G-riz-0x8-S-o-0x8-g-nt-0x8-s-al", in which: - the escape character "0x8" means "start or end tag" with a tag length of 2 characters (escape character included), - the next character "G" corresponds to "start of bold", "g" to "end of bold", "S" to "start of underline", "s" to "end of underline" (the characters "-" have been added for easier understanding, but do not appear in the character string of the table representing stored documents).
  • Tags of this type can also be used to change the font, the font size, indent paragraphs, change the line spacing, indicate a page change, etc.
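To recover the plain text of the "horizontal" example, a reader of the table only needs to drop every two-character tag introduced by the escape character 0x8. The following is a minimal sketch assuming that all layout tags are exactly two bytes long, as in that example; the function name `strip_layout_tags` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAG_ESC 0x08 /* escape character used in the "horizontal" example */

/* Copies src to dst while dropping every 2-byte layout tag
 * (escape character followed by one code character such as G, g, S, s).
 * Returns the number of characters written, terminator excluded. */
size_t strip_layout_tags(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);
    size_t n = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        if ((unsigned char)src[i] == TAG_ESC) {
            if (src[i + 1] == '\0')
                break;  /* truncated tag at end of string */
            i++;        /* skip the code character as well */
            continue;
        }
        if (n + 1 < dst_size)
            dst[n++] = src[i];
    }
    dst[n] = '\0';
    return n;
}
```

Applied to the bytes of the example (`ho`, tag `G`, `riz`, tag `S`, `o`, tag `g`, `nt`, tag `s`, `al`), this yields the plain word "horizontal".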
  • A set of tags using 2, 3 or more characters makes it possible, starting from an MS Word or Acrobat Reader PDF document, to create a sequence of characters which allows both: - a quick scan, as specified below, - the generation of a file in "rtf" format substantially equivalent to the starting document, which in most cases avoids keeping both the overview table and the starting MS Word file.
  • MS Word, Visual C ++, WinSdk, MSN, rtf are formats and trademarks registered by Microsoft Inc.
  • Acrobat Reader PDF is a trademark registered by Adobe Inc.
  • Advanced search tags. 1) Use of tags for accentuation. It is useful to be able to search on a word, taking accents into account. For example, if you search with the word "andré", it is useful to be able to find documents that contain the word without an accent, for example an e-mail address such as "andre.dupont@xxx.com", or with a misspelling: "andrè". This information can be coded as follows: "andr-é-0x7-e-0x7-è", the tag "0x7" signifying that the following character ("e" or "è") is equivalent to the previous one ("é"). 2) Use of tags to repeat the same character n times.
  • We can solve the problem with tags as follows: first, in the search string, we replace the sequences of spaces with a single space, or better with the non-displayable character 0x1, and in the string to scan, we perform the following conversion: - for sequences of fewer than 6 spaces, we use tags consisting of a single character, namely 0x1, 0x2, 0x3, 0x4, 0x5 (without any other character after), which solves with a single character this very common problem when a text is displayed justified on the right and on the left.
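The conversion of space runs in the string to scan can be sketched as below. This is an illustration under assumptions: the source only describes runs of fewer than 6 spaces, so splitting longer runs into chunks of at most 5 is a choice made here, and the function name `compress_spaces` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Replaces each run of 1 to 5 spaces by a single byte whose value is the
 * run length (0x1 .. 0x5). Longer runs are split into chunks of at most 5,
 * which is an assumption of this sketch. Returns the number of bytes
 * written, terminator excluded. */
size_t compress_spaces(const char *src, unsigned char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);
    size_t n = 0, i = 0;
    while (src[i] != '\0') {
        size_t run = 0;
        while (src[i + run] == ' ')
            run++;
        if (run == 0) {
            if (n + 1 < dst_size)
                dst[n++] = (unsigned char)src[i];
            i++;
            continue;
        }
        i += run;
        while (run > 0) {
            size_t chunk = run > 5 ? 5 : run;
            if (n + 1 < dst_size)
                dst[n++] = (unsigned char)chunk;
            run -= chunk;
        }
    }
    dst[n] = '\0';
    return n;
}
```

With this coding, a comparison routine can treat any byte in the range 0x1 to 0x5 as "one or more spaces", so a query normalized to a single 0x1 matches justified text regardless of how many spaces were inserted.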
  • The representation table will be enriched with tags and words making it easier to perform the other content analysis operations, this enrichment being able to be carried out when an element of the representation table is created, or when creating a "secondary table of representations".
  • tags for metadata.
  • the tag 0x15 is of a similar nature and also makes it possible to associate a concept such as the action of financing. In this way during the initial creation, or subsequently during the creation of a "secondary table of representations", it is possible to add to a document a whole series of metadata to allow intelligent search on the content. 5) Use of tags for phonetic writing. If you want to interface the search with a voice recognition module, or to facilitate automatic analysis, it is useful to use phonetics.
  • A tag system which takes this into account: "0x3-1-0-0-0-0-0-0x4-1-.-0-0-0-,-0-0-0x5-1-,-0-0-0-.-0-0-0x6".
  • the tag 0x3 indicates that the next field is an amount expressed in cents.
  • the tag 0x4 indicates that the next field is an amount displayed with European conventions.
  • the tag 0x5 indicates that the next field is an amount displayed with the American conventions.
  • the tag 0x6 indicates the end of the area relating to this amount.
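An amount zone coded with the tags 0x3 to 0x6 can be read back either as a number (for comparisons) or as one of its display forms. The sketch below assumes the text coding of the example, with the value in cents written as decimal digits after 0x3; the function names `amount_subfield` and `amount_cents` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TAG_CENTS = 0x03, TAG_EURO = 0x04, TAG_US = 0x05, TAG_END = 0x06 };

static int is_amount_tag(unsigned char c)
{
    return c >= TAG_CENTS && c <= TAG_END;
}

/* Returns a pointer to the characters that follow `tag` inside the amount
 * zone starting at field, and stores their count in *out_len.
 * Returns NULL if the tag does not occur before the end tag. */
const char *amount_subfield(const char *field, unsigned char tag, size_t *out_len)
{
    assert(field != NULL && out_len != NULL);
    for (const char *p = field; *p != '\0' && (unsigned char)*p != TAG_END; p++) {
        if ((unsigned char)*p == tag) {
            const char *start = p + 1;
            size_t n = 0;
            while (start[n] != '\0' && !is_amount_tag((unsigned char)start[n]))
                n++;
            *out_len = n;
            return start;
        }
    }
    return NULL;
}

/* Numeric value in cents, or -1 if the zone has no valid cents sub-field. */
long amount_cents(const char *field)
{
    size_t n = 0;
    const char *digits = amount_subfield(field, TAG_CENTS, &n);
    if (digits == NULL || n == 0)
        return -1;
    long value = 0;
    for (size_t i = 0; i < n; i++) {
        if (digits[i] < '0' || digits[i] > '9')
            return -1;
        value = value * 10 + (digits[i] - '0');
    }
    return value;
}
```

A search on the amount can then compare `amount_cents` values, while the preview simply copies the European or American sub-field depending on the display convention wanted.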
  • each of the four characters following the tag can take any value, including the binary zero which usually signals the end of a character string.
  • This coding mode can be used for any type of digital information, signed or unsigned, on 16, 64, 128 bits, floating point, etc.
  • The comparison between two zones can consist in testing the equality between these two zones but, more generally, all the logical operations between two numeric zones can be carried out (less than, greater than, logical OR, exclusive OR, etc.).
  • the information will be stored: - either in text form, as explained above.
  • Tags can specify the display mode, whether it is a date expressed in local time, or better in universal time.
  • Process launch tags Use of tags to trigger an analysis process.
  • a correlation between the criteria provided by the user and the presence of certain words in the document can activate a process of content analysis.
  • tags can be entered on the fly, and duplicated in a memory area for further processing to analyze the content and allow a more relevant search. More generally, we can use tags to give specific meanings to certain fields, such as an account number, a quantity, an amount, a date, an article code, a pointer to an object, a notion of hierarchy, of parent, child, brother, that is to say all the notions that can be found in a table or a file in a computer containing a succession of records of different types.
  • each type of record that is to say each type of document stored in the computer, can be associated with a set of tags with specific meanings.
  • In a complex operation, for example editing a bank account statement, involving several pieces of information such as the name and address of the bank account holder and the list of all movements for a period, it may be necessary to consult several different tables representing the stored documents, and the meaning of the tags may change during the different phases of this operation.
  • One way to solve the problem is to store, either at the level of the representation table itself, or at the level of each record of the representation table, information (or a code) making it possible to know the meaning of the set of tags that should be used at a given moment.
  • One can use a tag followed by a 32-bit numeric zone corresponding to a length L to indicate that the following L characters correspond to a zone without text, for example an image in a given format, a sound, a sequence of images, a compressed or coded area in ".zip" format, a sequence of bytes, an MS Excel table, and in general a sequence of characters on which there is no search. Tags can also be used to delimit different coding areas.
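During a scan, a length-prefixed zone of this kind can be jumped over in one step rather than examined byte by byte. The sketch below rests on two assumptions that the source does not state: the tag value `TAG_OPAQUE` (0x1B) is purely hypothetical, and the 32-bit length is taken to be stored in little-endian order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_OPAQUE 0x1Bu /* hypothetical tag value, not given in the source */

/* If buf[pos] starts a non-text zone (tag + 32-bit little-endian length L
 * + L bytes), returns the offset just after the zone. Returns pos unchanged
 * when there is no such tag at pos, and len when the zone is truncated. */
size_t skip_opaque(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);
    if (buf[pos] != TAG_OPAQUE)
        return pos;
    if (len - pos < 5)
        return len;
    uint32_t zone_len = (uint32_t)buf[pos + 1]
                      | (uint32_t)buf[pos + 2] << 8
                      | (uint32_t)buf[pos + 3] << 16
                      | (uint32_t)buf[pos + 4] << 24;
    size_t remaining = len - pos - 5;
    if (zone_len > remaining)
        return len;
    return pos + 5 + (size_t)zone_len;
}
```

Because the length is explicit, the skipped bytes may contain any value, including 0x0 or 0xff, without confusing the scanner.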
  • A representation table as described above can be used in several ways: - launch an exact-match search: all the fields designated by the tags are ignored; this is, for example, the default mode of use; - display a document in a preview window, or reconstruct the original document: for this, all the tags are ignored, except those for formatting; - launch a more sophisticated search, with a capacity to interpret the document: all the advanced search tags are used, including the process launch tags useful for implementing the most advanced known techniques in this field; - finally, in a completely different field, thanks to all of these techniques, use this table as a real database with fields of all kinds: numeric-type zones stored in decimal or hexadecimal form, pointers, areas to start processes, etc.
  • API from the English “Application Program Interface”.
  • A non-exhaustive list of these APIs is given below, namely: - StrStrEx, by analogy with the function "strstr" which exists in most programming languages, and which consists in searching, in a string of characters, for the next occurrence of a given substring; - ExtractEdit, to extract from a string the text to be edited with only the tags relating to the layout (the case where plain text without any tag is wanted is a special case of this); - ExtractData, to extract data from a string into a set of fields according to the formats usually used in IT (32-bit or 64-bit integer, floating point format, etc.); - MakeEditStr, reverse operation of ExtractEdit, to convert a set of text documents (such as MS Word, rtf, etc., or e-mails in raw or html format) into a representation table with formatting tags, and possibly those allowing searches based on content analysis.
  • LPCSTR StrStrEx (LPCSTR ptrStart, LPCSTR ptrSubChain, UINT uiParameter, STRSTREX * strExtended) in which: LPCSTR ptrStart is the starting point in the chain to explore, LPCSTR ptrSubChain the substring sought, UINT uiParameter the scan mode, STRSTREX * strExtended the address of a structure used to specify data, conversion formats or to communicate with other processes.
  • the scanning mode is a set of 32 bits or more which, combined, specify how the character string should be interpreted.
  • - STREX_SKIP_BAL -1 Ignore the case and all the tags
  • - STREX_WITH_CASE 1 Respect the case
  • - STREX_SKIP_EDIT 2 Ignore the tags relating to the layout
  • - STREX_SKIP_ANALYSIS 4 Ignore the tags for advanced search
  • - STREX_SKIP_PROCESS 8 Ignore process launches
  • - STREX_SKIP_FORMAT 16 Ignore formatting tags
  • - STREX_FAST_DUPLIC 32 Duplicate certain words on the fly
  • - STREX_ANALYSIS_1 64 Use advanced search tags type 1
  • - STREX_ANALYS Use type 2 advanced search tags
  • - etc.
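The scan-mode values listed above are powers of two, so they can be combined with a bitwise OR and tested individually. The sketch below declares only the constants whose values appear in the list; the helper names `strex_case_sensitive` and `strex_skips` are hypothetical, and treating -1 as a special value checked before the individual bits is an assumption, since the source does not say how it interacts with the other flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as listed in the text. */
enum {
    STREX_WITH_CASE     = 1,  /* respect the case */
    STREX_SKIP_EDIT     = 2,  /* ignore layout tags */
    STREX_SKIP_ANALYSIS = 4,  /* ignore advanced search tags */
    STREX_SKIP_PROCESS  = 8,  /* ignore process launches */
    STREX_SKIP_FORMAT   = 16, /* ignore formatting tags */
    STREX_FAST_DUPLIC   = 32, /* duplicate certain words on the fly */
    STREX_ANALYSIS_1    = 64  /* use advanced search tags type 1 */
};

/* Listed as -1: ignore the case and all the tags. Handled here as a
 * special value (assumption of this sketch). */
#define STREX_SKIP_BAL ((unsigned)-1)

/* True when the comparison must respect upper and lower case. */
bool strex_case_sensitive(unsigned mode)
{
    if (mode == STREX_SKIP_BAL)
        return false;
    return (mode & STREX_WITH_CASE) != 0u;
}

/* True when the tag category designated by skip_flag must be ignored. */
bool strex_skips(unsigned mode, unsigned skip_flag)
{
    return mode == STREX_SKIP_BAL || (mode & skip_flag) != 0u;
}
```

For instance, `STREX_WITH_CASE | STREX_SKIP_EDIT` would request a case-sensitive scan that ignores layout tags but still interprets the other tag categories.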
  • strExtended is the address of a structure allowing to specify data, conversion formats or to communicate with other processes, as does the BROWSEINFO structure used by the known API SHBrowseForFolder (cf. WinSdk of Visual C ++).
  • the command “0x17-password-1 -0x17” can launch an authentication program designated in a “Callback” type command.
  • the returned value is: - a pointer to the next occurrence found, - 0x0 if no string was found, or - a symbolic value in the event of an error.
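To give an idea of the scanning behaviour, here is a much reduced sketch of a tag-aware substring search. It is not the StrStrEx function itself: the name `find_skipping_tags` is hypothetical, plain `const char *` is used instead of LPCSTR to stay portable, only two-byte tags introduced by the escape character 0x8 are recognized (an assumption borrowed from the layout example), and the "symbolic value in the event of an error" is not modelled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define TAG_ESC 0x08 /* escape character of the 2-byte layout tags */

/* Moves past any consecutive 2-byte tags starting at p. */
static const char *skip_tags(const char *p)
{
    while ((unsigned char)p[0] == TAG_ESC && p[1] != '\0')
        p += 2;
    return p;
}

static int same_char(char a, char b, int respect_case)
{
    if (respect_case)
        return a == b;
    return tolower((unsigned char)a) == tolower((unsigned char)b);
}

/* Returns a pointer to the first text character of the next occurrence of
 * needle in hay, ignoring tags embedded in hay, or NULL if none is found. */
const char *find_skipping_tags(const char *hay, const char *needle, int respect_case)
{
    assert(hay != NULL && needle != NULL);
    if (*needle == '\0')
        return hay;
    const char *p = hay;
    for (;;) {
        p = skip_tags(p);          /* never start a match inside a tag */
        if (*p == '\0')
            return NULL;
        const char *h = p;
        const char *n = needle;
        while (*n != '\0') {
            h = skip_tags(h);      /* tags may interrupt the word */
            if (*h == '\0' || !same_char(*h, *n, respect_case))
                break;
            h++;
            n++;
        }
        if (*n == '\0')
            return p;
        p++;
    }
}
```

Like `strstr`, the sketch returns a null pointer when nothing is found, and the caller can resume the search from the returned position plus one to obtain the following occurrences.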
  • the StrStrEx function must make the best use of the characteristics of modern microprocessors and the possibilities offered by electronic component technology.
  • ExtractEdit function. Description of the ExtractEdit function and operating mode. int ExtractEdit (LPCSTR ptrStart, LPSTR * ptrEditChain, UINT uiParameter, STRSTREX_ED * strEditInfo) in which: LPCSTR ptrStart is the address of the chain to extract, LPSTR * ptrEditChain the address of a pointer to the chain to be edited, UINT uiParameter the editing mode (no layout, layout for display, layout to restore an MS Word document in rtf format, etc.), STRSTREX_ED * strEditInfo the address of a structure to communicate more information on the conversion mode and format.
  • the ExtractEdit function uses a large part of the elements of StrStrEx.
  • ExtractData (LPCSTR ptrStart, void * ptrExtractedData, STRSTREX_EXTRACT * strExtractInfo) in which: LPCSTR ptrStart is the address of the string to extract, void * ptrExtractedData the address of a pointer to the object to be created, STRSTREX_EXTRACT * strExtractInfo the address of a structure to communicate the format of the object to be produced, and all the processing necessary to carry out the conversion.
  • the ExtractData function uses a large part of the elements of StrStrEx.
  • MakeEditStr and MakeDataStr are essentially conversion programs which do not pose any particular problem for a person skilled in the art.
  • LPCSTR StrStrExMultiple (LPCSTR ptrStart, LPCSTR * ptrSubChain, STRSTREX_MUL * strExtended) in which: LPCSTR ptrStart is the starting point in the chain to explore, LPCSTR * ptrSubChain a set of substrings searched for, STRSTREX_MUL * strExtended the address of a structure used to specify the parameters of this function. The returned value is: - a pointer to the next occurrence found, - 0x0 if no string was found, or - a symbolic value in the event of an error.
  • the StrStrExMultiple function is used to handle the case of a multiple document such as an email.
  • An e-mail contains information about the sender, the recipients, the people in copy, the subject, the content of the e-mail, as well as other information, and this e-mail is stored in the overview table in the form of a header, followed by the various strings for the sender, recipients, people in copy, subject and content of the e-mail, said header itself comprising a start tag and said other information.
  • InitStrStrEx Description of the InitStrStrEx function and operating mode.
  • InitStrStrEx (STRSTREX_BALISES * strBalises, STRSTREX_PROCESS * strProcess, STRSTREX_CONV_CHAR * strConvChar, STRSTREX_MISC * strMisc) in which: STRSTREX_BALISES * strBalises is the address of a structure specifying the values of the escape characters, their category (escape, layout ...), their action, links with processing, etc., STRSTREX_PROCESS * strProcess is the address of a structure specifying the information for resolving links with external or internal processing used by StrStrEx and the other APIs described above, STRSTREX_CONV_CHAR * strConvChar is the address of a structure specifying the list of character sets used (Unicode, ASCII, etc.), the conversion tables between these codings, the rules for passing from capital letters to lowercase, etc.
  • strMisc is the address of a structure specifying the other data such as version, languages, programming languages, system of 'exploitation (Windows, Unix, Linux ...), coding conventions (xml, rtf, MS Word, etc.), limits in processor speed, memory size, integer size, etc. This function is generally launched at the start of any program execution using the StrStrEx API and its derivatives.
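As an illustration of the kind of one-time setup InitStrStrEx is described as performing, the following sketch builds an uppercase-to-lowercase conversion table and records the tag delimiters before any scan starts. All type and function names here (conv_table, tag_config, scan_context, init_scan_context) are hypothetical, and the layout is a heavily reduced assumption; the real STRSTREX_* structures are not detailed in the text.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, much-reduced counterpart of STRSTREX_CONV_CHAR. */
typedef struct {
    unsigned char fold[256];   /* fold[c] is the lowercase form of byte c */
} conv_table;

/* Hypothetical counterpart of STRSTREX_BALISES: the single bytes that are
   assumed to open and close a layout tag. */
typedef struct {
    char tag_open;
    char tag_close;
} tag_config;

typedef struct {
    conv_table conv;
    tag_config tags;
    int ready;                 /* set to 1 once initialisation has run */
} scan_context;

/* Fills the context once, at program start. Returns 0 on success and -1
   if ctx is NULL. Case folding follows the current C locale. */
int init_scan_context(scan_context *ctx, char tag_open, char tag_close)
{
    int c;

    if (ctx == NULL)
        return -1;

    for (c = 0; c < 256; c++)
        ctx->conv.fold[c] = (unsigned char)tolower(c);

    ctx->tags.tag_open = tag_open;
    ctx->tags.tag_close = tag_close;
    ctx->ready = 1;
    return 0;
}
```

Precomputing the table means the scanning loop can compare characters with a single array lookup instead of calling a conversion routine for every byte.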
  • This library can be integrated into other applications to build a search engine based on the technique of scanning a representation table as described above, which has the particularity of: - being able to integrate a preview window whose content is extracted from said table, and - thanks to the layout tags, offering in addition a presentation equivalent to that of the original documents in the majority of cases.
  • This library can also be integrated into other applications to build or analyze a container grouping together: - documents containing text such as MS Word or PDF coming from a user's local disk or local network, - e-mails with their attachments, i.e.
  • This space saving is very useful, whether to store information on disk, to generate backups, to build e-mail archives, or to transport this information over local networks or via the Internet in the form of e-mail attachments. It spares many users in large companies from having to delete their e-mails older than 6 or 12 months, which is a major annoyance for them.
  • This library can also be integrated into other applications to build the various components of messaging software in order to: - integrate a search engine with the characteristics described above, and - offer a new attachment system using a container as described above.
  • This library can also be integrated into other applications to build databases containing essentially non-modifiable information as seen in the example below.
  • A bank has a million customers, and all e-mails (including attachments), letters and documents specific to a customer represent on average twenty thousand characters (about ten full pages). All of this data, with the layout tags, plus the identifiers (branch code, account number, dates, specific texts, references to various letters, e-mail addresses, etc.) and the corresponding formatting tags, represents a maximum of 32 KB.
  • A customer has on average about twenty movements per month, and it takes on average about a hundred characters to describe an accounting movement: branch code, transaction code, account number, dates, amount, associated text such as "transfer to Mr. So-and-so" or "check No 12345", and the number of the form used to print the account statement.
  • The set of movements of a customer during one year, with the corresponding tags, represents a maximum of 32 KB.
  • The set of all this non-modifiable information, namely all the text documents in the life of a customer as well as all the accounting movements for one year, therefore represents at most 64 KB per customer, or 64 GB for the million customers, which could easily fit on the hard drive of a simple microcomputer.
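The figures in the bank example can be checked with a short worked calculation. The only assumptions added here are that one character occupies about one byte and that 1 GB is taken as 10^6 KB; the 32 KB ceilings are the ones stated in the text.

```latex
\begin{align*}
\text{movements per customer per year} &= 20 \times 12 = 240 \\
\text{characters for those movements} &\approx 240 \times 100 = 24\,000
  \quad (\le 32\ \text{KB with tags}) \\
\text{characters for documents} &\approx 20\,000
  \quad (\le 32\ \text{KB with tags and identifiers}) \\
\text{total per customer} &\le 32\ \text{KB} + 32\ \text{KB} = 64\ \text{KB} \\
\text{total for all customers} &\le 10^{6} \times 64\ \text{KB} = 64\ \text{GB}
\end{align*}
```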
  • StrStrEx, and in particular the sequence of instructions that makes it possible to ignore irrelevant characters, as in the following example: when searching for the substring "information", the string must be traversed as quickly as possible, skipping layout tags, until an uppercase or lowercase "i" is found; when one is found, it must quickly be determined whether the next useful character is an uppercase or lowercase "n".
  • microprocessor supporting FPGA technology (from the English "Field Programmable Gate Array") and create the succession of logic gates corresponding to the part of the StrStrEx function which must be executed very quickly.
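The inner loop described above can be sketched as follows. This is a simple illustrative version, not the optimized StrStrEx routine: it assumes that layout tags are delimited by '<' and '>' (the actual tag syntax is not specified here), that the text is single-byte, and the names skip_tags, same_letter and find_skipping_tags are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Advances past any run of layout tags beginning at p and returns the
   first useful character (or the terminating '\0'). An unterminated tag
   is treated as running to the end of the string. */
static const char *skip_tags(const char *p)
{
    while (*p == '<') {
        while (*p != '\0' && *p != '>')
            p++;
        if (*p == '>')
            p++;
    }
    return p;
}

/* Case-insensitive comparison of two single-byte characters. */
static int same_letter(char a, char b)
{
    return tolower((unsigned char)a) == tolower((unsigned char)b);
}

/* Returns a pointer to the first useful character of the first match of
   needle in text, ignoring case and skipping tags that interrupt the
   match; returns NULL if there is no match or an argument is invalid. */
const char *find_skipping_tags(const char *text, const char *needle)
{
    const char *start;

    if (text == NULL || needle == NULL || *needle == '\0')
        return NULL;

    for (start = skip_tags(text); *start != '\0'; start = skip_tags(start + 1)) {
        const char *t = start;
        const char *n = needle;

        while (*n != '\0') {
            t = skip_tags(t);              /* ignore tags inside the word */
            if (*t == '\0' || !same_letter(*t, *n))
                break;
            t++;
            n++;
        }
        if (*n == '\0')
            return start;                  /* every needle character matched */
    }
    return NULL;
}
```

With the text "<b>In</b>for<i>mation</i> retrieval", a search for "information" succeeds even though two tags interrupt the word. The loop is deliberately naive; it is this short, branch-heavy sequence that the description proposes to accelerate in hardware.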
  • Another possibility is to use a microprocessor which is capable, in a few clock cycles, of executing a sequence of several tens, hundreds or thousands of instructions which are not stored in the memory of the machine and loaded each time into the cache memory of the microprocessor, but are etched at least in part in the microprocessor itself, in the manner of specialized components such as graphics processors, which allow the rapid display of a high-definition image.
  • At least part of the API library can either be added to an existing microprocessor, which makes it possible to obtain a rapid scan with a simple microcomputer, for example to perform searches in e-mails, or be placed in a separate microprocessor, called the Co-Pi co-processor, which accesses the machine's memory and executes its instructions under the control of another, master microprocessor MainProc, as does the graphics processor of a microcomputer (see Figure 5). It is also useful to place one or more dictionary tables in the microprocessor, in order to speed up the grammatical analysis of a document.

EP05742860A 2004-03-23 2005-03-18 Verfahren zum auffinden von daten, suchmaschine und mikroprozessor dafür Withdrawn EP1733324A1 (de)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
FR0402939A FR2868178B1 (fr) 2004-03-23 2004-03-23 Moteur de recherche pour les documents texte stockes dans les microordinateurs
FR0409271A FR2874719B1 (fr) 2004-09-02 2004-09-02 Procede de recherche et d'affichage de la recherche parmi les documents texte stockes dans les ordinateurs
FR0502604A FR2870023B1 (fr) 2004-03-23 2005-03-16 Procede de recherche d'informations, moteur de recherche et microprocesseur pour la mise en oeuvre du procede
PCT/FR2005/000659 WO2005101240A1 (fr) 2004-03-23 2005-03-18 Procede de recherche d'informations, moteur de recherche et microprocesseur pour la mise en oeuvre de ce procede

Publications (1)

Publication Number Publication Date
EP1733324A1 true EP1733324A1 (de) 2006-12-20

Family

ID=35456166

Family Applications (1)

Application Number Title Priority Date Filing Date
EP05742860A Withdrawn EP1733324A1 (de) 2004-03-23 2005-03-18 Verfahren zum auffinden von daten, suchmaschine und mikroprozessor dafür

Country Status (4)

Country Link
US (1) US20070179932A1 (de)
EP (1) EP1733324A1 (de)
FR (1) FR2870023B1 (de)
WO (1) WO2005101240A1 (de)

Also Published As

Publication number Publication date
WO2005101240A1 (fr) 2005-10-27
FR2870023A1 (fr) 2005-11-11
FR2870023B1 (fr) 2007-02-23
US20070179932A1 (en) 2007-08-02

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20061020

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20080603

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20081001