CN1151558A - Information searching method and system - Google Patents

Information searching method and system

Info

Publication number
CN1151558A
Authority
CN
China
Prior art keywords
character string
document
valid matching
search
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 95118142
Other languages
Chinese (zh)
Inventor
久保田理惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN1151558A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a character string retrieval method whose fuzzy retrieval returns documents in a way that matches human intuition. A system is provided that retrieves, at high speed, documents containing a character string whose arrangement of characters is similar to that of a designated character string, using an index file that holds in-document position information for character string patterns. In the system, the character string to be retrieved and a retrieval precision (greater than zero and not exceeding one) are designated, and the documents containing a "similar character string" whose "degree of similarity" to the character string to be retrieved is at least the designated retrieval precision can be identified, together with the in-document position of that "similar character string".

Description

Information retrieval method and system
The present invention relates to a system and method for searching, at high speed and with a desired degree of permitted ambiguity, a large number of documents stored, for example, as text files on a disk.
There has long been a demand for high-speed searching of large bodies of literature written in natural language, such as news reports, patent publications, and scientific and technical documents stored on magnetic disks, and various search methods have been proposed. These methods fall roughly into the following classes.
(a) Keyword search method
In this method, an index of keywords indicating the content of each document is created in advance. The keywords may be determined automatically, for example by morphological analysis, entered manually, or obtained by a combination of the two. The method has two drawbacks: only character strings registered as keywords in the index can be searched, and the accuracy of automatic keyword extraction by morphological analysis depends on the accuracy of the word dictionary and grammar, so that preparing the dictionary requires much labor.
(b) Full-text search method without an index
In this method no index is used; each time a character string to be retrieved is designated, the full text of the documents is scanned. Special hardware is sometimes used to increase the retrieval speed. However, systems that rely on special hardware cost more, constrain the customer's operating environment, and can be used only on certain models.
(c) Full-text search method using an index
The present invention belongs to this class. It aims at high-speed full-text retrieval by means of an index, and several techniques of this kind are known, as described below.
Japanese patent application laid-open No. 4-205560 discloses dividing the text to be searched into search units, assigning identification symbols in ascending order to the search units, adding to each search unit an attribute symbol indicating its logical division, and adding to each character a character position order symbol indicating its position within the search unit. Character position information consisting of the search unit identification symbol, the character position order symbol, and the attribute symbol is then created and stored in an area for each character to form a search file.
Japanese patent laid-open publication No. 4-215181 discloses that, in order to reduce the number of character string comparisons during search processing and to enable high-speed comparison on a general-purpose information processing device, character group position information, which indicates where the character groups constituting the search target character string occur in that string, is organized into a search file grouped by character group type.
However, it is often necessary to search not only for strings that completely match the search string but also for documents containing a partially matching string. For example, the user may remember the search string only vaguely, or the search string may have so many variants that listing all of them is difficult.
The typical prior art method of specifying a partially matching character string is to use a regular expression. With this method one can specify, for example, an arbitrary character repeated zero or more times, an arbitrary character repeated one or more times, the end or the head of a line, and an arbitrary character within a given range of character codes.
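The regular expression features listed above can be illustrated with a minimal sketch using Python's standard `re` module. The sample strings and the helper name `grep` are hypothetical and serve only to illustrate the prior art technique.

```python
import re

def grep(pattern, lines):
    """Return the lines that match the regular expression `pattern`."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

lines = ["character string", "string search", "char string", "chart"]

# ^ : head of line, .* : any character repeated zero or more times, $ : end of line
anchored = grep(r"^char.*string$", lines)

# .+ : any character repeated one or more times after "char"
one_or_more = grep(r"^char.+$", lines)

# [a-c] : any single character within a range of character codes
in_range = grep(r"^[a-c]", lines)
```

Such patterns describe which variants are allowed, but they give no way of stating how close a match has to be.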
Further, Japanese patent application laid-open No. 63-99830 discloses a system with a function for partially matching search string data against the string data being searched, provided with a data table storing data that indicates which words are of the same kind as the search string data and a data table indicating in which of the searched string data the search string data appears.
Japanese patent laid-open publication No. 62-221027 discloses that, when a character string cut out from the beginning of the target character string cannot be found in the dictionary, the next string is cut out with its length increased by 1 and searched for, so that the number of futile searches is reduced and valid words can be read out at high speed.
Japanese patent laid-open publications Nos. Hei 4-326164 and Hei 5-174067 disclose a database search system having a means for storing autocorrelation information for each item to be searched, and a search means for obtaining, for each item, the degree of coincidence between the autocorrelation information of the search key and that of the item, and outputting the item numbers in descending order of the degree of coincidence.
However, with the conventional character string search methods described above, it is difficult to specify the degree of ambiguity of the character string to be searched, and the search results contain many character strings that the user does not want or that make no sense.
An object of the present invention is to provide a character string search method in which the ambiguity of the character string to be searched can be specified arbitrarily.
Another object of the present invention is to provide an index structure for realizing a character string search method in which the ambiguity of the character string to be searched can be specified arbitrarily.
Still another object of the present invention is to provide a fuzzy character string retrieval technique whose results are close to human intuition.
According to the present invention, in order to perform a full-text search on a database composed of a plurality of documents, a unique number (or symbol) is assigned to each document, and every run of N consecutive characters in each document is stored in an index file together with the number of the document in which it occurs and its position in that document. The index file preferably consists of two files, a glyph file and a position information file, where a glyph is a pattern of N consecutive characters. The glyph file stores each glyph or separator together with the location, within the position information file, of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers themselves.
According to the present invention, there is provided a method that uses this index file to search at high speed for documents containing character strings whose character arrangement is similar to that of a specified character string. In this method, a character string to be retrieved and a retrieval accuracy (greater than 0 and at most 1) are specified, and the documents containing "similar character strings" whose "similarity" to the character string to be retrieved is at least the specified retrieval accuracy are determined, together with the positions of those "similar character strings" in the documents.
In particular, this method selects from the documents strings "similar to the string to be retrieved" and quantifies the "similarity" from two viewpoints: how many characters match consecutively, and how many extraneous characters are inserted between them.
The maximum value of the "similarity" is 1. A similarity of 1 means that the character strings are completely identical, and completely identical character strings always have a similarity of 1. When characters that are not in the character string to be retrieved are inserted in the similar character string, or when only part of the character string to be retrieved appears in it, the "similarity" is less than 1, and the present invention aims to make such "similarity" values agree well with human intuition.
Since the index file allows any N consecutive characters in the documents to be located at high speed, looking up the successive N-character sequences of the character string to be retrieved in the index file makes it possible to detect at high speed which characters match consecutively and how many extraneous characters are inserted between them.
Fig. 1 is a block diagram showing a hardware configuration.
Fig. 2 is a block diagram of a processing unit.
Fig. 3 is a structural diagram of an index file.
Fig. 4 is a flowchart showing the index file creation process.
Fig. 5 is a flowchart of a character string retrieval process using an index file.
Fig. 6 is a flowchart of the fuzzy search processing using the index.
Embodiments of the present invention are described below with reference to the drawings.
A. Hardware configuration
FIG. 1 is a simplified diagram of a system for implementing the present invention. It is a general configuration in which the following are connected to a bus 101: a central processing unit (CPU) 102 having arithmetic and input/output control functions; a main memory (RAM) 104 into which programs are loaded and which provides a work area for the CPU 102; a keyboard 106 for entering commands, character strings, and the like; a hard disk 108 storing the operating system that controls the CPU, the database files, the search tool, the index file, and the like; a display 110 for displaying database search results; and a mouse 112 for designating an arbitrary position on the screen of the display 110 and transmitting that position information to the CPU.
The operating system is preferably one that supports a GUI multi-window environment as standard, such as Windows (trademark of Microsoft Corporation), OS/2 (trademark of IBM), or the X Window System (trademark of MIT) on AIX (trademark of IBM), but the present invention can also be implemented in a character-based environment such as PC-DOS (trademark of IBM) or MS-DOS (registered trademark of Microsoft Corporation) and is not limited to a specific operating system environment.
Although fig. 1 shows a stand-alone system, database files generally require a large-capacity disk drive. The present invention may therefore be implemented as a client-server system in which the database files and the search tool are placed on a server and clients are connected to the server via a local area network such as Ethernet or Token Ring, with only the display control section for viewing search results placed on the client side.
B. System configuration
Next, the system configuration of the present invention will be described with reference to the block diagram of fig. 2. Note that the units represented by the blocks in fig. 2 are stored on the hard disk 108 of fig. 1 as data files or program files, either individually or together.
The database 202 is assumed to hold a plurality of documents, as in a news report database or a patent publication database. Note, however, that the present invention is not limited to databases composed of a plurality of documents and can also be applied to searching within a single document. The content of each document is stored in searchable form, for example as a text file. In addition, a unique document number is attached to each document. The document number is preferably a serial number ascending from 1, but for a patent publication database an inherently unique number such as the application number or publication number may be used. Documents may also be identified by symbols such as "ABC" and "XYZ" instead of serial numbers. However, since such identification symbols usually require more bytes than a number, it is preferable in practice to identify documents by serial number.
Since directly searching the enormous content stored in the database 202, such as news reports and patent publications, takes a long time, the content stored in the database 202 is organized in advance into the index file 204 by the index generation/update module 206. In the embodiment described later, the index file 204 consists of two files, a glyph file and a position information file. The glyph file stores each glyph or separator together with the location, within the position information file, of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers.
The database 202 may manage each document as a separate file or may arrange all documents sequentially in a single continuous file; what is essential is that a unique number is attached to each document and that the content of each document can be accessed by that number. In the former case, the database 202 maintains a table that associates the unique number of each document with the name of the file in which the document is stored; in the latter case, it maintains a table that associates the unique number of each document with the document's offset and size within the single database file. The search tool 208 takes the search string from the search string input module 210 as input, searches the index file 204, and returns the numbers of the documents containing the search string (there may be more than one) and the positions at which the search string occurs in each document (again possibly more than one). The search string input module 210 is preferably implemented as a dialog box in a multi-window environment, in which the desired search string is typed into an input box using the keyboard 106.
In addition, as a feature of the present invention, the search string input module 210 accepts the similarity for fuzzy search as a value from 0 to 1 (or as a percentage from 0 to 100). To this end, the search string input module 210 displays a slider or scroll bar with a pointer that can indicate any position between 0 and 1. The pointer initially indicates a system default, for example 1, and can be dragged with the mouse 112 to indicate another value.
The result display module 212 accesses the database 202 using the document numbers returned by the search tool 208 and the positions at which the search string appears in each document, and displays the line of the document corresponding to each position in a separate search result window as appropriate. When the search results do not fit on one screen of the window, a scroll bar is displayed, and the user can move it to view the results in turn.
C. Structure of the index file and method of creating it
In the present invention, every run of N consecutive characters in a document, its position in the document, and the division information of the document are compiled into the index. Typical division information includes punctuation marks that delimit sentences and broader document delimiters such as "Chapter 1" and "Abstract".
C1. Normalization of character strings
The first step in building the index is to normalize the character strings, as follows. A document to be searched, particularly a Japanese text document, often mixes half-width and full-width characters. Therefore, half-width characters are replaced with the corresponding full-width characters.
C2. Extraction of glyph information
The next step in building the index is to cut out, starting at each character of the normalized character string in turn, the N consecutive characters beginning there (hereinafter referred to as a glyph) and to store each glyph in the index file together with the document number and the in-document position number. Here N is an integer equal to or greater than 1; N = 2 is preferable for Japanese.
The in-document position number is a serial number, unique within the document, assigned to every searchable character in the document. The in-document position number of a glyph is that of its first character. Near the end of the document, where fewer than N characters remain, a predetermined padding character such as X'00' is appended to bring the glyph up to N characters.
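The normalization and glyph extraction steps can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function names `normalize` and `extract_glyphs` are hypothetical, and using Unicode NFKC folding for runs of half-width katakana is an assumption standing in for whatever conversion table the embodiment uses.

```python
import re
import unicodedata

PAD = "\x00"  # padding character X'00' mentioned in the text

def normalize(text: str) -> str:
    """Replace half-width characters with their full-width counterparts."""
    # Runs of half-width katakana and punctuation (U+FF61..U+FF9F): NFKC maps
    # them to full-width forms and composes voiced-sound marks (assumption).
    text = re.sub(r"[\uff61-\uff9f]+",
                  lambda m: unicodedata.normalize("NFKC", m.group()), text)
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # full-width (ideographic) space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # printable ASCII -> full-width form
        else:
            out.append(ch)
    return "".join(out)

def extract_glyphs(text: str, n: int = 2):
    """Return (glyph, in-document position) pairs; positions start at 1."""
    padded = text + PAD * (n - 1)           # pad so the last glyph has n characters
    return [(padded[i:i + n], i + 1) for i in range(len(text))]

text = normalize("本日は晴天なり。ただいま、マイクのテスト中。")
glyphs = extract_glyphs(text, n=2)
```

Every character starts exactly one glyph, so a document of 22 characters yields 22 glyphs, the last of which contains one padding character.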
In addition, in the present embodiment, each document is divided into blocks in a way that is meaningful for retrieval, and the division information is stored in the index file. The division information is stored in the same form as the glyphs described above. That is, in place of a glyph taken from the normalized character string, a specific separator is stored together with the document number and the in-document position number of the character at the block boundary.
Since several kinds of separators can be defined, several different divisions can coexist. However, the separators must be chosen so that they never coincide with a glyph taken from the normalized character string. In the present embodiment, normalization converts 1-byte codes into 2-byte codes, so that 2 bytes are treated as one word, and no ordinary character code corresponds to a word value of 255 or less. Word values from 0 to 255 can therefore be assigned freely to the different kinds of separators.
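As a rough illustration of storing division information in the same form as glyphs, the sketch below emits separator entries for two kinds of division and derives a block number from the stored boundary positions. The code values `SEP1` and `SEP2`, the choice of punctuation marks, and the function names are assumptions made for this example only.

```python
import bisect

SEP1 = "\x01"  # hypothetical code for "separator 1" (end of sentence)
SEP2 = "\x02"  # hypothetical code for "separator 2" (end of sentence or clause)

def separator_entries(text, doc_no):
    """Return (separator, document number, position) entries, like glyph entries."""
    entries = []
    for pos, ch in enumerate(text, start=1):
        if ch == "。":
            entries.append((SEP1, doc_no, pos))
        if ch in "。、":
            entries.append((SEP2, doc_no, pos))
    return entries

def block_of(position, boundaries):
    """1-based block number of a position, given sorted boundary positions."""
    # Count the boundaries that lie strictly before the position.
    return bisect.bisect_left(boundaries, position) + 1

text = "本日は晴天なり。ただいま、マイクのテスト中。"
entries = separator_entries(text, doc_no=1)
sep1_positions = [pos for sep, _, pos in entries if sep == SEP1]
sep2_positions = [pos for sep, _, pos in entries if sep == SEP2]
```

Because both separator codes are below 256, they cannot collide with a glyph built from normalized 2-byte characters, which is the property the text relies on.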
Storing the division information in the same form as the glyphs has the following advantages.
Generating and updating the index is simple, because no additional processing of the division information is required.
The capacity of the index does not increase significantly.
For example, the increase in capacity is very small compared with a scheme in which the number of the enclosing block is attached to every in-document position number.
C3. Specific example of in-document position numbers
For example, suppose that a document consisting of the sentences "本日は晴天なり。ただいま、マイクのテスト中。" ("It is fine today. Now testing the microphone.") is stored in the database 202 (fig. 2). Assigning an in-document position number to each character of the normalized character string gives the following.
In-document position number: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Normalized character string: 本 日 は 晴 天 な り 。 た だ い ま 、 マ イ ク の テ ス ト 中 。
Division 1 (separator 1): block boundaries at positions 8 and 22
Division 2 (separator 2): block boundaries at positions 8, 13, and 22
Let the document number of this document be 1 and the number of characters N of a glyph be 2. The document number and in-document position number are then assigned to each glyph (of length 2) and each separator as follows.
Glyph          Document number   In-document position number
本日           1                 1
日は           1                 2
は晴           1                 3
晴天           1                 4
天な           1                 5
なり           1                 6
り。           1                 7
。た           1                 8
separator 1    1                 8
separator 2    1                 8
ただ           1                 9
だい           1                 10
いま           1                 11
ま、           1                 12
、マ           1                 13
separator 2    1                 13
マイ           1                 14
イク           1                 15
クの           1                 16
のテ           1                 17
テス           1                 18
スト           1                 19
ト中           1                 20
中。           1                 21
。(padding)    1                 22
separator 1    1                 22
separator 2    1                 22
C4. Role of the in-document division information
The value of the in-document division information (partitions) for searching is described below.
Retrieval targeting only specific blocks
For example, when a document consists of a title, an abstract, and a body, it is often desirable to search only a specific part such as the title or the abstract. Such a search can be realized by storing separators, with their position information, at the end of the title and the end of the abstract.
Retrieval of documents in which multiple strings are closely related
It is often desirable to search with attention to how physically close multiple strings are to one another.
For example, strings that occur in the same paragraph can be considered more closely related than strings that merely occur in the same document, and strings in the same sentence more closely related still. By placing separators at the ends of paragraphs and sentences and storing them with their position information, documents in which the strings occur in the same block can be retrieved, so that a search that takes this closeness into account can be performed.
C5. Structure of the index file
The glyphs and separators, together with their document numbers and in-document position numbers, must be stored in a form that can be read out efficiently at search time. For this reason, in the present embodiment, the index file consists of a glyph file (which mainly stores glyphs and separators) and a position information file (which mainly stores document numbers and in-document position numbers). The glyph file stores each glyph or separator together with the location, within the position information file, of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers.
An example of such a glyph file and the corresponding position information file is shown in fig. 3.
In fig. 3, the entries of the glyph file 302 are the glyphs of N consecutive characters (here, N = 2) occurring in all documents in the database 202. To enable binary search, the entries of the glyph file 302 are preferably sorted in ascending order of the character code values of the normalized glyphs. "Separator 1", "separator 2", "なり", "は", and so on are individual entries of the glyph file 302. For example, "separator 1" is a single 2-byte value that collectively stands for the symbols that delimit sentences.
The position information file 304 of fig. 3 stores, for each entry of the glyph file 302, at least one document number and, for each document number, at least one in-document position number.
To associate the entries of the glyph file 302 with those of the position information file 304, each entry of the glyph file 302 holds, although not shown in the figure, the offset of the corresponding entry from the head of the position information file 304 and the size of that entry. That is, in fig. 3, to look up "separator 2", for example, the position information file 304 is positioned at the offset stored with "separator 2" in the glyph file 302, and the number of bytes given by the entry size is read from that position. In this way the in-document position numbers 8, 13, 22, ... for document No. 1 associated with "separator 2", the in-document position numbers for document No. 2 (if any), and so on up to those for document No. n can be read together.
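The relationship between the two files can be sketched in Python as below. This is an illustrative in-memory model, not the embodiment's file format: the function names, the use of `bytes` objects in place of disk files, the little-endian byte order, and a record layout made of 4-byte unsigned fields (document number, count, then positions) are all assumptions.

```python
import bisect
import struct

def build_index(postings):
    """postings: glyph -> {document number: [positions]}.
    Returns (glyph_file, position_file)."""
    glyph_file = []               # sorted list of (glyph, offset, size)
    position_file = bytearray()   # concatenated position records
    for glyph in sorted(postings):
        offset = len(position_file)
        for doc_no in sorted(postings[glyph]):
            positions = postings[glyph][doc_no]
            position_file += struct.pack("<II", doc_no, len(positions))
            position_file += struct.pack(f"<{len(positions)}I", *positions)
        glyph_file.append((glyph, offset, len(position_file) - offset))
    return glyph_file, bytes(position_file)

def lookup(glyph_file, position_file, glyph):
    """Binary-search the glyph file, then read `size` bytes at `offset`."""
    keys = [entry[0] for entry in glyph_file]
    i = bisect.bisect_left(keys, glyph)
    if i == len(keys) or keys[i] != glyph:
        return {}
    _, offset, size = glyph_file[i]
    record = position_file[offset:offset + size]
    result, cursor = {}, 0
    while cursor < len(record):
        doc_no, count = struct.unpack_from("<II", record, cursor)
        cursor += 8
        result[doc_no] = list(struct.unpack_from(f"<{count}I", record, cursor))
        cursor += 4 * count
    return result

postings = {"SEP2": {1: [8, 13, 22], 2: [5]}, "なり": {1: [6]}}
glyph_file, position_file = build_index(postings)
```

A single read of `size` bytes returns the positions for every document that contains the glyph, which is why each glyph entry carries both an offset and a size.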
The in-document position numbers for document number i are stored, for example, in the form (document number i: 4 bytes) (number k of in-document position numbers: 4 bytes) (1st in-document position number: 4 bytes) ... (k-th in-document position number: 4 bytes). In this example each in-document position number is stored as an absolute position in 4 bytes, but if each is instead stored as the displacement from the preceding in-document position number, 1 to 3 bytes can be saved.
C6. Index file creation process
The process of creating an index file is described below with reference to fig. 4. This processing is performed by the index generation/update module 206 of fig. 2 when the database 202 is first created and whenever documents are added to or deleted from the database 202.
In fig. 4, a memory area is first secured in step 402. A work area of a prescribed size can be obtained in the RAM 104 by, for example, calling a function of the operating system.
At step 404, one document is read from the database 202 into the storage area obtained at step 402.
At step 406, the document read at step 404 is normalized.
At step 408, the normalized document is scanned to produce glyphs and separators, which are stored in the storage area obtained at step 402 together with the document number of the document and their in-document position numbers.
As glyphs, document numbers, and in-document position numbers are stored by step 408 in the storage area obtained at step 402, that area may become full. Therefore, step 410 checks whether the storage area is full. If it is, then at step 412 the glyphs and separators stored in the area, together with their document numbers and in-document position numbers, are sorted by, for example, code value, document number, and in-document position number, and written to the disk 108 (fig. 1) as an intermediate file. The storage area occupied by the data written out is thereby freed for subsequent processing, and the process proceeds to step 414.
If it is determined in step 410 that the storage area still has room, the process proceeds directly to step 414.
At step 414, it is determined whether any documents remain in the database 202 that have not yet been read at step 404. If so, processing returns to step 404.
If it is determined at step 414 that all the documents in the database 202 have been read, then at step 416 the glyphs and separators remaining in the storage area, together with their document numbers and in-document position numbers, are sorted by code value, document number, and in-document position number and written to the disk 108 (fig. 1) as an intermediate file.
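The buffer-and-flush loop of fig. 4 resembles the run-generation phase of an external sort. The following Python sketch models it under simplifying assumptions: in-memory lists stand in for the intermediate files on disk, the buffer limit is counted in entries and checked per entry, separators are omitted, and the names `build_sorted_runs` and `merge_runs` are hypothetical. The final merge uses `heapq.merge` from the standard library as one possible way of combining already sorted runs.

```python
import heapq

def build_sorted_runs(documents, buffer_limit, n=2):
    """documents: iterable of (document number, normalized text).
    Returns a list of sorted runs of (glyph, document number, position)."""
    runs, buffer = [], []
    for doc_no, text in documents:
        padded = text + "\x00" * (n - 1)
        for i in range(len(text)):
            buffer.append((padded[i:i + n], doc_no, i + 1))
            if len(buffer) >= buffer_limit:   # step 410: is the work area full?
                runs.append(sorted(buffer))   # step 412: sort and write out
                buffer = []                   # the work area is free again
    if buffer:
        runs.append(sorted(buffer))           # step 416: write the remainder
    return runs

def merge_runs(runs):
    """Merge sorted runs into glyph -> {document number: [positions]}."""
    index = {}
    for glyph, doc_no, pos in heapq.merge(*runs):
        index.setdefault(glyph, {}).setdefault(doc_no, []).append(pos)
    return index

runs = build_sorted_runs([(1, "abab"), (2, "ab")], buffer_limit=3)
index = merge_runs(runs)
```

Because each run is sorted by glyph, document number, and position, the merged stream arrives in the order needed to write one entry per glyph with its positions already ascending.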
As a result of the intermediate file writing in steps 412 and 416, a plurality of intermediate files are stored on the disk 108. Since each intermediate file is already sorted, the files are processed at step 418 by the well-known merge sort technique, and the glyph file 302 and the position information file 304 shown in fig. 3 are created from them and stored on the disk 108. Since the same glyph may appear in several intermediate files, duplicate entries for the same glyph are merged into one entry, and the document numbers and in-document position numbers associated with them are combined under the merged entry.
D. Retrieval process using the index file
An example of the search process using the index file thus created will be described below with reference to the flowchart of fig. 5. First, in step 502, a dialog box having an input box, for example, is displayed, an input process is prompted for the user, and a search string is input to the input box.
The user inputs a search string into the input box, clicks an OK button, and then performs a normalization process of the search string as necessary, and then performs a search process using the index file from a font of N characters from the top of the search string. Since the length of the N-character font is the same as the character string font length N of the index file, the index file can be searched at high speed by half-folding using, as a key, the N-character font of the partial character string of the search character string. An example of a suitable N value for the japanese document is to take N-2.
If it is determined in step 506 that the font of the first N characters of the search string cannot be found, the information that the search string cannot be found is appropriately displayed in the information box in step 508, and the process ends.
If it is determined in step 506 that the glyphs of the first N characters of the search string have been found, then at least one document number and at least one in-document position number of the document number are returned from the index file, and therefore, in step 510, the information is stored in a predetermined buffer on the main memory or the disk.
At step 512, it is determined whether all N-character partial strings of the search string have been searched for; if so, the process proceeds to step 520. If not, at step 514 a search is performed on the index file using the next N characters of the search string. Since the length of the search string is generally not a multiple of N, taking N characters at a time eventually leaves a partial string near the end of the search string that is shorter than N characters. In that case it is preferable to search using the last N characters of the search string, which partly overlap the N characters taken just before. When the search string itself is shorter than N characters, the binary search yields several candidate patterns, and the subsequent processing examines these candidates by sequential search.
In step 516, as in step 506, it is determined whether the pattern corresponding to the N characters of the search string is found in the index file. Step 516 differs from step 506, however, in what "found" means: the pattern must be found in the same document as the preceding N-character pattern, at an in-document position number equal to the position number of the preceding pattern plus N.
If it is judged at step 516 that the N-character pattern cannot be found, a message that the search string cannot be found is displayed in the information box at step 508, and the process ends.
If it is judged at step 516 that the N-character pattern is found, then at step 518 the document numbers returned by the index search, together with the in-document position numbers that are in the same document as, and continuous with, the preceding N-character pattern, are stored in a predetermined buffer in main memory or on disk for subsequent processing.
If it is determined in step 512 that the search string has been completely searched, the process proceeds to step 520, where the document number and the position of the search string are identified from the document number and the in-document position number stored in the buffer, and in step 522, the stored contents of the database 202 are accessed using the document number and the in-document position number, and the line of the document in which the search string is present is displayed in another window as appropriate.
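The exact search of steps 504 to 520 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the index is held in memory as a sorted list rather than a file, positions are 1-based, search strings shorter than N characters are not handled, and the names `build_index`, `lookup` and `exact_search` are hypothetical.

```python
from bisect import bisect_left


def build_index(docs, n=2):
    """Return a sorted list of (pattern, doc_no, pos) entries; doc_no and pos are 1-based."""
    entries = []
    for doc_no, text in enumerate(docs, start=1):
        for pos in range(len(text) - n + 1):
            entries.append((text[pos:pos + n], doc_no, pos + 1))
    entries.sort()
    return entries


def lookup(index, pattern):
    """Binary-search the sorted index and return every (doc_no, pos) of the pattern."""
    i = bisect_left(index, (pattern,))
    hits = []
    while i < len(index) and index[i][0] == pattern:
        hits.append(index[i][1:])
        i += 1
    return hits


def exact_search(index, query, n=2):
    """Return sorted (doc_no, pos) pairs at which the whole query occurs."""
    if len(query) < n:
        raise ValueError("this sketch only handles queries of at least n characters")
    # Offsets 0, n, 2n, ...; the last pattern overlaps the previous one
    # when the query length is not a multiple of n.
    offsets = list(range(0, len(query) - n + 1, n))
    if offsets[-1] != len(query) - n:
        offsets.append(len(query) - n)
    candidates = set(lookup(index, query[:n]))
    for off in offsets[1:]:
        hits = set(lookup(index, query[off:off + n]))
        # Keep only candidates whose following pattern sits at the expected position.
        candidates = {(d, p) for (d, p) in candidates if (d, p + off) in hits}
        if not candidates:
            break
    return sorted(candidates)
```

For instance, with the two toy documents "abcabd" and "xxabd" and N = 2, searching for "abd" looks up "ab" and then the overlapping pattern "bd", and keeps only the occurrences of "ab" that are followed by "bd" one position later.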
To check whether the search string appears in a specific block of a document (for example, block 3), the block delimiter position that precedes the position at which the search string appears is found, the block containing the search string is determined from it, and the number of that block is compared with the number of the specified block.
E. Fuzzy search processing
Whereas the processing using the index file shown in fig. 5 is an exact search, the present invention also includes processing, called fuzzy search, that uses the index file to search each document of the database at high speed for character strings whose arrangement of characters is similar to that of a specified character string. In particular, a character string to be retrieved and a retrieval accuracy (greater than 0 and not exceeding 1) can be specified, and the documents containing "similar character strings" whose "similarity" to the character string to be retrieved exceeds the specified retrieval accuracy, together with the positions of those "similar character strings" within the documents, can be determined.
E1. Similarity of character strings as perceived by people
To a person who knows Japanese, Japanese character strings that are similar in arrangement and meaning include the following cases.
(1) Variations in katakana notation
Small versus full-size kana: "ソフトウェア" "ソフトウエア" (software)
With or without the long vowel mark "ー": "コンパイラー" "コンパイラ" (compiler)
With or without the middle dot "・": "アイビーエム" "アイ・ビー・エム" (IBM)
Other: "ビルデイング" "ビルヂング" (building)
(2) Particles or the like inserted between kanji compound words
"call out at home" or "call out at the position of home まま"
"political re-compilation" and "political re-compilation"
(3) Compound words of kanji phrases, and compound words lacking a part
"national museum", "national museum" and "national museum"
(4) Characters missing due to abbreviation
"development of ソフトウエア", "development of ソフト" (development of software)
(5) Misspelling of foreign words
"カリフオルニア" "カリフオリニア" (California)
The common feature of the above is that the characters are substantially continuous and consistent, but there are either missing or extra characters.
Considering a few phrases from the viewpoint of whether their arrangement is similar, strings such as "ソフト s メ - カ -", "ソフト is developed as メ - カ -" and "ソフト s メ - カ -" are perceived as similar to "ソフトメーカー", and strings such as "political fund regulation" and "political fund" are perceived as similar to "political fund regulation law".
On the other hand, even though the same characters appear in it, it feels unnatural to call a phrase meaning "a manufacturer (メーカー) whose main business is soft-serve ice cream (ソフトクリーム) machines" a character string similar to "ソフトメーカー".
Whether a person perceives character strings as similar can be summarized as follows: (A) the more characters match consecutively, the more similar the strings feel; (B) the more non-matching characters are sandwiched in between, the less similar they feel; and (C) the more non-matching characters are sandwiched in between, the less they feel like a single character string.
In addition, the special case in which part of the input character string appears repeatedly at nearby positions in the document must be considered. For example, suppose the input character string is "the science division length に ren" and the document contains the same string with the character for "section" repeated. Although one of the repeated "section" characters is superfluous, it is reasonable to regard it as closer to a matching character than a completely unrelated extra word such as "peripheral".
E2. Index file structure and matching
The index file structure shown in fig. 3 is an index in which a document number and an in-document position number are attached to each N-character pattern. Search is performed one pattern at a time, and document numbers and in-document position numbers are returned. When searching for a character string shorter than N characters, however, every pattern that begins with that character string must be examined, and the number of such patterns may be considerable. The search load for an input of fewer than N characters is therefore larger than that for an input character string of N or more characters, for which the number of index lookups is at most the number of characters in the input character string.
It is therefore considered appropriate to discard matches shorter than N characters and to determine similar character strings from the portions in which N or more consecutive characters match.
E3. Rules for determining similar character strings and similarity
In outline, the rule is as follows: from the character strings in the document that match the input character string over M or more consecutive characters, those that appear in the same order as in the input character string and lie relatively close to one another are collected into a "similar character string", and the similarity is calculated from the number of matching and non-matching characters.
First, terms used in the description are defined.
Matching character string:
a part of the character string to be searched that matches the text of the document over M or more consecutive characters. Among the matches starting at the same character, the longest is selected.
(example) string to be retrieved: political capital regulation act
Original documents of the literature: …, Do you Fu Zi カで … of all kinds of diseases of lower energizer regulation
Let M be 2. Then "fund regulation" is a matching character string. "fund" and "fund rule" are not matching character strings, because the longest match must be selected, and "law" is not a matching character string because it is shorter than 2 characters.
Valid matching character string:
a matching character string that forms part of a "similar character string".
Maximum non-matching string length L:
the number of consecutive non-matching characters contained in a "similar character string" is at most L. L is a constant of 1 or more.
The following describes how a "similar character string" is selected and how "similarity" is quantified.
(1) Determination of the 1st valid matching character string
The 1st matching character string in document order is taken as the 1st valid matching character string.
Here,
the start position of the ith valid matching character string in the document is denoted s(D, i);
the end position of the ith valid matching character string in the document is denoted e(D, i);
the start position of the ith valid matching character string in the character string to be searched is denoted s(C, i);
the end position of the ith valid matching character string in the character string to be searched is denoted e(C, i).
(2) Determination of the next valid matching character string
Once the ith valid matching character string has been determined, the (i+1)th valid matching character string is determined as follows.
The first matching character string that satisfies both of the following conditions a) and b) is taken as the (i+1)th valid matching character string.
a) e(D, i) + 1 ≤ s(D, i+1) ≤ e(D, i) + L + 1
This means that up to L extra characters are allowed between the ith and the (i+1)th valid matching character strings (see example 3 below).
b) s(C, i+1) > e(C, i) − (M − 1)
This is repeated until no further valid matching character string satisfying the conditions can be selected.
(3) Determination of the "similar character string" and its "similarity"
When no further valid matching character string can be selected, the character string from the first character of the 1st valid matching character string to the last character of the last valid matching character string is taken as the "similar character string", and the "similarity" is calculated as follows.
Similarity = the smaller of
(number of characters in the character string to be searched that belong to valid matching character strings) / (number of characters in the character string to be searched)
and
(number of characters in the "similar character string" that belong to valid matching character strings) / (number of characters in the "similar character string").
E4. Method of counting the "number of characters belonging to valid matching character strings"
When two characters in the document correspond to the same character of the character string to be searched, the first is counted as 1 and the second as 0.5. In all other cases each character is counted as 1 (see example 4 below).
E5. Order in which "similar character strings" are determined
The 1st "similar character string" is determined by comparison starting from the beginning of the document. Once the ith "similar character string" has been determined, the first character after its first character that does not belong to a valid matching character string is located, and comparison resumes from that character to find the (i+1)th "similar character string".
By assigning appropriate values to the constants L and M, a "similarity" can be calculated that agrees reasonably well with a person's general judgment of whether the arrangement of characters is similar.
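The similarity calculation described above, including the rule that a repeated character counts 0.5, can be sketched as follows. This is only an illustration under assumptions: each valid matching character string is passed in as a hypothetical tuple (s_c, e_c, s_d, e_d) of 1-based inclusive positions, listed in document order, and the function name `similarity_v1` is not from the source.

```python
def similarity_v1(query_len, matches):
    """Similarity = min(covered share of the search string, covered share of the similar string).

    matches: list of (s_c, e_c, s_d, e_d), 1-based inclusive, in document order.
    """
    covered_query = set()   # search-string positions covered by some valid match
    doc_count = 0.0         # weighted count of matching characters in the document
    for s_c, e_c, _s_d, _e_d in matches:
        for c in range(s_c, e_c + 1):
            # A document character that repeats an already covered
            # search-string character counts 0.5, otherwise 1.
            doc_count += 0.5 if c in covered_query else 1.0
            covered_query.add(c)
    # The similar string runs from the start of the first valid match
    # to the end of the last valid match in the document.
    similar_len = matches[-1][3] - matches[0][2] + 1
    return min(len(covered_query) / query_len, doc_count / similar_len)
```

With the positions given in the examples of this description, three matches covering a 6-character search string inside an 8-character similar string give min(6/6, 6/8) = 0.75, and two overlapping matches (1, 2, 2, 3) and (2, 3, 4, 5) for a 3-character search string give min(3/3, 3.5/4) = 0.875.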
In addition, a "similarity" of 1, the highest value, means that the character strings match completely, and whenever the character strings match completely the "similarity" is 1.
E6. Flow chart of fuzzy search
The above processing is represented by a flowchart, and is shown in fig. 6. In fig. 6, first, at step 602, a search string is prompted for input. In step 604, a prompt is provided to input a similarity of 0 to 1. The input of the character string and the numerical value in the steps 602 and 604 is typically performed on a dialog box using an input box and a scroll bar.
At step 606, the number i of the valid matching character string is set to 1, and at step 608 a valid matching character string is searched for. Given the condition that a valid matching character string has length M or more, it is advantageous to create the index file from M-character patterns in the processing of fig. 4, because with such an index file prepared in advance, any M-character pattern can be looked up at high speed by binary search of the index file. The search start position is then shifted by one character and the next M-character pattern is looked up in the index file. If the document number returned is the same as for the previous M-character pattern and the in-document position numbers are consecutive, a valid matching character string of length M + 1 has been obtained. In the same way, each time the document number is the same as for the previous M-character pattern and the in-document position numbers are consecutive, 1 is added to the length of the valid matching character string. When the M-character pattern is not found in the index file, the returned document numbers do not match, or the in-document position numbers are not consecutive, the end of the valid matching character string has been reached.
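The way a match is lengthened one character at a time through successive M-character lookups can be sketched as below. This is an assumption-laden illustration: a Python dictionary stands in for the binary-searched index file, positions are 1-based, and `build_gram_index` and `match_length` are hypothetical names.

```python
from collections import defaultdict


def build_gram_index(docs, m=2):
    """Map each m-character pattern to the set of (doc_no, pos) where it occurs (1-based)."""
    index = defaultdict(set)
    for doc_no, text in enumerate(docs, start=1):
        for pos in range(len(text) - m + 1):
            index[text[pos:pos + m]].add((doc_no, pos + 1))
    return index


def match_length(index, query, doc_no, pos, m=2):
    """Length of the match between the head of query and document doc_no at pos.

    Returns 0 when fewer than m characters match.
    """
    if (doc_no, pos) not in index.get(query[:m], ()):
        return 0
    length = m
    shift = 1
    # Shift the pattern by one character; the match grows by 1 each time the
    # shifted pattern is found in the same document at the next position.
    while (shift + m <= len(query)
           and (doc_no, pos + shift) in index.get(query[shift:shift + m], ())):
        length += 1
        shift += 1
    return length
```

For a toy document "xxabcdyy" and the query "abcdz", the lookups for "ab", "bc" and "cd" succeed at consecutive positions 3, 4 and 5, while "dz" is absent from the index, so the match ends with length 4.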
In some cases no valid matching character string is found at all; the process then proceeds from the determination at step 610 to step 626 and ends.
If it is determined at step 610 that a valid matching character string has been found, the process proceeds to step 612, where the valid matching character string is marked from s(D, i) to e(D, i) in the document and from s(C, i) to e(C, i) in the search string.
At step 614, the index file is used to search for an (i+1)th valid matching character string that satisfies the conditions
a) e(D, i) + 1 ≤ s(D, i+1) ≤ e(D, i) + L + 1
and
b) s(C, i+1) > e(C, i) − (M − 1).
If the (i+1)th valid matching character string is found, i is incremented at step 618 so that it refers to the next valid matching character string, and the process returns to step 612, where the new valid matching character string is marked from s(D, i+1) to e(D, i+1) in the document and from s(C, i+1) to e(C, i+1) in the search string.
On the other hand, if no further valid matching character string is found at step 616, the similarity is calculated at step 620. As described above, it is calculated, for example, by the following formula:
Similarity = the smaller of
(number of characters in the search string that belong to valid matching character strings) / (number of characters in the search string)
and
(number of characters in the "similar character string" that belong to valid matching character strings) / (number of characters in the "similar character string").
Here the "similar character string" is the character string from the beginning of the first valid matching character string to the end of the last valid matching character string.
In step 622, the result is selected based on the similarity calculated in step 620 and the similarity input in step 604, and only when the result is greater than the similarity input in step 604, the result is displayed in step 624.
The processing operation at step 624 is to access the contents of the document stored in the database based on the document number and the in-document location number returned from the index file search results at steps 608 and 614, and to display the line on which the location is located.
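The loop of steps 606 to 618 can be sketched as follows. This is a simplified illustration, not the flowchart's actual implementation: it compares the strings directly instead of consulting the index file, handles a single document, and uses hypothetical names (`select_valid_matches`, `min_len` for M, `max_gap` for L). Results are 1-based inclusive tuples (s_c, e_c, s_d, e_d).

```python
def common_prefix_len(a, i, b, j):
    """Number of equal characters of a[i:] and b[j:] counted from the front."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def best_match_at(query, doc, d, c_min, min_len):
    """Longest match of at least min_len characters starting at doc[d] and query[c], c >= c_min."""
    best = None
    for c in range(c_min, len(query)):
        n = common_prefix_len(query, c, doc, d)
        if n >= min_len and (best is None or n > best[1]):
            best = (c, n)
    return best


def select_valid_matches(query, doc, min_len=2, max_gap=3, start=0):
    """Greedily collect the valid matching strings of one similar string."""
    matches = []
    d_range = range(start, len(doc))   # 0-based document start positions to try
    c_min = 0                          # 0-based lowest allowed query start
    while True:
        found = None
        for d in d_range:
            hit = best_match_at(query, doc, d, c_min, min_len)
            if hit:
                found = (d, hit)
                break
        if found is None:
            return matches
        d, (c, n) = found
        matches.append((c + 1, c + n, d + 1, d + n))
        e_c, e_d = c + n, d + n        # 1-based end positions
        # Condition a): e_d + 1 <= next document start <= e_d + max_gap + 1 (1-based).
        d_range = range(e_d, min(e_d + max_gap + 1, len(doc)))
        # Condition b): next query start > e_c - (min_len - 1) (1-based).
        c_min = max(e_c - (min_len - 1), 0)
```

Applied to the search string "アイビーエム" and the document "アイ・ビー・エム" with M = 2 and L = 3, this sketch selects the three matches "アイ", "ビー" and "エム" at document positions 1 to 2, 4 to 5 and 7 to 8.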
In addition, a "similar character string" for the search string may be found in several documents, or in several places within one document. It should therefore be noted that steps 606 to 622 are applied to each of these "similar character strings", and only those that satisfy the similarity condition are selected for display in step 624.
E7. Examples of determining "similar character strings" and similarity
In the examples shown, M is 2 and L is 3.
(Example 1)
Positions: 1 2 3 4 5 6
Character string to be retrieved C: アイビーエム (アイビーエム is a trademark of IBM Corporation)
Positions: 1 2 3 4 5 6 7 8 …
Document D: アイ・ビー・エム…
The longest matching character string at the beginning is "アイ", so the 1st valid matching character string is "アイ":
s(C, 1) = 1, e(C, 1) = 2
s(D, 1) = 1, e(D, 1) = 2
Since e(C, 1) − (M − 1) = 1, character strings starting at or after the 2nd character of the character string to be retrieved are compared with character strings starting at position 3, 4, 5 or 6 of the document (since e(D, 1) + 1 = 3 and e(D, 1) + L + 1 = 6) to find the 2nd valid matching character string.
The 2nd valid matching character string is "ビー":
s(C, 2) = 3, e(C, 2) = 4
s(D, 2) = 4, e(D, 2) = 5
Since e(C, 2) − (M − 1) = 3, character strings starting at or after the 4th character of the character string to be retrieved are compared with character strings starting at position 6, 7, 8 or 9 of the document (since e(D, 2) + 1 = 6 and e(D, 2) + L + 1 = 9) to find the 3rd valid matching character string.
The 3rd valid matching character string is "エム":
s(C, 3) = 5, e(C, 3) = 6
s(D, 3) = 7, e(D, 3) = 8
The 3rd valid matching character string is the last one, because the end of the character string to be retrieved has been reached.
アイビーエム
1 2 3
アイ・ビー・エム…
1 2 3
The numbers are the numbers of the valid matching character strings. Thus the "similar character string" is "アイ・ビー・エム", from s(D, 1) to e(D, 3).
Similarity = min(6/6, 6/8) = 6/8 = 0.75
(Example 2)
Positions: 1 2 3 4 5 6 7 8 9 10
Character string to be retrieved C: ソフトウエアメーカー
Positions: 1 2 3 4 5 6 7 8 9 …
Document D: ソフト developed メーカー…
ソフトウエアメーカー
1 2
ソフト developed メーカー…
1 2
"Similar character string" = "ソフト developed メーカー"
Similarity = min(7/10, 7/9) = 0.7
(Example 3)
Positions: 1 2 3 4
Character string to be retrieved C: home complaints
Positions: 1 2 3 4 5 6 7 8 9 …
Document D: prosecution にふみきつた at home periphery まま で.
Since the longest matching character string at the head is "at home", the 1st valid matching character string is "at home":
s(C, 1) = 1, e(C, 1) = 2
s(D, 1) = 1, e(D, 1) = 2
Character strings starting at or after the 2nd character of the character string to be retrieved (since e(C, 1) − (M − 1) = 1) are compared with character strings starting at position 3, 4, 5 or 6 of the document (since e(D, 1) + 1 = 3 and e(D, 1) + L + 1 = 6) to find the 2nd valid matching character string.
Since no 2nd valid matching character string is found within this range, only the 1st is a valid matching character string.
Home complaints
1
Prosecution にふみきつた at home periphery まま で.
1
Thus the 1st "similar character string" is "at home", from s(D, 1) to e(D, 1).
Similarity = min(2/4, 2/2) = 0.5
The first character after the beginning of "at home" that does not belong to a valid matching character string is in "periphery". The 2nd "similar character string" is therefore searched for from "periphery" onward:
Home complaints
1
Prosecution にふみきつた at home periphery まま で.
1
Note that "at home" and "prosecution" are separated by 4 characters in the document, and L is 3 in this example, so "prosecution" cannot be taken as a further valid matching character string of the 1st "similar character string".
(example 4)
Positions: 1 2 3
Character string to be retrieved C: silver clerk
Positions: 1 2 3 4 5 6 7 8 9
Document D: All the Chinese medicinal materials are used in Chinese medicine, B さ
The longest matching character string at the beginning is "bank", so the 1st valid matching character string is "bank":
s(C, 1) = 1, e(C, 1) = 2
s(D, 1) = 2, e(D, 1) = 3
(formula 7)
Character strings starting at or after the 2nd character of the character string to be retrieved (since e(C, 1) − (M − 1) = 1) are compared with character strings starting at position 4, 5, 6 or 7 of the document (since e(D, 1) + 1 = 4 and e(D, 1) + L + 1 = 7) to find the 2nd valid matching character string.
The 2nd valid matching character string is "member":
s(C, 2) = 2, e(C, 2) = 3
s(D, 2) = 4, e(D, 2) = 5
Since the end of the character string to be retrieved has been reached, there are two valid matching character strings.
Silver clerk
1
2
All the Chinese medicinal materials are used in Chinese medicine, B さ
1 2
1 + 1 + 0.5 + 1 → 3.5
The "similar character string" is "banker", from s(D, 1) to e(D, 2).
Similarity = min(3/3, 3.5/4) = 3.5/4 = 0.875
E8. Examples of fuzzy search results close to human perception
Search string: ソフトウエアメーカー
ソフトウエアのメーカー 0.909
ソフトウエア developed メーカー 0.833
ソフトウエア Prayer メーカー 0.769
These examples show that "similarity" decreases as extra characters are inserted.
ニツトウエアメーカー 0.800
ソフトメーカー 0.700
ソフトウエア 0.600
These examples show that "similarity" decreases as the number of matching characters decreases.
Division of science and chief election 1.000
Science leader election 0.929
"Chizizhao" 0.857
E9. Relationship between the index structure and the search for "similar character strings"
By setting the value of M appropriately, the fuzzy search for "similar character strings" can be performed at high speed using the index structure of the invention.
Method of determining the constants N and M
N: number of characters of the pattern stored in the index
M: minimum length of a valid matching character string in fuzzy search
L: maximum length of a run of characters not belonging to a valid matching character string within a "similar character string" in fuzzy search
If N is large, the number of distinct patterns increases and the amount of data returned per pattern decreases, so the search is fast, but the index file becomes larger. For ordinary Japanese documents, a sufficient search speed is obtained with N = 2.
If M is chosen so that M ≥ N, a sufficient search speed is obtained in fuzzy search. Since a smaller M allows a finer fuzzy search, M = N is considered satisfactory.
E10. Embodiment 2 for determining similarity
The fuzzy search processing of embodiment 2 pays particular attention to two of the aspects described above: "the more non-matching characters are sandwiched in between, the less similar the strings feel" and "the more non-matching characters are sandwiched in between, the less they feel like a single character string". Suppose the document contains, in order, a matching character string, a non-matching character string and another matching character string. It would be illogical if extending the similar character string to include the later matching character string lowered the similarity. For example, suppose the input character string is "home complaint", document 1 is "home complaint まま で at the home", and document 2 is "home". A rule that rates "home" in document 2 as the more similar of the two runs counter to human perception. It is logical either that the string running from "home" to "prosecution" in document 1 is judged more similar than "home" alone, or that document 1 is judged to contain two similar character strings, "home" and "prosecution".
The processing of embodiment 2 is explained below. In the flowchart of fig. 6, steps 602 to 612 are the same in this embodiment, while the search condition for the (i+1)th valid matching character string in step 614 is changed as follows.
s(C, i+1) > e(C, i) − (M − 1) … (formula 1)
s(D, i+1) > e(D, i) … (formula 2)
Furthermore,
s(D, i+1) − e(D, i) − 1 + max(e(C, i) − s(C, i+1) + 1, 0) ≤ L … (formula 3)
s(C, i), e(C, i), s(D, i), e(D, i) and so on are as defined above.
Formula 1 allows up to M − 1 characters to appear repeatedly, like the aforementioned "part" of "physical part length"; it also means that every character string that appears in the same order as in the input character string is valid.
Formula 2 means that valid matching character strings do not overlap in the document.
Formula 3 means that the total of the non-matching characters sandwiched in between and the repeated characters, such as the "part" of "physical part length", is allowed to be up to L characters.
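The three conditions can be written as a single check. The sketch below is illustrative only: it assumes that the third condition bounds the sum by L from above (the comparison operator is not legible in the formula as printed), it represents each valid matching character string as a hypothetical tuple (s_c, e_c, s_d, e_d) of 1-based inclusive positions, and the name `may_follow` is not from the source.

```python
def may_follow(prev, nxt, m=2, l=3):
    """Return True if nxt may be the valid matching string that follows prev."""
    _s_c, e_c, _s_d, e_d = prev
    s_c2, _e_c2, s_d2, _e_d2 = nxt
    gap = s_d2 - e_d - 1                  # non-matching characters in the document
    repeated = max(e_c - s_c2 + 1, 0)     # search-string characters matched twice
    return (s_c2 > e_c - (m - 1)          # formula 1: at most m - 1 repeated characters
            and s_d2 > e_d                # formula 2: no overlap in the document
            and gap + repeated <= l)      # formula 3 (assumed upper bound l)
```

With M = 2 and L = 3, a second match at (3, 7, 4, 8) may follow a first match at (1, 3, 1, 3), since there is no gap and one repeated character, whereas a match at (3, 4, 7, 8) may not follow (1, 2, 1, 2), since the gap of 4 characters exceeds L.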
In this embodiment, as in the previous one, the similarity reflects the proportion of each string similar to the search string in the document that is made up of valid matching character strings, so that strings with a low proportion are not selected. It is calculated by dividing the score of the "similar character string" by the full score (the score of a complete match). Each character is given a score according to the following rules, and the score of the "similar character string" is obtained by accumulating them. The following processing is therefore performed in step 620 of fig. 6.
Characters belonging to the 1st valid matching character string: 1
Characters belonging to the ith (i > 1) valid matching character string:
position in the search string ≥ e(C, i−1) + 1 (formula 4): 1
position in the search string < e(C, i−1) + 1 (formula 5): −1/(2 × L)
Characters not belonging to a valid matching character string: −1/L
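The scoring rules of this embodiment can be sketched as follows. This is an illustration under assumptions: a character of a later match counts as repeated when its search-string position does not exceed the end of the previous match, the full score is taken to be the length of the search string, matches are hypothetical tuples (s_c, e_c, s_d, e_d) of 1-based inclusive positions in document order, and the name `similarity_v2` is not from the source.

```python
def similarity_v2(query_len, matches, l=3):
    """Score of the similar string divided by the full score (the search-string length)."""
    score = 0.0
    prev_e_c = None
    prev_e_d = None
    for s_c, e_c, s_d, e_d in matches:
        if prev_e_d is not None:
            # Each non-matching character between two valid matches scores -1/l.
            score -= (s_d - prev_e_d - 1) / l
        for c in range(s_c, e_c + 1):
            if prev_e_c is not None and c <= prev_e_c:
                score -= 1 / (2 * l)   # repeated character (formula 5)
            else:
                score += 1             # ordinary matching character (formula 4)
        prev_e_c, prev_e_d = e_c, e_d
    return score / query_len
```

For three matches separated by one non-matching character each, with a 6-character search string and L = 3, this gives (6 − 2/3) / 6, or about 0.89; for matches (1, 3, 1, 3) and (3, 7, 4, 8) with a 7-character search string it gives (7 − 1/6) / 7, or about 0.98.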
In the present embodiment, as before, once the ith similar character string has been determined, the first character after its first character that does not belong to a valid matching character string is located, and comparison resumes from that character to find the (i+1)th similar character string.
The negative score for characters not belonging to a valid matching character string reflects both aspects noted above: "the more non-matching characters are sandwiched in between, the less similar the strings feel" and "the more non-matching characters are sandwiched in between, the less they feel like a single character string". Since the total negative score of one non-matching run is at most (1/L) × L = 1, while the positive score of the following matching character string is taken to be N at minimum, the negative score does not exceed the positive score when N ≥ 1 (N = 2 is specifically recommended for Japanese). Formula 5 identifies a character that appears repeatedly, such as the "part" of "physical part length", and formula 4 identifies an ordinary matching character that is not a repetition. A character identified by formula 5 is given a smaller negative score than an ordinary non-matching character, to handle the case where a repeated character occurs.
E11. Examples of determining similar character strings and similarity in embodiment 2
As examples, N is 2 and L is 3.
(Example 5)
Input character string C: アイビーエム
Positions: 1 2 3 4 5 6 7 8 …
Part of document D: …アイ・ビー・エム…
The longest matching character string at the beginning is "アイ", so the 1st valid matching character string is "アイ":
s(C, 1) = 1, e(C, 1) = 2
s(D, 1) = 1, e(D, 1) = 2
From formulas 1, 2 and 3, the 2nd valid matching character string is "ビー":
s(C, 2) = 3, e(C, 2) = 4
s(D, 2) = 4, e(D, 2) = 5
From formulas 1, 2 and 3, the 3rd valid matching character string is "エム":
s(C, 3) = 5, e(C, 3) = 6
s(D, 3) = 7, e(D, 3) = 8
Since the end of the character string to be searched has been reached, there are 3 valid matching character strings.
C: アイビーエム
1 2 3
D: アイ・ビー・エム
1 2 3
Scores: 1, 1, −1/3, 1, 1, −1/3, 1, 1
The similar character string is "アイ・ビー・エム", from s(D, 1) to e(D, 3).
Similarity = (1 × 6 + (−1/3) × 2) / 6 = 0.88
(Example 6)
Positions: 1 2 3 4 5 6 7 8 9 10
Input character string C: ソフトウエアメーカー
Positions: 1 2 3 4 5 6 7 8 9 …
Part of document D: …ソフト developed メーカー…
C: ソフトウエアメーカー
1 2
D: ソフト developed メーカー…
1 2
Similar character string = "ソフト developed メーカー"
Similarity = (1 × 7 + (−1/3) × 2) / 10 = 0.63
(Example 7)
Positions: 1 2 3 4
Input character string C: home complaints
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 …
Part of document D: prosecution にふみきつた at home periphery まま で.
Since the first matching character string is "at home", the 1st valid matching character string is "at home". The next matching character string is "prosecution", but it does not satisfy formula 3, so the 1st is the only valid matching character string.
C: home complaints
1
D: prosecution にふみきつた at home periphery まま で.
1
The similar character string is "at home".
Similarity = 2/4 = 0.5
The first character after the beginning of "at home" that does not belong to a valid matching character string is in "periphery". The 2nd similar character string is therefore searched for from "periphery" onward.
C: home complaints
1
D: prosecution にふみきつた at home periphery まま で.
1
Thus the 2nd similar character string is "prosecution".
(Example 8)
Positions: 1 2 3 4 5 6 7
Input character string C: the principal of science に ren
Positions: 1 2 3 4 5 6 7 8
Part of document D: … the length of the science department に is ren …
There are two valid matching character strings, "science division" and "division length に and the division length".
C: the principal of science に ren
1
2
D: the length of the science department に is ren
1 2
Scores: 1, 1, 1, −1/6, 1, 1, 1, 1
The similar character string is "science length に talent". The 2nd "section" satisfies formula 5, so
Similarity = (1 × 7 + (−1/6) × 1) / 7 = 0.97
E12. Summary of results of embodiment 2
Similarity in input string documents
ソフトメ-カ- ソフトのメ-カ- 0.95
ソフト Provisions of メ - カ -0.85 rules for political capital regulation law for political capital regulation
Political capital 0.50 division of science に ren and principle division of science に ren and 0.97
Zhi Zhong に Zhi ren 0.95
As described above, the present invention makes it possible to perform, at high speed, a fuzzy search on a text file or database that matches human perception, by using a specific index structure.

Claims (56)

1. An information retrieval method for finding out, by computer processing, a similarity of document character strings similar to a retrieval character string among documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) a step of extracting a partial character string having a length of M characters or more (M is a predetermined integer of 2 or more) from the beginning of the search character string and detecting a start position and an end position in the document that match the extracted partial character string (hereinafter, a partial character string having a length of M characters or more determined by the start position and the end position is referred to as a valid matching character string);
(c) searching for the valid matching character string if a response indicating that no valid matching character string is detected occurs in step (b), by shifting a character from the start position of the partial character string of the search character string, and then by selecting a partial character string having a length of at least M characters (M is a predetermined integer of at least 2);
(d) searching for a valid matching character string having a starting position within a distance of L characters (L is a predetermined integer of 1 or more) from a partial character string starting position of the search character string and a search starting position in the document, respectively, by a length corresponding to the valid matching character string just detected, if a response for detecting the valid matching character string occurs;
(e) continuing the step (d) as long as the valid matching character string is detected;
(f) and calculating a similarity between the search string and a string from the start position of the first valid matching string of the document to the end position of the last valid matching string of the document based on information existing in the valid matching strings at least in the string from the start position of the first valid matching string of the document to the end position of the last valid matching string of the document.
2. The information retrieval method according to claim 1, characterized in that: m is 2, and L is 3 or more.
3. The information retrieval method according to claim 1, characterized in that: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document occupied by valid matching character strings.
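The calculation in claim 3 can be sketched as follows. This is an illustrative reading only: the function name `similarity_min_ratio` and the input layout (a list of `(doc_start, doc_end, query_start, query_end)` tuples with inclusive end positions, assumed not to overlap) are hypothetical and do not appear in the claims.

```python
def similarity_min_ratio(chain, query_len):
    """Smaller of the matched proportion of the search string and of the document span.

    chain: list of (doc_start, doc_end, query_start, query_end), inclusive ends
    (an assumed layout). query_len: length of the search character string.
    """
    if not chain or query_len == 0:
        return 0.0
    # Number of characters that belong to valid matching strings.
    matched = sum(d_end - d_start + 1 for d_start, d_end, _, _ in chain)
    # Document span from the first match start to the last match end.
    span = chain[-1][1] - chain[0][0] + 1
    return min(matched / query_len, matched / span)


chain = [(2, 4, 0, 2), (6, 8, 3, 5)]
print(similarity_min_ratio(chain, query_len=6))
```

Here all 6 characters of the search string are matched, but they are spread over a 7-character span in the document, so the smaller proportion (6/7) is taken.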
4. The information retrieval method according to claim 1, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
5. An information retrieval method for finding, by computer processing, a position where a retrieval string appears in a document stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) extracting, from the beginning of the search character string, a partial character string having a length of M characters or more (M is a predetermined integer of 2 or more), and detecting a start position and an end position in the document that match the partial character string (hereinafter, a partial character string having a length of M characters or more determined by the start position and the end position is referred to as a valid matching character string);
(d) if no valid matching character string is detected in step (c), shifting the start position of the partial character string in the search character string by one character, extracting a partial character string having a length of M characters or more, and searching for a valid matching character string again;
(e) if a valid matching character string is detected, advancing the partial character string start position in the search character string and the search start position in the document by the length of the valid matching character string just detected, and searching for a valid matching character string whose start position lies within L characters (L is a predetermined integer of 1 or more) of the valid matching character string just detected;
(f) continuing the step (e) as long as the valid matching character string is detected;
(g) calculating a similarity between the search character string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document, based at least on information about the valid matching character strings present in that character string; and
(h) if the calculated similarity is greater than the similarity input in step (b), displaying the contents of the valid matching character strings contained in the document.
6. The information retrieval method according to claim 5, characterized in that: m is 2, and L is 3 or more.
7. The information retrieval method according to claim 5, characterized in that: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document occupied by valid matching character strings.
8. The information retrieval method according to claim 5, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
9. The information retrieval method according to claim 8, characterized in that: the score added is 1 per character, and the score subtracted is 1/L per character.
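The scoring of claims 8 and 9 can be sketched as below. The function name `similarity_score` and the boolean-flag input are hypothetical, and the sketch assumes that the "perfect matching score" equals the length of the search character string (one point per character), which the claims do not state explicitly.

```python
def similarity_score(span_flags, query_len, L=3):
    """Score-based similarity over the document span.

    span_flags: one boolean per character between the start of the first
    valid matching string and the end of the last one; True means the
    character belongs to a valid matching string (assumed input layout).
    """
    # +1 for each matched character, -1/L for each unmatched character.
    score = sum(1.0 if matched else -1.0 / L for matched in span_flags)
    # Assumption: a perfect match scores one point per search-string character.
    return score / query_len


# Span "abcYdef": six matched characters and one unmatched character.
flags = [True, True, True, False, True, True, True]
print(similarity_score(flags, query_len=6, L=3))
```

With L = 3, a single unmatched character inside the span costs one third of a point, so the example scores (6 - 1/3) / 6.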
10. An information retrieval method for detecting, by computer processing, a similarity of a document character string similar to a retrieval character string in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a retrieval character string;
(b) extracting, from the beginning of the search character string, a partial character string having a length of M characters or more (M is a predetermined integer of 2 or more), and detecting a start position and an end position, within a single document in the database, that match the extracted partial character string (hereinafter, a partial character string having a length of M characters or more determined by the start position and the end position is referred to as a valid matching character string);
(c) if no valid matching character string is detected in step (b), shifting the start position of the partial character string in the search character string by one character, extracting a partial character string having a length of M characters or more, and searching for a valid matching character string again;
(d) if a valid matching character string is detected, advancing the partial character string start position in the search character string and the search start position in the same document by the length of the valid matching character string just detected, and searching for a valid matching character string whose start position lies within L characters (L is a predetermined integer of 1 or more) of the valid matching character string just detected;
(e) continuing step (d) as long as a valid matching character string is detected;
(f) and calculating a similarity between the search string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
11. The information retrieval method according to claim 10, characterized in that: m is 2 and L is 3.
12. The information retrieval method according to claim 10, characterized in that: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document occupied by valid matching character strings.
13. The information retrieval method according to claim 10, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
14. An information retrieval method for checking out, by computer processing, where a retrieval string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search string;
(b) inputting similarity;
(c) extracting, from the beginning of the search character string, a partial character string having a length of M characters or more (M is a predetermined integer of 2 or more), and detecting a start position and an end position, within a single document in the database, that match the extracted partial character string (hereinafter, a partial character string having a length of M characters or more determined by the start position and the end position is referred to as a valid matching character string);
(d) if no valid matching character string is detected in step (c), shifting the start position of the partial character string in the search character string by one character, extracting a partial character string having a length of M characters or more, and searching for a valid matching character string again;
(e) if a valid matching character string is detected, advancing the partial character string start position in the search character string and the search start position in the same document by the length of the valid matching character string just detected, and searching for a valid matching character string whose start position lies within L characters (L is a predetermined integer of 1 or more) of the valid matching character string just detected;
(f) continuing step (e) as long as a valid matching character string is detected;
(g) calculating a similarity between the search character string and the character string from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document, based at least on information about the valid matching character strings present in that character string; and
(h) if the calculated similarity is greater than the similarity input in step (b), displaying the contents of the valid matching character strings contained in the document.
15. The information retrieval method according to claim 14, characterized in that: the method includes a step of labeling the plurality of documents with a unique number or symbol in advance.
16. The information retrieval method as recited in claim 15, wherein: the above-mentioned inherent numbers or symbols are numbers in order.
17. The information retrieval method according to claim 14, characterized in that: m is 2 and L is 3.
18. The information retrieval method according to claim 14, characterized in that: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document occupied by valid matching character strings.
19. The information retrieval method according to claim 14, characterized in that: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
20. The information retrieval method as recited in claim 19, wherein: the score added is 1 per character, and the score subtracted is 1/L per character.
21. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a document stored in a retrievable manner, the system having:
(a) means for inputting a search string;
(b) means for searching the document for a start position and an end position that match a partial character string of the search character string having a length of M characters or more (M is a predetermined integer of 2 or more), a partial character string having a length of M characters or more determined by the start position and the end position being referred to as a valid matching character string;
(c) means for searching for valid matching character strings by applying means (b) from the beginning of the search character string, wherein, if a valid matching character string is detected, the partial character string start position in the search character string and the search start position in the document are advanced by the length of the valid matching character string just detected, and a valid matching character string whose start position lies within L characters (L is a predetermined integer of 1 or more) of the valid matching character string just detected is searched for, and, if no valid matching character string is detected, the start position of the partial character string in the search character string is shifted by one character, a partial character string of M characters or more is extracted, and the search is repeated;
(d) means for continuing the search of means (c) as long as a valid matching character string is detected;
(e) and a means for calculating the similarity between the search string and the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document based on information present in the valid matching character string at least in the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document.
22. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a document stored in a retrievable manner, the system comprising:
(a) means for inputting a search string;
(b) means for inputting the similarity;
(c) means for searching the document for a start position and an end position that match a partial character string of the search character string having a length of M characters or more (M is a predetermined integer of 2 or more), a partial character string having a length of M characters or more determined by the start position and the end position being referred to as a valid matching character string;
(d) a means for searching for a valid matching character string by applying the means (c) from the beginning of the search character string and by shifting only the length of the valid matching character string just detected from the start position of the partial character string of the search character string and the search start position in the document based on the occurrence of a response to detect a valid matching character string, the valid matching character string having a distance of L characters or less from the start position of the valid matching character string just detected and L being a predetermined integer of 1 or more, and if no response to detect a valid matching character string has occurred, the valid matching character string being searched by shifting one character from the start position of the partial character string of the search character string and then taking out a partial character string of M characters or more (M being a predetermined integer of 2 or more);
(e) means for continuing the search of means (d) as long as a valid matching character string is detected;
(f) means for calculating a similarity between a character string at least between a start position of a first valid matching character string of the document and an end position of a last valid matching character string of the document and the search character string, based on information present in the valid matching character string;
(g) and means for displaying the contents of the valid matching character strings contained in the document if the calculated similarity is greater than the similarity input through means (b).
23. The information retrieval system as recited in claim 22, wherein: m is 2 and L is 3.
24. The information retrieval system as recited in claim 22, wherein: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document occupied by valid matching character strings.
25. The information retrieval system as recited in claim 22, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to the valid matching character string, and a score is subtracted when it does not belong to the valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the result score by the value of the perfect matching score.
26. The information retrieval system as recited in claim 25, wherein: the score added is 1 per character, and the score subtracted is 1/L per character.
27. An information retrieval system for detecting, by computer processing, a position where a retrieval character string appears in a database of a plurality of documents stored in a retrievable manner, the system comprising:
(a) means for inputting a search string;
(b) means for inputting similarities;
(c) means for searching a document in the database for a start position and an end position that match a partial character string taken from the beginning of the search character string and having a length of M characters or more (M is a predetermined integer of 2 or more), a partial character string having a length of M characters or more determined by the start position and the end position being referred to as a valid matching character string;
(d) means for searching for valid matching character strings by applying means (c) from the beginning of the search character string, wherein, if a valid matching character string is detected, the partial character string start position in the search character string and the search start position in the same document are advanced by the length of the valid matching character string just detected, and a valid matching character string whose start position lies within L characters (L is a predetermined integer of 1 or more) of the valid matching character string just detected is searched for, and, if no valid matching character string is detected, the start position of the partial character string in the search character string is shifted by one character, a partial character string of M characters or more (M is a predetermined integer of 2 or more) is extracted, and the search is repeated;
(e) means for continuing the search of means (d) as long as a valid matching character string is detected;
(f) means for calculating a similarity between a character string at least between a start position of a first valid matching character string of the same document and an end position of a last valid matching character string of the same document and the search character string, based on information present in the valid matching character string, from the start position of the first valid matching character string of the same document to the end position of the last valid matching character string of the same document;
(g) and means for displaying the contents of the valid matching character strings contained in the document if the calculated similarity is greater than the similarity input through means (b).
28. The information retrieval system as recited in claim 27, wherein: m is 2 and L is 3.
29. The information retrieval system as recited in claim 27, wherein: the similarity is calculated as the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion of the character string between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document occupied by valid matching character strings.
30. The information retrieval system as recited in claim 27, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when it belongs to a valid matching character string, and a score is subtracted when it does not belong to a valid matching character string, and the calculation of the above-mentioned similarity uses a value obtained by dividing the resulting score by the perfect matching score.
31. The information retrieval system as recited in claim 30, wherein: the score added is 1 per character, and the score subtracted is 1/L per character.
32. A method of creating an index file by computer processing, with which a character pattern having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner, the method comprising the steps of:
(a) sequentially scanning the document and writing, into a storage area, each character pattern of N consecutive characters appearing in the document together with information on the position at which that character pattern appears in the document;
(b) once all of the document has been scanned in step (a), sorting the information written in the storage area by character pattern, and attaching the corresponding position information to each distinct character pattern;
(c) and creating and outputting a file that uses the character pattern as a key, so that the attached position information corresponding to the character pattern can be searched.
33. The index file creation method according to claim 32, characterized in that: the search in step (c) is a binary search.
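The index creation of claims 32 and 33 can be sketched in memory as follows. The names `build_index` and `lookup` and the in-memory layout (a sorted key list with a parallel list of position lists) are assumptions for illustration; the claims describe an index file and do not specify its format.

```python
from bisect import bisect_left


def build_index(doc, N=2):
    """Scan the document, record every N-character pattern with its position,
    sort by pattern, and group the positions under each distinct pattern."""
    entries = sorted((doc[i:i + N], i) for i in range(len(doc) - N + 1))
    keys, postings = [], []
    for pattern, pos in entries:
        if keys and keys[-1] == pattern:
            postings[-1].append(pos)
        else:
            keys.append(pattern)
            postings.append([pos])
    return keys, postings


def lookup(index, pattern):
    """Binary search for a pattern; return its positions, or [] if absent."""
    keys, postings = index
    i = bisect_left(keys, pattern)
    if i < len(keys) and keys[i] == pattern:
        return postings[i]
    return []


index = build_index("abcabc", N=2)
print(lookup(index, "bc"))
```

Because the keys are kept in sorted order, each lookup needs only a logarithmic number of comparisons in the number of distinct patterns.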
34. The information retrieval method according to claim 1 or claim 5, characterized in that: the search for a character pattern of M characters in the document is performed using an index file created by the method according to claim 33, using position information obtained from the index file.
35. A method of creating an index file by computer processing, with which a character pattern having a length of N characters can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) attaching to each of the plurality of documents a symbol or number that identifies it individually;
(b) sequentially scanning the database and writing, into a storage area, each character pattern of N consecutive characters appearing in each document of the database together with the attached document identification symbol or number and information on the position at which that character pattern appears in the document;
(c) once all the documents in the database have been scanned in step (b), sorting the information written in the storage area by character pattern, and attaching the corresponding document identification symbol or number and in-document position information to each distinct character pattern;
(d) and creating and outputting a file that uses the character pattern as a key, so that the attached document identification symbol or number and position information corresponding to the character pattern can be searched.
36. The index file creation method according to claim 35, characterized in that: the search in step (d) is a binary search.
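For a database of several documents, the same idea extends to entries that carry a document identifier. The sketch below is illustrative only: the names `build_database_index` and `lookup_database` are hypothetical, documents are numbered sequentially from 1 as one possible identification scheme, and grouping is done with a dictionary before sorting the keys rather than by sorting the raw entries.

```python
from bisect import bisect_left


def build_database_index(documents, N=2):
    """Map each N-character pattern to a list of (doc_id, position) pairs."""
    grouped = {}
    # Assumption: documents are identified by sequential numbers starting at 1.
    for doc_id, text in enumerate(documents, start=1):
        for pos in range(len(text) - N + 1):
            grouped.setdefault(text[pos:pos + N], []).append((doc_id, pos))
    keys = sorted(grouped)
    return keys, [grouped[k] for k in keys]


def lookup_database(index, pattern):
    """Binary search on the sorted keys; return (doc_id, position) pairs."""
    keys, postings = index
    i = bisect_left(keys, pattern)
    if i < len(keys) and keys[i] == pattern:
        return postings[i]
    return []


db_index = build_database_index(["abc", "bcd"], N=2)
print(lookup_database(db_index, "bc"))
```

The pattern "bc" occurs at position 1 of document 1 and position 0 of document 2, so a single lookup returns both occurrences with their document identifiers.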
37. The information retrieval method according to claim 10 or claim 14, characterized in that: the search for a character pattern of M characters in the document is performed using an index file created by the method according to claim 36, using position information obtained from the index file.
38. The information retrieval method according to claim 37, characterized in that: both M and N are 2, and L is 3 or more.
39. A method of creating an index file by computer processing, with which a character pattern having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner, the method comprising the steps of:
(a) sequentially scanning the document, writing into a storage area a pre-specified separator appearing in the document together with information on the position at which the separator appears in the document, and also writing into the storage area each character pattern of N consecutive characters appearing in the document together with information on the position at which that character pattern appears in the document;
(b) once all of the document has been scanned in step (a), sorting the information written in the storage area by character pattern, and attaching the corresponding position information to each distinct character pattern;
(c) and creating and outputting a file that uses the character pattern as a key, so that the attached position information corresponding to the character pattern can be searched.
40. The information retrieval method according to claim 1 or claim 5, characterized in that: the search for a character pattern of M characters in the document is performed using an index file created by the method according to claim 39, using position information obtained from the index file.
41. The information retrieval method as recited in claim 40, further comprising: a step of searching for the separator in the document and adding its position, at the corresponding place, to the position information detected when searching the document for the character pattern of M characters.
42. A system for creating an index file, with which a character pattern having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner by computer processing, comprising:
(a) means for sequentially scanning the document and writing, into a storage area, each character pattern of N consecutive characters appearing in the document together with information on the position at which that character pattern appears in the document;
(b) means for sorting the information written in the storage area by character pattern once means (a) has scanned all of the document, and attaching the corresponding position information to each distinct character pattern;
(c) and means for creating and outputting a file that uses the character pattern as a key, so that the attached position information corresponding to the character pattern can be searched.
43. A system for creating an index file, with which a character pattern having a length of N characters can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner by computer processing, comprising:
(a) means for attaching to each of the plurality of documents a symbol or number that identifies it individually;
(b) means for sequentially scanning the database and writing, into a storage area, each character pattern of N consecutive characters appearing in each document of the database together with the attached document identification symbol or number and information on the position at which that character pattern appears in the document;
(c) means for sorting the information written in the storage area by character pattern once means (b) has scanned all the documents in the database, and attaching the corresponding document identification symbol or number and in-document position information to each distinct character pattern;
(d) and means for creating and outputting a file that uses the character pattern as a key, so that the attached document identification symbol or number and position information corresponding to the character pattern can be searched.
44. The index file creation system according to claim 43, characterized in that: N is 2.
45. A system for creating an index file, with which a character pattern having a length of N characters can be retrieved at high speed in a document stored in a retrievable manner by computer processing, comprising:
(a) means for sequentially scanning the document, writing into a storage area a previously specified delimiter appearing in the document together with information on the position at which the delimiter appears in the document, and writing into the storage area each character pattern of N consecutive characters appearing in the document together with information on the position at which that character pattern appears in the document;
(b) means for sorting the information written in the storage area by character pattern once means (a) has scanned all of the document, and attaching the corresponding position information to each distinct character pattern;
(c) and means for creating and outputting a file that uses the character pattern as a key, so that the attached position information corresponding to the character pattern can be searched.
46. A system for creating an index file, with which a font having a length of N characters can be retrieved at high speed in a database of a plurality of documents stored in a retrievable manner by computer processing, comprising:
(a) means for attaching a symbol or number to each of the plurality of documents for individual identification thereof;
(b) means for sequentially scanning the database, writing a previously designated delimiter appearing in each document of the database and information on the position of the delimiter appearing in the document into a corresponding storage area to which the document identification symbol or number is appended, and writing information on the position of the character pattern of any N characters and the position of the character pattern of the N characters appearing in each document of the database into the corresponding storage area to which the document identification symbol or number is appended;
(c) means for sorting information written in the storage area in accordance with the font when a response that all documents in the database have been scanned is generated in the means (b), and adding the document identification symbol or number and position information in the document corresponding to the sorted information in each of the different fonts;
(d) and a device for generating and outputting a file by using the font as a key so as to search the attached document identification symbol or number and the position information corresponding to the font.
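As an illustration of the index creation recited in claims 45 and 46, the following is a minimal Python sketch, not the patented implementation. The function name `build_index`, the default delimiter set, the use of list indices as document identification numbers, and the in-memory dictionary standing in for the output file are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(documents, n=2, delimiters=frozenset(".,;!? \n")):
    """Map each N-character pattern and each delimiter to (doc_id, position) pairs.

    Hypothetical helper: doc_id is simply the list index of the document.
    """
    entries = []  # plays the role of the "storage area": (key, doc_id, position)
    for doc_id, text in enumerate(documents):
        # Record every designated delimiter and where it appears.
        for pos, ch in enumerate(text):
            if ch in delimiters:
                entries.append((ch, doc_id, pos))
        # Record every run of N consecutive characters and where it starts.
        for pos in range(len(text) - n + 1):
            entries.append((text[pos:pos + n], doc_id, pos))

    # After all documents are scanned, sort by pattern (then doc_id, position).
    entries.sort()

    # Group the position information under each distinct pattern (the key).
    index = defaultdict(list)
    for key, doc_id, pos in entries:
        index[key].append((doc_id, pos))
    return dict(index)
```

Looking up `build_index(docs)["ab"]` then returns every document number and position at which the pattern "ab" starts.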
47. An information retrieval method for enabling, by computer processing, the detection of a position where a search character string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search character string;
(b) inputting a similarity;
(c) a step of searching the same document in the database for a start position and an end position matching a partial character string taken from the beginning of the search character string and having a length of at least M characters, M being a predetermined integer of 2 or more, wherein the partial character string of at least M characters determined by the start position and the end position is referred to as a valid matching character string;
(d) a step of, letting
s(D,i) be the start position in the document of the i-th valid matching character string,
e(D,i) be the end position in the document of the i-th valid matching character string,
s(C,i) be the start position in the search character string of the i-th valid matching character string, and
e(C,i) be the end position in the search character string of the i-th valid matching character string,
searching for the (i+1)-th valid matching character string satisfying the following two conditions:
e(D,i)+1≤s(D,i+1)≤e(D,i)+L+1
and
s(C,i+1)>e(C,i)-(M-1)
where L is a predetermined integer of 1 or more;
(e) continuing step (d) as long as a valid matching character string is detected;
(f) calculating a similarity between the search character string and the character string extending from the start position of the first valid matching character string to the end position of the last valid matching character string in the same document, based at least on information about the valid matching character strings present in that character string; and
(g) if the similarity obtained by the above calculation is greater than the similarity input in step (b), displaying the contents of the valid matching character strings contained in the document.
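The two conditions of step (d) in claim 47 can be sketched as a small predicate, assuming each valid matching character string is represented as a hypothetical tuple `(s_D, e_D, s_C, e_C)` of inclusive positions. The greedy `chain_matches` helper is a simplification introduced for illustration only; the claim does not prescribe how candidates are enumerated.

```python
def can_follow(prev, nxt, M=2, L=3):
    """Check the two step (d) conditions for nxt following prev.

    prev, nxt: (s_D, e_D, s_C, e_C) tuples with inclusive positions.
    """
    _, e_d, _, e_c = prev
    s_d_next, _, s_c_next, _ = nxt
    # e(D,i)+1 <= s(D,i+1) <= e(D,i)+L+1: at most L skipped document characters.
    doc_ok = e_d + 1 <= s_d_next <= e_d + L + 1
    # s(C,i+1) > e(C,i)-(M-1): limited backward overlap in the search string.
    query_ok = s_c_next > e_c - (M - 1)
    return doc_ok and query_ok


def chain_matches(candidates, M=2, L=3):
    """Greedily link candidates (sorted by document start) into one chain."""
    ordered = sorted(candidates)
    if not ordered:
        return []
    chain = [ordered[0]]
    for cand in ordered[1:]:
        if can_follow(chain[-1], cand, M, L):
            chain.append(cand)
    return chain
```

With M = 2 and L = 3, a match ending at document position 2 can be followed by one starting at positions 3 through 6, but not by one starting at position 8.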
48. The information retrieval method as recited in claim 47, wherein: M is 2 and L is 3 or more.
49. The information retrieval method as recited in claim 47, wherein: the calculation of the similarity takes the smaller of the proportion of the search character string occupied by valid matching character strings and the proportion of the character string from the start position of the first valid matching character string of the document to the end position of the last valid matching character string of the document occupied by valid matching character strings.
50. The information retrieval method as recited in claim 47, wherein: the calculation of the similarity takes the smaller of the proportion of the search character string covered by the valid matching character strings and the proportion of valid matching character strings between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document.
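The "smaller of two proportions" similarity of claims 49 and 50 might be computed as in the sketch below. It assumes the same hypothetical `(s_D, e_D, s_C, e_C)` tuples with inclusive positions, and it counts each covered character once even when matches overlap, which is an interpretation rather than something the claims state.

```python
def similarity_min_ratio(chain, query_len):
    """Smaller of (covered share of query) and (covered share of document span).

    chain: list of (s_D, e_D, s_C, e_C) tuples, ordered by document position.
    query_len: length of the search character string.
    """
    if not chain or query_len <= 0:
        return 0.0
    covered_doc = set()
    covered_query = set()
    for s_d, e_d, s_c, e_c in chain:
        covered_doc.update(range(s_d, e_d + 1))
        covered_query.update(range(s_c, e_c + 1))
    # Document span: first match start to last match end, inclusive.
    span_len = chain[-1][1] - chain[0][0] + 1
    return min(len(covered_query) / query_len, len(covered_doc) / span_len)
```

For example, two three-character matches separated by one unmatched document character cover the whole of a six-character search string but only 6 of the 7 characters in the document span, so the similarity is 6/7.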
51. An information retrieval method for detecting, by computer processing, a position where a search character string appears in a database of a plurality of documents stored in a retrievable manner, the method comprising the steps of:
(a) inputting a search character string;
(b) inputting a similarity;
(c) a step of extracting, starting from the beginning of the search character string, a partial character string having a length of M characters or more, M being a predetermined integer of 2 or more, and detecting a start position and an end position matching the partial character string in the same document in the database, wherein the partial character string of M characters or more determined by the start position and the end position is referred to as a valid matching character string …;
(d) a step of, letting
s(D,i) be the start position in the document of the i-th valid matching character string,
e(D,i) be the end position in the document of the i-th valid matching character string,
s(C,i) be the start position in the search character string of the i-th valid matching character string, and
e(C,i) be the end position in the search character string of the i-th valid matching character string,
searching for the (i+1)-th valid matching character string satisfying the following conditions:
s(C,i+1)>e(C,i)-(M-1)
s(D,i+1)>e(D,i)
and
s(D,i+1)-e(D,i)-1+max(e(C,i)-s(C,i+1)+1, 0)≤L
(in the above formula, L is a predetermined integer of 1 or more);
(e) continuing step (d) as long as a valid matching character string is detected;
(f) calculating a similarity between the search character string and the character string extending from the start position of the first valid matching character string to the end position of the last valid matching character string in the same document, based at least on information about the valid matching character strings present in that character string; and
(g) displaying the contents of the valid matching character strings contained in the document if the calculated similarity is greater than the similarity input in step (b).
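Step (d) of claim 51 differs from claim 47 in that the document gap and the overlap in the search character string are added together before being compared with L. A minimal sketch of that combined test follows, again assuming hypothetical `(s_D, e_D, s_C, e_C)` tuples with inclusive positions; the function names are illustrative.

```python
def gap_cost(prev, nxt):
    """Document characters skipped plus search-string characters reused."""
    _, e_d, _, e_c = prev
    s_d_next, _, s_c_next, _ = nxt
    doc_gap = s_d_next - e_d - 1             # s(D,i+1) - e(D,i) - 1
    overlap = max(e_c - s_c_next + 1, 0)     # max(e(C,i) - s(C,i+1) + 1, 0)
    return doc_gap + overlap


def can_follow_combined(prev, nxt, M=2, L=3):
    """Check the three step (d) conditions of claim 51."""
    _, e_d, _, e_c = prev
    s_d_next, _, s_c_next, _ = nxt
    return (
        s_c_next > e_c - (M - 1)      # limited backward overlap in the query
        and s_d_next > e_d            # strictly later in the document
        and gap_cost(prev, nxt) <= L  # combined gap and overlap within L
    )
```

A candidate that skips three document characters and also reuses one search-string character has a combined cost of 4 and is rejected when L = 3, even though each quantity alone is within bounds.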
52. The information retrieval method as recited in claim 51, wherein: M is 2, and L is 3 or more.
53. The information retrieval method as recited in claim 51, wherein: the calculation of the similarity takes the smaller of the proportion of the search character string occupied by valid matching character strings and the proportion of the character string between the start position of the first valid matching character string of the document and the end position of the last valid matching character string of the document occupied by valid matching character strings.
54. The information retrieval method as recited in claim 51, wherein: for each character in the character string from the start position of the first valid matching character string of the above-mentioned document to the end position of the last valid matching character string of the above-mentioned document, a score is added when the character belongs to a valid matching character string and a score is subtracted when it does not, and the calculation of the above-mentioned similarity uses the value obtained by dividing the resulting score by the score of a full match.
55. The information retrieval method as recited in claim 51, wherein: the score added is 1 per character, and the score subtracted is 1/L per character.
56. And 1/(2L) is subtracted for each character that belongs to the i-th valid matching character string and whose corresponding character in the search character string belongs to the (i-1)-th valid matching character string.
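The score-based similarity of claims 54 to 56 could be sketched as follows. Several points are assumptions made only for this illustration: the full-match score is taken to be the length of the search character string; a character covered by claim 56 is given the 1/(2L) deduction in place of the +1 addition; and matches are hypothetical `(s_D, e_D, s_C, e_C)` tuples with inclusive positions that do not overlap in the document.

```python
def similarity_score(chain, query_len, L=3):
    """Score each document character in the span, then normalise.

    +1     per character inside a valid matching character string
    -1/L   per document character between consecutive matches
    -1/2L  per matched character whose search-string counterpart was
           already used by the previous match (assumed to replace the +1)
    """
    if not chain or query_len <= 0:
        return 0.0
    score = 0.0
    for i, (s_d, e_d, s_c, e_c) in enumerate(chain):
        length = e_d - s_d + 1
        overlap = 0
        if i > 0:
            prev_e_d, prev_e_c = chain[i - 1][1], chain[i - 1][3]
            # Unmatched document characters between the two matches.
            score -= (s_d - prev_e_d - 1) / L
            # Search-string characters shared with the previous match.
            overlap = min(max(prev_e_c - s_c + 1, 0), length)
        score += (length - overlap) - overlap / (2 * L)
    # Assumed full-match score: one point per search-string character.
    return score / query_len
```

Under these assumptions, a six-character search string matched as two three-character pieces with one stray document character between them scores (6 - 1/3) / 6 = 17/18 when L = 3.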
CN 95118142 1994-11-22 1995-11-01 Information searching method and system Pending CN1151558A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP6287642A JP2669601B2 (en) 1994-11-22 1994-11-22 Information retrieval method and system
JP287642/94 1994-11-22

Publications (1)

Publication Number Publication Date
CN1151558A true CN1151558A (en) 1997-06-11

Family

ID=17719873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 95118142 Pending CN1151558A (en) 1994-11-22 1995-11-01 Information searching method and system

Country Status (3)

Country Link
JP (1) JP2669601B2 (en)
KR (1) KR960018993A (en)
CN (1) CN1151558A (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3275816B2 (en) * 1998-01-14 2002-04-22 日本電気株式会社 Symbol string search method, symbol string search device, and recording medium recording symbol string search program
JP4042580B2 (en) * 2003-01-28 2008-02-06 ヤマハ株式会社 Terminal device for speech synthesis using pronunciation description language
CN1645374A (en) * 2005-01-17 2005-07-27 徐文新 Digit marking character string searching technology
WO2007026870A1 (en) 2005-09-02 2007-03-08 Nec Corporation Data clustering device, clustering method, and clustering program
JP5900367B2 (en) * 2013-01-30 2016-04-06 カシオ計算機株式会社 SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
CN110033302B (en) * 2014-10-28 2023-08-04 创新先进技术有限公司 Malicious account identification method and device
CN108133016A (en) * 2017-12-22 2018-06-08 大连景竣科技有限公司 One kind does public document alignment system and method
CN112733524A (en) * 2020-12-31 2021-04-30 浙江省方大标准信息有限公司 Method, system and device for automatically correcting standard serial numbers and batch checking standard states

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3099298B2 (en) * 1991-03-20 2000-10-16 株式会社日立製作所 Document search method and apparatus
JPH07109603B2 (en) * 1990-12-12 1995-11-22 株式会社テレマティーク国際研究所 Information retrieval processing method and retrieval file creation device

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100424676C (en) * 1999-07-01 2008-10-08 株式会社日立制作所 Geographical name presentation method, method and apparatus for geographical name string identification
CN100357946C (en) * 2004-06-09 2007-12-26 金宝电子(上海)有限公司 Electronic device and method for fast comparision searching word string
US7831241B2 (en) 2004-12-22 2010-11-09 Research In Motion Limited Entering contacts in a communication message on a mobile device
CN101116357B (en) * 2004-12-22 2012-12-12 捷讯研究有限公司 Entering contacts in a communication message on a mobile device
US8675845B2 (en) 2004-12-22 2014-03-18 Blackberry Limited Entering contacts in a communication message on a mobile device
CN101517363B (en) * 2006-08-18 2012-09-26 谷歌公司 Providing routing information based on ambiguous locations
WO2008098495A1 (en) * 2007-02-14 2008-08-21 Jie Bai Method and device for determing object file
CN101939743B (en) * 2007-12-24 2013-10-16 高通股份有限公司 Apparatus and methods for retrieving/downloading content on a communication device
CN103425629A (en) * 2012-05-24 2013-12-04 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN103425629B (en) * 2012-05-24 2017-05-03 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN104090875A (en) * 2013-04-01 2014-10-08 鸿富锦精密工业(深圳)有限公司 Information retrieval system and information retrieval method
CN113971702A (en) * 2021-10-29 2022-01-25 深圳市道通科技股份有限公司 Picture compression method, decompression method and electronic equipment

Also Published As

Publication number Publication date
KR960018993A (en) 1996-06-17
JP2669601B2 (en) 1997-10-29
JPH08147320A (en) 1996-06-07

Similar Documents

Publication Publication Date Title
CN1109994C (en) Document processor and recording medium
CN1215433C (en) Online character identifying device, method and program and computer readable recording media
CN1151558A (en) Information searching method and system
CN1101032C (en) Related term extraction apparatus, related term extraction method, and computer-readable recording medium having related term extration program recorded thereon
CN1171162C (en) Apparatus and method for retrieving charater string based on classification of character
CN1194319C (en) Method for retrieving, listing and sorting table-formatted data, and recording medium recorded retrieving, listing or sorting program
CN1158627C (en) Method and apparatus for character recognition
CN1209725C (en) File edit processing method and apparatus, and program load medium
CN1204515C (en) Method and apparatus for free-form data processing
CN1132564A (en) Method and apparatus for data storage and retrieval
CN1281191A (en) Information retrieval method and information retrieval device
CN1894688A (en) Translation determination system, method, and program
CN1552032A (en) Database
CN1855103A (en) System and methods for dedicated element and character string vector generation
CN1331449A (en) Method and relative system for dividing or separating text or decument into sectional word by process of adherence
CN1117160A (en) System and method for generating glyphs for unknown characters
CN1535433A (en) Category based, extensible and interactive system for document retrieval
CN1173684A (en) Apparatus for recognizing input character strings by inference
CN1350250A (en) Integrated file writing and translating system
CN1869989A (en) System and method for generating structured representation from structured description
CN1647069A (en) Conversation control system and conversation control method
CN1752963A (en) Document information processing apparatus, document information processing method, and document information processing program
CN1290899A (en) Data management system for using multiple data operation modules
CN1300718C (en) Information display device and information display processing program
CN1266633C (en) Sound distinguishing method in speech sound inquiry

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication