WO1992009960A1 - Data retrieval device - Google Patents

Data retrieval device

Info

Publication number
WO1992009960A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
search
character set
code
position information
Prior art date
Application number
PCT/JP1991/000011
Other languages
English (en)
Japanese (ja)
Inventor
Cyuichi Kikuchi
Original Assignee
Telematique International Laboratories
Priority date
Filing date
Publication date
Priority claimed from JP2338546A (JPH0782504B2)
Priority claimed from JP2417609A (JPH07109603B2)
Application filed by Telematique International Laboratories
Publication of WO1992009960A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • The present invention relates to an information search processing method.
  • The present invention is particularly suitable for full-text search processing or partial-match search processing using multiple keywords. It significantly reduces the number of comparisons between the search input and the full text or registered keywords to be searched, and so provides a high-speed information search method.
  • Industrial applicability: the present invention is suitable for an information search processing method that performs full-text search or multi-keyword search in a database system.
  • Conventionally, a sequential search method is used in which the input character string specified by the searcher is taken as a keyword character string and record numbers are retrieved from the keywords that match the search conditions. Alternatively, searchable text entered from the keyboard is created and stored in the search file in index format, and keywords that match the input character string specified by the searcher and the search conditions are found using the index structure of the search file.
  • This index method is generally used as the partial-match search technique for multi-keyword search.
  • The sequential search method for multi-keyword search requires the same search time as the sequential search method for full-text search.
  • the hardware will be PI and the character string transfer between the computer that performs the search processing and the dedicated processor or LSI will take time. Therefore, realizing high-speed performance that is satisfactory for the system is an issue.
  • The index method in multi-keyword search can speed up partial-match search, but has the disadvantage that the search file becomes huge. Because of this, exact-match, prefix-match, and suffix-match searches are supported, but intermediate matches often are not. Supporting intermediate matches requires a large number of additional indexes on top of the search indexes for exact, prefix, and suffix matches, so the storage capacity of the search file becomes huge. The main reasons are that the search time increases and that maintenance of the search file is not easy. Some systems do not even support all prefixes and suffixes of keywords because of the size of the search file. However, searchers often remember distinctive characters and character strings within keywords, and these frequently fall in the middle of a keyword, so intermediate-match search has been in demand.
  • In the present invention, characters are taken from the text one at a time starting from the first character, a character set is formed from each character and the characters that follow it (r characters in total), and a search file is created from character set groups, one group per character set type.
  • Alternatively, a search file is created from character groups, one group per character type. It was found that search can be sped up by checking the continuity of character sets or of characters against the search file at search time.
  • From the above viewpoints, the present invention makes it possible to realize high-speed full-text search, or partial-match search using multiple keywords, over a large number of documents.
  • This eliminates the need to transfer character strings to dedicated processors and LSIs, and allows arbitrary character strings to be searched by focusing on character sets and character set positions, or on characters and character positions.
  • The purpose is to provide such an information search processing method.
  • A first feature of the present invention comprises: search unit identification code assigning means for dividing a character string to be searched into search units, which are the units in which search is performed, and assigning an ascending code to each search unit;
  • attribute code assigning means for assigning to each search unit an attribute code indicating the logical division of the search unit;
  • character set position order code assigning means for extracting characters from the character string to be searched one at a time, forming a character set from each character and the characters that follow it (r characters in total), and creating a character set position order code indicating the position of the first character of the character set within the search unit; and means for generating character set position information from the search unit identification code, the character set position order code, and the attribute code, and storing the character set position information in an area for each character set type to create a search file.
  • When n is the maximum number of characters in a search unit and a is the maximum number of attributes, it is desirable to give the character set position information as the numeric code {(search unit identification code × n) + character set position order code} × a + attribute code.
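The numeric code above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the sketch assumes zero-based codes with position < n and attribute < a so that the packed value can be decoded unambiguously.

```python
def encode_position(unit_id: int, position: int, attribute: int, n: int, a: int) -> int:
    """Pack the three codes as ((unit_id * n) + position) * a + attribute."""
    # Assumption: codes are zero-based and bounded, so decoding is unambiguous.
    if not (0 <= position < n and 0 <= attribute < a):
        raise ValueError("position must be < n and attribute must be < a")
    return (unit_id * n + position) * a + attribute


def decode_position(code: int, n: int, a: int) -> tuple[int, int, int]:
    """Recover (unit_id, position, attribute) from a packed code."""
    rest, attribute = divmod(code, a)
    unit_id, position = divmod(rest, n)
    return unit_id, position, attribute


code = encode_position(unit_id=3, position=17, attribute=4, n=1000, a=8)
print(code, decode_position(code, n=1000, a=8))
```

Because the search unit identification code occupies the most significant part, codes registered in search unit order sort in ascending numerical order.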
  • A second feature of the present invention is a search process that comprises the search file created by the first feature and decomposes the search input character string, from its first character, into character sets of r characters each to form a search input character set string.
  • The last character set may then have (r-1) or fewer characters, so that a character set of r characters cannot be formed.
  • A third feature of the present invention is the creation of a search file in which character position information is stored for each character type.
  • It comprises: search unit identification code assigning means for dividing the character string to be searched into search units, which are the units in which search is performed, and assigning a code to each search unit in ascending order;
  • attribute code assigning means for assigning to each of the divided search units an attribute code indicating the logical division of the search unit;
  • character position order code assigning means for assigning to each character in the character string to be searched a character position order code indicating its position in the search unit; and means for generating character position information comprising the search unit identification code, the character position order code, and the attribute code.
  • Here n is the maximum number of characters in a search unit.
  • A fourth feature of the present invention performs search processing using the search file created by the third feature. It comprises that search file and:
  • means for extracting from the search file the character position information of the same characters as those constituting the search input character string;
  • means for extracting, from the character position information of the extracted characters, the combinations in which the search unit identification code is common, the character position order codes follow the same order as the search input character string, and the attribute code is equal to that of the search input; and means for outputting, based on the extracted combinations of character position information, the search unit to which the character string belongs and the character position as the search result. It is desirable to extract the combinations of character position information in order, starting from the character of the search input with the lowest frequency of occurrence in the full text.
  • A fifth feature of the present invention relates to multi-keyword search and comprises: record identification code assigning means for assigning an ascending code to each record to be searched;
  • keyword attribute code assigning means for assigning to each keyword included in the record an attribute code indicating the logical division of the keyword;
  • character set position order code assigning means for taking characters from the keyword one at a time, forming a character set from each character and the characters that follow it (r characters in total), and assigning a character set position order code indicating the position of the first character of the character set; and means for generating character set position information comprising the record identification code, the keyword attribute code, and the character set position order code.
  • The character set position information is obtained by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, and is created by converting the character set position order code and the record identification code into an integer code.
  • Here n is the number of characters in the keyword string.
  • A sixth feature of the present invention relates to search processing on a search file created according to the fifth feature, and comprises that search file.
  • It further comprises means for extracting the combinations of character set position information in which the record identification code and the keyword attribute code are common, the difference between the character set position order codes equals the difference between the first character positions of the corresponding character sets in the search input character string, and the keyword attribute code is the same as that of the search input;
  • and means for outputting, based on the extracted combinations of character set position information, the record identification code corresponding to the search input character string as the search result.
  • It is desirable to extract the character set position information that can form the same character set string as the search input character set string starting from the character set in the search input character set string with the lowest frequency of occurrence in all keywords.
  • i is the character set position order code of the character set with a high frequency of appearance
  • j is the character set position order code of the character set with the character set position order code i. It is desirable to extract a combination of character set position information that matches i) and j).
  • the keyword is a Western character string containing symbols
  • a search file that uses kanji as character position information in units of one character and kana characters as character set position information in units of two characters can be used.
  • A seventh feature of the present invention relates to multi-keyword search using character position information in units of one character, and comprises: record identification code assigning means for assigning an ascending code to each record to be searched; keyword attribute code assigning means for assigning to each keyword in the record an attribute code indicating its logical class; character position order code assigning means for decomposing the keyword into characters and assigning to each character a character position order code indicating its position in the keyword; and means for generating character position information comprising the record identification code, the keyword attribute code, and the character position order code, and storing the character position information in an area for each character type to create a search file.
  • The character position information is obtained by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute code, and is created by converting the keyword attribute code and the character position order code into an integer code.
  • Here n is the number of characters in the keyword string.
  • Pa Keyword attribute code It is desirable to be given as the preceding numeric code in the key word sequence of the keyword attribute area of a.
  • An eighth feature of the present invention relates to a search process for a search file created according to the seventh feature, and includes a search file created according to the seventh feature, and is the same as a character constituting a search input character string.
  • The frequency of occurrence of any one character string in a document is low.
  • For example, take the headword descriptions of Kojien, the Japanese language dictionary published by Iwanami Shoten.
  • The frequency of appearance of individual kana characters there is as high as 53,200 times on average.
  • The frequency of appearance of a two-character kana string, however, is low, at 472 times on average.
  • If the search input is n characters,
  • the collation targets extracted from the whole text will be (n / 2) × 472 items of character set position information on average.
  • The frequency of appearance of two-kanji character strings is even lower than that of kana strings, so the collation targets extracted from the whole text are fewer than for kana.
  • The frequency of appearance of a JIS first-level kanji is 1,155 times on average in the headword descriptions of Kojien. For this reason, if the search input is n characters drawn from the 2,965 JIS first-level kanji, the collation targets extracted from the headword descriptions of Kojien will be n × 1155 items of character position information on average.
  • Since the search input is generally several tens of characters or less, the number of comparisons, even for a character string that includes high-frequency characters, is significantly smaller than when all characters are collated sequentially.
  • The search target is rapidly narrowed down.
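To make the arithmetic above concrete, the following Python sketch evaluates the average number of collation targets for a search input of n characters. The three frequencies are the averages quoted in the text; the function names are hypothetical, and the single-kana figure (n × 53200) is an extrapolation for comparison only, not a number stated in the source.

```python
KANA_SINGLE_FREQ = 53200   # average occurrences of one kana character (from the text)
KANA_PAIR_FREQ = 472       # average occurrences of one two-kana character set (from the text)
KANJI_SINGLE_FREQ = 1155   # average occurrences of one JIS first-level kanji (from the text)


def kana_pair_targets(n: int) -> float:
    """Average collation targets with two-character kana sets: (n / 2) * 472."""
    return (n / 2) * KANA_PAIR_FREQ


def kanji_single_targets(n: int) -> int:
    """Average collation targets with one-character kanji units: n * 1155."""
    return n * KANJI_SINGLE_FREQ


def kana_single_targets(n: int) -> int:
    """Assumed comparison case: kana indexed one character at a time."""
    return n * KANA_SINGLE_FREQ


for n in (4, 10, 20):
    print(n, kana_pair_targets(n), kanji_single_targets(n), kana_single_targets(n))
```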
  • Each time a constituent character set is matched, candidate character set strings that differ from the search input character set string are removed from the candidates obtained so far, so the search target is narrowed down with each constituent character set matched.
  • Matching is performed in order, starting from the character set in the search input with the lowest frequency of occurrence in the full text or in all keywords, so the number of comparisons can be reduced.
  • A search file is created that stores, for each character set, character set position information indicating where that character set occurs in the character string to be searched (the full text or the registered keywords). By collating the search input character set string against this search file, the number of comparisons in character string search can be greatly reduced.
  • For characters that have a low frequency of appearance, such as kanji, the number of comparisons can likewise be significantly reduced.
  • This search file is created as follows. Note that this explanation is based on an example of a character set for full-text search processing.
  • First, the character string to be searched is divided into search units.
  • When the character string to be searched is a book or a paper,
  • it is composed of the table of contents, title, chapter or section titles, text, figure or table titles, and references. Each of these parts is logically classified, so each can serve as a search unit. Books or papers are therefore divided logically into search units, and identification codes are assigned to the search units in ascending order of appearance.
  • A part can also be divided into a plurality of search units, with identification codes assigned to them in one series together with the other search units.
  • Each search unit is assigned an attribute code indicating its logical type, such as table of contents, preface, title, or text.
  • Characters are taken from the character string one at a time starting from the first character, and a character set is formed from each character and the characters that follow it (r characters in total).
  • For each character set, character set position information is created from the search unit identification code, the character set position order code indicating the position of the first character of the character set, and the attribute code of the search unit.
  • This character set position information is stored in an area provided for each character set type, so that the character string to be searched is recorded by character set type,
  • and the search file is created. The search file thus has a file structure in which character set position information is stored for each character set type.
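The creation procedure above can be sketched as a small in-memory index. This is an illustration under stated assumptions rather than the patent's file layout: `build_search_file` is a hypothetical name, position information is kept as a plain (search unit, position, attribute) tuple instead of a packed integer, positions are 1-based, and trailing positions that cannot form a full set of r characters are simply skipped.

```python
from collections import defaultdict


def build_search_file(units, r=3):
    """Build {character set: [(unit_id, position, attribute), ...]}.

    units is a list of (attribute_code, text) pairs in order of appearance;
    unit identification codes are assigned in ascending order from 1.
    """
    search_file = defaultdict(list)
    for unit_id, (attribute, text) in enumerate(units, start=1):
        # One character set per starting position, r characters in total.
        for start in range(len(text) - r + 1):
            char_set = text[start:start + r]
            search_file[char_set].append((unit_id, start + 1, attribute))
    return dict(search_file)


units = [(3, "search"), (5, "research")]
search_file = build_search_file(units)
print(search_file["sea"])
```

Because units are processed in ascending order, the entries inside each character set group come out already sorted.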
  • At search time, the search input is decomposed from its first character into character sets of r characters each to form a search input character set string, and the character set position information of the same character sets is looked up in the search file.
  • From it, the combinations of character set position information are checked and extracted in which the search unit identification code is common, the difference between the character set position order codes equals the difference between the first character positions of the corresponding character sets in the search input character string, and the attribute code is the same.
  • When the search input string is decomposed from its first character into character sets of r characters, the last character set may have (r-1) or fewer characters, so that a character set of r characters cannot be formed. In that case, as many characters as are missing are taken from the end of the character set immediately before the last one and prepended to the last character set to form a character set of r characters.
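The decomposition rule, including the handling of a short final set, can be sketched as follows. The function name and the 0-based offsets it returns are hypothetical choices for illustration; rejecting inputs shorter than r is also an assumption, since the text does not say how such inputs are handled.

```python
def decompose_query(query: str, r: int = 3) -> list[tuple[int, str]]:
    """Split query into sets of r characters, returned as (offset, char_set) pairs."""
    if len(query) < r:
        raise ValueError("query shorter than r characters")  # assumed behaviour
    sets = []
    for start in range(0, len(query), r):
        chunk = query[start:start + r]
        if len(chunk) < r:
            # Borrow the missing characters from the end of the previous set,
            # so the last set overlaps its predecessor.
            start = len(query) - r
            chunk = query[start:]
        sets.append((start, chunk))
    return sets


print(decompose_query("document"))
```

Keeping the offset of each set alongside its characters is what later allows position differences to be compared against the search file.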
  • This matching process checks the continuity of the character set string and the attribute match between the search input and the search file. From the character set position information in the search file, it extracts the combinations of character sets in which the search unit identification code is common, the difference between the character set position order codes equals the difference between the first character positions of the corresponding character sets in the search input character string, and the attribute code is the same as that of the search input.
  • When a character string that matches the search input is found in this way, the search unit, obtained from the search unit identification code, and the character position of each character in the character set, counted from the first character of the search unit, are extracted and output to the searcher as the search result.
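The continuity check described above can be sketched end to end. Everything here is illustrative: the helper names are hypothetical, r is fixed at 3, entries are plain tuples, and set intersection stands in for whatever merge procedure the actual device uses. Each entry is shifted back by the offset of its character set in the query, so entries that belong to one continuous occurrence collapse to the same (search unit, start position) pair.

```python
from collections import defaultdict

R = 3  # characters per character set (assumed for this sketch)


def build_index(units):
    """units: list of (attribute, text). Returns {char_set: [(unit, pos, attr)]}."""
    index = defaultdict(list)
    for unit_id, (attr, text) in enumerate(units, start=1):
        for start in range(len(text) - R + 1):
            index[text[start:start + R]].append((unit_id, start + 1, attr))
    return index


def split_query(query):
    """Sets of R characters; a short tail borrows from the previous set."""
    if len(query) < R:
        raise ValueError("query shorter than R characters")  # assumed behaviour
    starts = list(range(0, len(query) - R + 1, R))
    if len(query) % R:
        starts.append(len(query) - R)
    return [(s, query[s:s + R]) for s in starts]


def search(index, query, attr):
    """Return sorted (unit_id, position of the query's first character)."""
    candidates = None
    for offset, char_set in split_query(query):
        shifted = {(u, p - offset) for (u, p, a) in index.get(char_set, []) if a == attr}
        candidates = shifted if candidates is None else candidates & shifted
    return sorted(candidates)


index = build_index([(3, "full text search"), (5, "a text search method"), (5, "context")])
print(search(index, "text search", 5))
```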
  • When single characters are used instead of character sets, a search file is created by storing the character position information of each character of the full text in the area for its character type.
  • The search input character string is decomposed into characters, the character position information of each character is extracted from the search file, and the combinations of character position information are extracted in which the search unit identification code is common, the characters are in the same order as in the search input character string, and the attribute code is the same as that of the search input. The search unit and character position are output as the search result.
  • In multi-keyword search, each record having keywords is assigned a record identification code in ascending order of registration, and each keyword is assigned a keyword attribute code indicating the logical type of the keyword.
  • Each character or character set in a keyword is given a character position order code or character set position order code. Character position information or character set position information is created from these three codes and stored in the area for each character type or character set type to create a search file.
  • At search time, a pair consisting of the search input character string and its attribute is input.
  • The search input character string is decomposed into characters or character sets, and the character position information or character set position information of the same characters or character sets
  • is extracted from the search file. From it, the combinations of character position information or character set position information are extracted in which the record identification code is common, the character position order codes or character set position order codes follow the same order as the search input, and the keyword attribute code is the same as that of the search input.
  • The record identification number is extracted as the search result from the extracted combinations of character position information or character set position information.
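A sketch of the multi-keyword case using one-character position information follows. It is a simplified illustration: the function names are hypothetical, and the entry layout (record, attribute, keyword number, position) is an assumption. In particular, the keyword number `kw_no` is not in the source; it is added here only to keep several keywords with the same attribute in one record apart.

```python
from collections import defaultdict


def build_keyword_file(records):
    """records: {record_id: [(attribute, keyword), ...]}.

    Returns {character: {(record_id, attribute, kw_no, position), ...}}.
    """
    search_file = defaultdict(set)
    for record_id in sorted(records):
        for kw_no, (attr, keyword) in enumerate(records[record_id]):
            for pos, ch in enumerate(keyword, start=1):
                search_file[ch].add((record_id, attr, kw_no, pos))
    return search_file


def partial_match(search_file, query, attr):
    """Record ids having a keyword with this attribute that contains query."""
    if not query:
        return []
    candidates = None
    for offset, ch in enumerate(query):
        # Shift each entry back to where the query's first character would sit.
        shifted = {(rec, kw, pos - offset)
                   for (rec, a, kw, pos) in search_file.get(ch, ()) if a == attr}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    return sorted({rec for (rec, _, _) in candidates})


records = {1: [(1, "database"), (2, "Kikuchi")], 2: [(1, "metadata")], 3: [(2, "data")]}
kw_file = build_keyword_file(records)
print(partial_match(kw_file, "data", 1))
```

Because matching works on positions rather than on prefixes, prefix, suffix, and intermediate matches are all handled by the same procedure.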
  • FIG. 1 is a configuration example of an information search processing device used in an embodiment of the present invention.
  • FIG. 2 shows an example of a search file according to the first embodiment.
  • FIG. 3 is a list of the second and third character combinations in each character set group of the first embodiment.
  • FIG. 4 shows the character set group address table of the first embodiment.
  • FIG. 5 shows an example of registration of a search file according to the first embodiment.
  • FIG. 6 is a flowchart illustrating a search file creation processing procedure according to the first embodiment.
  • FIG. 7 is a flowchart illustrating a search processing procedure according to the first embodiment.
  • FIG. 8 shows a search file according to the second embodiment.
  • FIG. 9 shows a list of character set groups according to the second embodiment.
  • FIG. 10 is a character set group address table according to the second embodiment.
  • FIG. 11 shows an example of registration of a search file according to the second embodiment.
  • FIG. 12 is a character column address table of the third embodiment.
  • FIG. 13 shows an example of registration of a search file according to the third embodiment.
  • FIGS. 14A and 14B are flowcharts for explaining a search file creation processing procedure according to the third embodiment.
  • FIG. 15 is a flowchart illustrating a search processing procedure according to the third embodiment.
  • FIG. 16 shows an example of a keyword sequence according to the fourth embodiment.
  • FIG. 17 shows an example of character set position information creation according to the fourth embodiment.
  • FIG. 18 shows an example of registration of a search file according to the fourth embodiment.
  • FIGS. 19A and 19B are flowcharts illustrating a search file creation procedure according to the fourth embodiment.
  • FIGS. 20A and 20B are flowcharts illustrating a search processing procedure according to the fourth embodiment.
  • FIG. 21 shows an example of a keyword string according to the fifth embodiment.
  • FIG. 22 shows an example of character set position information creation according to the fifth embodiment.
  • FIG. 23 shows an example of registration of a search file according to the fifth embodiment.
  • FIG. 24 shows an example of character position information creation according to the sixth embodiment.
  • FIG. 25 shows an example of registration of a search file according to the sixth embodiment.
  • FIG. 26A is a flowchart illustrating a search file creation procedure according to the sixth embodiment.
  • FIGS. 27A and 27B are flowcharts illustrating a search processing procedure according to the sixth embodiment.
  • FIG. 1 shows the configuration of an information search processing device according to an embodiment of the present invention.
  • The information search processing device of the present embodiment has a CPU 1 that performs various arithmetic and determination processing; a memory 2 that stores programs for search processing, search file creation, and so on, search files created or used in search processing, search inputs, and so on; an input/output unit 3 for connecting a keyboard 4 and a display 5;
  • an external storage device control unit 6 for connecting an external storage device 7 for storing information; and a common bus 8 connecting the CPU 1, the memory 2, the input/output unit 3, and the external storage device control unit 6.
  • the first embodiment is an embodiment in which a European character document is targeted for full-text search.
  • In this embodiment, characters are taken one at a time from the first character of the character string to be searched, and a character set is formed from each character and the characters that follow it.
  • The processing consists of search file creation processing, which creates a search file made up of character set groups grouped by character set type, and search processing, which collates the search input against the search file and extracts the character strings that match it.
  • The search file creation processing can be roughly divided into three steps: 1) reserving the search file area, 2) creating character set position information and assigning it to each character set, and 3) registering the character set position information in the search file, grouped by character set type. Each of these steps is described below.
  • the search file is composed of a character set group arranged in the character order of ASCII codes “20” to “7F” described in the ASCII code table.
  • Each character set group consists of three characters whose first character is the character that represents the name of each character set shown in Fig. 2.
  • the second and third characters of each character set group consist of the characters described in the ASCII code table as shown in Figure 3.
  • For example, character set group A consists of all three-character sets whose first character is "A". Character sets of three characters in total are created from the text to be searched, and their appearance frequency is counted.
  • From this, the number of items of character set position information to be registered in each character set type group of the search file is known, so an area can be reserved for the search file made up of all character set type groups.
  • The start address of each character set type group stored contiguously in the search file can also be determined.
  • The character set group address table shown in FIG. 4 lists the start addresses of these character set type groups in the order in which the character sets are described in FIGS. 2 and 3.
  • The character set position information described here consists of the search unit number, indicating the order of appearance of the search unit to which the character set belongs; the character set position number, indicating where the character set appears in the search unit by the position of its first character; and the attribute number, indicating the logical type of the search unit.
  • search units and their attributes will be described.
  • a typical book consists of a table of contents, preface, chapter or section titles, body text, figure or table titles, references, etc., and appears in this order.
  • When searching the contents of such a book, it is convenient to use these parts as the search units and to output the matching search unit as the search result, which often suits the purpose of the search. In actual searches, it is often the case that only the titles or only the text is specified as the search target, depending on the purpose.
  • Since a search unit thus reflects the logical classification of the character string to be searched,
  • an attribute number is given to each search unit according to its logical division. For example, as attribute numbers, assign "1" to the table of contents, "2" to the preface, "3" to chapter or section titles, "4" to figure or table titles, "5" to the text, and "6" to the references.
  • Numbers are assigned to the search units in ascending order from 1. These are used as the search unit numbers.
  • If the text is long, it can be divided appropriately into multiple search units, with search unit numbers assigned to each in order of appearance.
  • Characters are taken from the beginning of the search unit one at a time, a character set is formed from each character and the characters that follow it, and character set position numbers are assigned to the character sets in ascending order: 1, 2, 3, and so on.
  • Two special characters EM (end mark)
  • From the search unit number, character set position number, and attribute number given above, an integer code is computed to create the character set position information.
  • When the maximum number of characters in a search unit is n and the maximum number of attributes is a, this character set position information is given as {(search unit number × n) + character set position number} × a + attribute number.
  • The character set type groups are stored in the search file in the order described in FIG. 2 and FIG. 3. The character set position information is then registered in each character set type group.
  • Character set position information is registered by storing it at the head of the unstored area of the corresponding character set type group. Therefore, if registration proceeds in search unit order, the character set position information within each character set type group is in ascending numerical order.
  • FIG. 5 shows an example in which the character set position information of "document" described above is registered in a search file. The character set position information in each group is stored in ascending order. If each item of character set position information is 4 bytes, the file capacity is as shown below.
  • Additional registration is performed by adding the new character set position information at the head of the unstored area of the group corresponding to each character set of the added document.
  • Deletion is performed by overwriting the relevant character set position information, in the group corresponding to each character set of the deleted document, with a special symbol (here, the ASCII code “0000”). As a result, additional registration and deletion can be performed in a short time.
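Additional registration and deletion as described above can be sketched with an append-only list per character set group. This is an assumed in-memory model, not the patent's file format: the function names are hypothetical, and the integer 0 stands in for the special deletion symbol.

```python
TOMBSTONE = 0  # assumed stand-in for the special deletion symbol


def add_unit(search_file, unit_id, attr, text, r=3):
    """Append the unit's entries at the end of each character set group."""
    for start in range(len(text) - r + 1):
        group = search_file.setdefault(text[start:start + r], [])
        group.append((unit_id, start + 1, attr))


def delete_unit(search_file, unit_id, text, r=3):
    """Overwrite the unit's entries in place instead of compacting the groups."""
    for start in range(len(text) - r + 1):
        group = search_file.get(text[start:start + r], [])
        for i, entry in enumerate(group):
            if entry != TOMBSTONE and entry[0] == unit_id:
                group[i] = TOMBSTONE


def live_entries(search_file, char_set):
    """Entries of a group with deleted slots skipped."""
    return [e for e in search_file.get(char_set, []) if e != TOMBSTONE]


search_file = {}
add_unit(search_file, 1, 5, "search")
add_unit(search_file, 2, 5, "research")
delete_unit(search_file, 1, "search")
print(search_file["sea"], live_entries(search_file, "sea"))
```

Neither operation moves existing entries, which is why both can complete quickly; the cost is that deleted slots remain until the file is rebuilt.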
  • The character set position information stored for each character set type group in this search file can be retrieved by using the start addresses of the character set groups in the character set group address table of FIG. 4 as a directory.
  • FIG. 6 shows the flow of the above search file creation process.
  • the search input character string is decomposed from the first character into a character set consisting of three characters, and a search input character set string is created.
  • the search input character set sequence is rearranged in ascending order of full-text occurrence frequency.
  • the search unit number and the character position numbers, which indicate the position of each character in the character set counted from the first character in the search unit, are output as search results.
  • the search input string is decomposed into a character set consisting of three characters from the first character so that it can be compared with the character set stored in the search file, and is used as the search input character set.
  • when the search input character string is divided into three-character units from the first character, the last character set may be shorter than three characters, so that a three-character set cannot be created.
  • in that case, as many characters as are missing are taken from the end of the character set immediately before the last character set and are prepended to the last character set to create a three-character set.
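  • As a minimal sketch of the decomposition just described, the following Python function splits a search input into three-character sets and, when the last set would be short, borrows the missing characters from the end of the preceding set. The function name `decompose`, the 0-based start offsets, and the handling of inputs shorter than one set are assumptions made for illustration and are not taken from the text.

```python
def decompose(query: str, k: int = 3) -> list[tuple[int, str]]:
    """Split query into k-character sets.

    Returns (start_offset, char_set) pairs; offsets are 0-based here.
    """
    if len(query) < k:
        # Not covered by the text. Assumption: return the short input unchanged.
        return [(0, query)]
    # Full sets taken from the first character in steps of k.
    sets = [(i, query[i:i + k]) for i in range(0, len(query) - k + 1, k)]
    if len(query) % k:
        # The last set would be short, so start it k characters before the
        # end; this reuses the tail of the preceding set as the text describes.
        start = len(query) - k
        sets.append((start, query[start:]))
    return sets


print(decompose("document"))
```

  For the eight-character input "document" this yields the sets "doc", "ume", and "ent", the last one overlapping the preceding set by one character.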
  • the character set group head addresses in the character set group address table, which indicate the start address of each character set type group in the search file, are referred to in order to check the full-text occurrence frequency of each search input character set, and the search input character set sequence is sorted in ascending order of full-text occurrence frequency.
  • the head address in the character set group address table indicates the head address of each character set type group stored in the search file. From the difference between consecutive head addresses, the number of character set position information entries stored in each character set type group, and hence the full-text occurrence frequency of each character set type, can be determined.
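  • The frequency check can be sketched as follows, assuming that the groups are laid out contiguously, that each entry occupies 4 bytes as in the example in the text, and that the end address of the last group is known. The names `group_counts` and `sort_by_frequency` and the sample addresses are hypothetical.

```python
ENTRY_SIZE = 4  # bytes per position entry, as in the 4-byte example in the text


def group_counts(head_addresses: dict[str, int], end_address: int) -> dict[str, int]:
    """Entries per group, from differences of consecutive head addresses."""
    ordered = sorted(head_addresses.items(), key=lambda item: item[1])
    next_heads = [head for _, head in ordered[1:]] + [end_address]
    return {
        name: (nxt - head) // ENTRY_SIZE
        for (name, head), nxt in zip(ordered, next_heads)
    }


def sort_by_frequency(char_sets: list[str], counts: dict[str, int]) -> list[str]:
    """Order the search input character sets from rarest to most frequent."""
    return sorted(char_sets, key=lambda cs: counts.get(cs, 0))


# Hypothetical address table: head address of each character set type group.
table = {"doc": 0, "ent": 40, "ume": 48}
counts = group_counts(table, end_address=60)
print(counts)
print(sort_by_frequency(["doc", "ume", "ent"], counts))
```

  Because only the address table is consulted, the ordering is obtained without reading any position entries from the search file.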
  • the number of matches with the character set position information of each character set stored in the search file can be extremely reduced by performing collation matching from character sets with low occurrence frequency of all sentences. That is, when checking the continuity of each character set by comparing the character set position information, the search unit number, the character set position identification number, and the attribute number in the character set position information in the two character set type groups are compared. Therefore, if the number of character set position information stored in the two character set type groups is small, the number of times of collation can be reduced accordingly. Therefore, when collating character set position information, collation is performed from a character set with a low frequency of full-text appearance, thereby reducing the number of times of collation. In particular, as the number of search input characters increases, the rate of inclusion of a character set with a low appearance frequency increases, so the reduction effect is large.
  • the character set position information stored in each character set type group is extracted by referring to the character set group address table, starting from the character set with the lowest full-text occurrence frequency. Then, working in order from the character set type group with the lowest full-text occurrence frequency, combinations of character set position information are extracted for which the search unit is the same and the difference in character set position number equals the difference in first-character position of the corresponding character sets in the search input character string.
  • the comparison of the character set position information difference is as follows:
  • in the comparison between character set type groups, character set continuity is checked by taking the difference between the character set position information of the character set type group with the lower full-text occurrence frequency and that of the character set type group with the higher full-text occurrence frequency.
  • the number of times of collation is reduced by deleting discontinuous character set position information from the collation target.
  • the number of matches between these two groups is only 7 times in total, and it is not necessary to check all character set position information in the group.
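  • A minimal sketch of this collation is given below. It assumes that each group holds its position codes in ascending order, that one step in character position changes the code by a fixed weight (10 here, inferred from example codes such as “801215” and “801235” elsewhere in the text), and that adding the offset never carries into the search unit field. The function name `collate` and the sample data are hypothetical.

```python
from bisect import bisect_left

POS_WEIGHT = 10  # assumed code difference per character position


def collate(rare: list[int], common: list[int], offset: int) -> list[tuple[int, int]]:
    """Pair codes from the rarer group with continuous codes in the other group.

    offset is the first-character position of the common set minus that of
    the rare set in the search input. Both lists are sorted ascending.
    """
    delta = offset * POS_WEIGHT
    matches = []
    for code in rare:  # iterate over the smaller (rarer) group only
        target = code + delta
        i = bisect_left(common, target)  # binary search in the larger group
        if i < len(common) and common[i] == target:
            matches.append((code, target))
    return matches


rare = [301115, 801215, 901005]
common = [200455, 801235, 801405]
print(collate(rare, common, offset=2))
```

  Driving the loop from the rarer group keeps the number of comparisons proportional to the size of that group, which is the point of sorting the character sets by occurrence frequency first.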
  • the character set position information that matches the attribute specified in the search input can be extracted.
  • the search unit number and the character position numbers, which indicate the position of each character in the character set counted from the first character in the search unit, are extracted as search results. If there is more than one search input, then for the second and subsequent search inputs, the character set position information having the search unit numbers obtained so far is first extracted from the character set type group corresponding to the first character set of that search input, and the processing is then continued from the next character set of that search input. In this way, only the character sets contained in the same search units as the results of the first search input are extracted for the second and subsequent search inputs.
  • the search unit number “8” to which this character string belongs and the character position numbers “121 to 127” are output as the search results.
  • This search processing operation is shown as a flowchart in FIG.
  • the search input is taken out, the search input character string is divided into character sets in units of three characters from the first character, and a search input character set string is created.
  • the attribute number ai is set, and the appearance frequency of each character set is checked with reference to the character set group address table and the character sets are sorted in ascending order of frequency (S41 to S44). Then, the character set position information stored in the character set type group corresponding to each rearranged character set is extracted from the search file (S45).
  • let i be the character set position number in the character set position information of the character set with the lower full-text occurrence frequency in the search input character set string, and let j be the character set position number of the character set with the higher full-text occurrence frequency.
  • the character set position information having the attribute number ai is selected from the character set position information, and the search unit and character sets that match the search input are selected.
  • a character position number indicating the position of each constituent character from the first character in the search unit is output as a search result (S49, S50). If the collation is to be continued in step S48, the character set position information of the previous matching result is collated with the character set position information stored in the character set type group corresponding to the next character set in the rearranged search input character set string (S46).
  • a Japanese character string contains kanji. Kanji have far more character types than Western alphabetic characters, so the same kanji recurs much less frequently than a Western character does. For example, although many terms in Japanese character strings use the two-character string “communication”, four-character strings that share it, such as “communication line” and “communication device”, each occur very infrequently. Katakana and hiragana characters also have more character types than European characters.
  • the search process can therefore be performed quickly even with a search file organized by character type for each kanji character, or with a character set search file using two-character sets.
  • In this second embodiment, a description will be given of search file creation and search processing using a character set composed of two characters.
  • This second embodiment is basically the same as the first embodiment in which a character set consisting of three characters is processed. However, the difference is that a search file and a character set group address table are created using the JIS code table because Japanese processing is performed.
  • the search file of the second embodiment is composed of a character set group arranged in the character order described in the JIS code table as shown in FIG.
  • Each character set group consists of character set type groups of two-character strings whose first character is the listed character, arranged in the order shown in the JIS code table, as shown in the character set group list in Fig. 9.
  • the character set group address table shown in FIG. 10 is a table in which the head addresses of the character set type groups are arranged in the order in which they are described in the character set group list in FIG.
  • “Correspondence document” in this character string is decomposed into the character sets “Communication”, “Response”, “Document”, and “Calligraphy”.
  • character set position information of “801215”, “801225”, “801235”, and “801245” is given, and the character set position information is stored in the area of the search file.
  • Figure 11 shows an example of storing the character set position information of this “correspondence document” in a search file. Since the procedure of the search file creation process is the same as that of the first embodiment, the flowchart is omitted.
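  • As a hedged illustration, the example codes “801215” to “801245” are consistent with a decimal layout in which the search unit number, the character set position number, and the attribute number occupy separate digit fields. The weights below are assumptions inferred from those example values, and the names `encode` and `decode` are hypothetical.

```python
UNIT_WEIGHT = 100_000  # assumed weight of the search unit number
POS_WEIGHT = 10        # assumed weight of the character set position number


def encode(unit: int, position: int, attribute: int) -> int:
    """Pack the three fields into one integer position code."""
    return unit * UNIT_WEIGHT + position * POS_WEIGHT + attribute


def decode(code: int) -> tuple[int, int, int]:
    """Recover (search unit, position, attribute) from a position code."""
    unit, rest = divmod(code, UNIT_WEIGHT)
    position, attribute = divmod(rest, POS_WEIGHT)
    return unit, position, attribute


# Search unit 8, character set positions 121 to 124, attribute 5 (assumed).
codes = [encode(8, p, 5) for p in range(121, 125)]
print(codes)
print(decode(801235))
```

  With this layout, two entries in the same search unit and with the same attribute differ by exactly ten times their position difference, which is what makes collation by integer subtraction possible.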
  • the input search input character string is decomposed from the first character into a character set in units of two characters, and a search input character set string is created.
  • a character set type group corresponding to each character set is extracted from the search file and collated, and combinations of character set position information that can form the search input character set string are extracted. From the extracted character set position information, the character set position information having the same attribute as the search input is extracted as a collation match.
  • the search unit number and the character position numbers, which indicate the position of each character in the character set counted from the first character in the search unit, are output as search results.
  • when a search input string is decomposed from the first character into character sets of two characters, the last character set may consist of only one character, so that a two-character set cannot be created. In this case, one character is taken from the end of the character set immediately before the last character set and is prepended to the last character to create a two-character set.
  • the search input character set is "communication" and "document”.
  • when the full-text appearance frequency is in the order “communication”, “document” and the matching is performed in this order, the character set position information in the “communication” character set group field of the search file is first collated with that in the “document” character set group field. Since the character positions of “tsu” and “sentence” in the search input “correspondence document” are “1” and “3”, respectively, the character set position information whose difference corresponds to this position difference is extracted, and “801215” in “communication” and “801235” in “document” of the search file in FIG. 11 can be extracted as a continuous combination.
  • search condition is “body”
  • the position numbers “121 to 124” are extracted as a search result.
  • the procedure of the search process is the same as that of the first embodiment, so that the flowchart is omitted.
  • the third embodiment differs from the second embodiment only in whether a search file organized by character set type or a search file organized by single character type is created.
  • the processing is basically the same.
  • the character address table and the search file are slightly different from those in the second embodiment because a character type group is generated for each character.
  • the characters constituting the entire Japanese text are classified, the appearance frequency is counted for the character types described in the JIS code table, and the area for the search file is secured.
  • a character column address table in which the head addresses of the character type groups, corresponding to FIG. 10 of the second embodiment, are arranged in the order described in the JIS code table is created as shown in FIG.
  • Compared with the character set group address table of the second embodiment, this character column address table lists a head address for each character type, and since the character types conform to JIS Level 1 and JIS Level 2, only 8836 character fields, including unused codes, are required.
  • character position information code = search unit number × n + character position number × a + attribute number
  • Character type groups are stored in the search file in the order described in the JIS code table based on the character column address table shown in FIG. As a result, a search file shown in FIG. 13 in which character position information is stored by being divided into character type groups is created.
  • Figure 14 shows a flowchart of this search file creation process.
  • the head address of the character column in the character column address table corresponding to each constituent character of the search input character string is calculated.
  • the characters of the search input character string are rearranged in ascending order of appearance frequency, and the character position information stored in the character type group corresponding to each character is extracted. Based on the extracted character position information, and working in order from the character type group with the lowest occurrence frequency, combinations of character position information are extracted for which the search unit is the same and the difference in character position number equals the character position difference in the search input string.
  • the collation of this character position information is as follows. When the character position number of the character with the lower full-text frequency is i and the character position number of the character with the higher full-text frequency is j,
  • character position information having a common search unit and character continuity between character type groups is extracted, and character position information having the same attribute as the search input is extracted from the extracted character position information.
  • a search unit and a character position that match the search input are extracted from the character position information that matches.
  • the full-text appearance frequency of each character is in the order of “writing”, “sentence”, “shin”, “tsu”, and the collation is performed in this order.
  • the difference between the character position information extracted from the character column of “Book” and the character position information extracted from the character column of “Sentence” in the search file using the above equation (5) is “1-10”.
  • the character position information “801245” in the “book” of the search file and “801235” in the “sentence” can be extracted as continuous character position information.
  • a search file for each kanji character and for a continuous katakana character and a hiragana character as a two-character set.
  • katakana characters are often used as technical terms, and kana characters may be entered as search input character strings.
  • continuous katakana and hiragana characters are used.
  • Creating a search file as a two-character set is also effective for speeding up the search.
  • An example of a book search system will be described as a multi-keyword information search method. Records in the book search system consist of keywords such as book title, author name, publisher name, year of publication, and abstract. Then, each record containing this keyword is registered to create a search file, and a key word or a partial character string of the key word is input as a search input to search and output a corresponding record. The creation of this search file will be described.
  • record identification codes are assigned to the records to be searched in ascending order according to the registration order.
  • a keyword type code indicating the attribute is assigned with the logical type of the keyword included in each record as an attribute.
  • keyword attribute codes indicating attributes such as the book title, author name, publisher name, publication year, and abstract are assigned, and a logical association is made between the search input and the keywords of the book search system. ing.
  • the searcher specifies a keyword for storing the book to be searched for as a search input.
  • the keyword is decomposed into single characters or character sets, and each character is assigned a character position sequence code indicating its position from the beginning of the keyword, or each character set is assigned a character set position sequence code indicating the position of its first character from the beginning of the keyword.
  • From the record identification code, the keyword attribute code, and the character position sequence code or character set position sequence code, character position information for each character of the keyword, or character set position information for each character set, is generated. At this time, the first character position of the keyword, preset for each keyword attribute code, is added to the character position information or character set position information as a constant, so that the keyword attribute can be represented by the character position.
  • This character position information or character set position information is grouped by character type or character set type, and these groups are assembled to create a search file. Therefore, this search file has a file structure in which character position information is stored for each character type or character set position information for each character set type.
  • a search input character string and a search input character string attribute are each input.
  • the search input string is decomposed into individual characters or character sets, and the character position information of the same characters as those making up the search input, or the character set position information of the same character sets as those making up the search input, is retrieved from the search file.
  • character position information or character set position information is collated and extracted for which the record identification code and the keyword attribute code are common, the character position sequence code or character set position sequence code is in the same order as that of the search input string, and the keyword attribute code is the same as that of the search input. From the extracted character position information or character set position information, record identification codes common to all search input character strings are extracted as search results.
  • for the keyword sequence created from the multiple keywords of each record to be searched, the constituent characters of each keyword are taken out one at a time starting from the first character of the keyword sequence, a character set of three characters in total is formed from that character and the characters that follow it, and a search file is created consisting of character set type groups grouped by character set type.
  • this search file creation processing includes (1) securing a search file area, (2) assigning character set position information to each character set, and (3) storing the character set position information, grouped by character set type, in the search file.
  • the search file is composed of character set groups arranged in the order of the characters listed in the ASCII code table.
  • the second and third characters of each character set group are configured as shown in the second and third character combination list of the character set group in FIG. 3, as in the first embodiment. They are arranged in the order described in the character set group address table.
  • the character set position information described here is assigned to each character set composing the keywords in a keyword sequence, which is created by arranging each keyword of the record in the keyword attribute area corresponding to its keyword attribute number.
  • the record number will be described.
  • a general book search system searches books using keywords such as book name, author name, publisher name, year of publication, and abstract.
  • the record is a search target composed of the keywords of book title, author name, publisher name, publication year, and abstract. No.
  • a searcher specifies a book to be searched by using a keyword as a search input or by searching for a stored keyword.
  • the book search system assigns keyword attributes to keywords such as the book name, author name, publisher name, year of publication, and abstract, and thereby establishes a logical association between the search input and the keywords in the book search system.
  • “1” is assigned to the book name, “2” to the author name, “3” to the publisher name, “4” to the publication year, and “5” to the abstract as the keyword attribute numbers.
  • the character set position number is obtained as follows. For each keyword, one character at a time is taken from the beginning of the keyword, a character set of three characters in total is formed from that character and the characters that follow it, and numbers 1, 2, 3, ... are assigned in the order of creation as the character set position numbers. After the last character of the keyword, two special symbols EM (end mark) indicating the end of the keyword are added; the trailing characters are concatenated with these EM symbols to form character sets, which are also given character set position numbers. The EM symbol is assigned the ASCII code “7F” of “DEL” in the ASCII code table. Next, the keyword string will be described.
  • the keyword string is a character string formed by connecting all the keywords of the record. That is, the keywords are arranged in fixed-length keyword attribute areas corresponding to the keyword attribute numbers, and a keyword sequence is created. Thus, the attribute of the keyword to which a character set belongs can be determined from the character position in the keyword string. Note that, following each keyword attribute area, an EM symbol indicating the delimitation of the keyword attribute area is placed in the keyword string. This EM symbol is the same as the special symbol EM indicating the end of a keyword.
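  • The formation of keyword character sets can be sketched as follows, assuming three-character sets and the end mark code 7F stated in the text. The function name `keyword_char_sets` is hypothetical, and the sample keyword is only illustrative.

```python
EM = "\x7f"  # end mark: ASCII code 7F (DEL), as stated in the text


def keyword_char_sets(keyword: str, k: int = 3) -> list[tuple[int, str]]:
    """Return (position_number, char_set) pairs for one keyword.

    One set starts at every character of the keyword; position numbers
    start at 1 and follow the order of creation.
    """
    padded = keyword + EM * (k - 1)  # two end marks for three-character sets
    return [(p + 1, padded[p:p + k]) for p in range(len(keyword))]


for number, char_set in keyword_char_sets("Electro"):
    print(number, repr(char_set))
```

  For the keyword "Electro", the sets "Ele", "ctr", and "tro" receive position numbers 1, 4, and 5, and the last set consists of "o" followed by two end marks.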
  • the character set position information is created by converting all the character sets constituting the keyword from the record number, the keyword attribute number, and the character set position number to codes consisting of integers.
  • This character set position information is an integer code given by the following equation (6).
  • the keyword sequence is as shown in FIG.
  • the character set position information of each character set is configured as shown in FIG.
  • the character set position information is composed of four-byte codes in this way, it is possible to handle 2 32 ⁇ 1169.36 million keyword strings with 1169 characters ⁇ ) o
  • the character set position information assigned to each character set is registered in a search file.
  • the character set type groups are stored in the search file in the order described in the ASCII code table shown in Figs. Then, the character set position information of each character set is registered in each character set type group. The registration of the character set position information is performed by storing the character set position information at the head of the unstored area of the corresponding character set type group. For this reason, if record numbers are assigned in the order of registration, character set position information will be registered in ascending numerical order in the character set type group.
  • Figure 18 shows an example of registering the character set position information of the above-mentioned book name "Electronic Publishing" in a search file.
  • the character set position information in each group is stored in ascending order.
  • This file size is, if the character set position information is 4 bytes,
  • additional registration is performed by adding new character set position information at the head of the unstored area of the group corresponding to each character set of each keyword in the additional record. Deletion is performed by changing the character set position information in the group corresponding to each character set of each keyword of the deleted record to a special symbol (here, ASCII code "0000"). As a result, addition and deletion can be performed in a short time.
  • the character set position information of each character set in this search file can be obtained by using the head address of each character set group in the character set group address table of FIG. 4 shown in the first embodiment as a directory.
  • Figures 19a and 19b show the flow of the above search file creation process.
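  • The append-and-mark scheme can be sketched with a small in-memory stand-in for one character set type group. The class name `CharSetGroup`, the use of the integer 0 as the deletion symbol, and the fixed capacity are assumptions for illustration; the text only states that a new entry goes at the head of the unstored area and that a deleted entry is overwritten with a special symbol.

```python
DELETED = 0  # assumed stand-in for the special deletion symbol


class CharSetGroup:
    """Fixed-size area holding the position codes of one character set type."""

    def __init__(self, capacity: int) -> None:
        self.slots: list[int | None] = [None] * capacity
        self.used = 0  # index of the head of the unstored area

    def add(self, code: int) -> None:
        """Store a code at the head of the unstored area."""
        if self.used == len(self.slots):
            raise OverflowError("group area is full")
        self.slots[self.used] = code
        self.used += 1

    def delete(self, code: int) -> None:
        """Overwrite matching entries in place; nothing is shifted."""
        for i in range(self.used):
            if self.slots[i] == code:
                self.slots[i] = DELETED

    def live_entries(self) -> list[int]:
        """Stored codes that have not been marked as deleted."""
        return [c for c in self.slots[:self.used] if c != DELETED]


group = CharSetGroup(capacity=4)
group.add(116901)
group.add(245001)
group.delete(116901)
print(group.live_entries())
```

  Neither operation moves existing entries, so the ascending order of the remaining codes within the group is preserved.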
  • the frequency of occurrence of the character set type is counted to create a character set column address table (S111, 112), and an area for the search file is secured (S113).
  • the character set column directory (character set column head address), which indicates the area of the search file storing the character set type group of the character set at character set position number P, is extracted from the character set column address table (S120), and the character set position information is stored at the head of the unstored area of the search file indicated by the character set column directory (S121).
  • the process proceeds to the next keyword processing (S124, S125).
  • the registration processing is completed (S126).
  • the search process has the following configuration, as in the first embodiment.
  • the search input character string is decomposed from the first character into a character set consisting of three characters, and a search input character set string is created.
  • character set type groups are retrieved from the search file in the order of the rearranged character set string, and combinations of character set position information that can form the search input character set string are extracted from the character set position information stored there.
  • the search input character string is decomposed into three-character units from the first character so that it can be compared with the character sets stored in the search file.
  • the last character set may be less than three characters, and a character set may not be created. At this time, it extracts the missing characters from the end of the character set immediately before the last character set and concatenates them with the front of the last character set to create a three-character character set.
  • the occurrence frequency of each search input character set is checked by referring to the character set group head addresses in the character set group address table, which indicate the head address of each character set type group in the search file.
  • character set position information stored in each character set type group column is extracted, starting from the character set with the lowest occurrence frequency, by referring to the character set group address table. Then, based on the extracted character set position information and working in order from the character set type group with the lowest occurrence frequency, combinations of character set position information are extracted for which the record number and the keyword attribute number are the same and the difference in character set position number equals the first-character position difference of the corresponding character sets in the search input character string.
  • This character set position information collation is based on the case where the character set position number with low occurrence frequency is i and the character set position number with high appearance frequency is j in all keywords in the search input character set string.
  • the keyword attribute is verified from the character set position number of the character set position information obtained by the character string collation. That is, if the character set position number is 1 to 64, the keyword attribute of the character set position information is the book name; if it is 66 to 97, the keyword attribute is the author name; if it is 99 to 162, the keyword attribute is the publisher name; if it is 164 to 167, the keyword attribute is the year of publication; and if it is 169 or more, the keyword attribute is the abstract. Therefore, only the character set position information having the same attribute as that specified in the search input is extracted from the character set position information obtained by character set collation.
  • the character set position information is extracted from the character set group column of “Ele” in the search file. In the search input “Electro”, the character positions of “E” and “c” are “1” and “4”, respectively. Therefore, the character set position information whose character set positions differ by 3 is extracted, and “116901” in “Ele” and “116904” in “ctr” of the search file in FIG. 18 can be extracted as a combination of continuous character set position information.
  • it can be seen that the character set position information “116901”, “116904”, and “116905” belongs to character sets that have the same record number and keyword attribute number and are continuous. Furthermore, since the keyword attribute is "book name", the character set position information whose character set position number is 1 to 64 is selected from the character set position information remaining after the character set string matching so far, and "116901", "116904", and "116905" can be extracted.
  • This search processing operation is shown as a flowchart in FIGS. 20a and 20b.
  • search input is extracted, and a search input character set string is created by dividing the character string into three-character units from the beginning of the search input character string.
  • the search input character set sequence is rearranged in ascending order of occurrence frequency in all keywords (S136).
  • the character set position information stored in the character set type group column corresponding to the rearranged character set is extracted from the search file (S137).
  • the frequency of occurrence in all keywords in the search input character set string is low, the character set position identification number of the character set is i, and the character set position identification number of the character set with high frequency is j.
  • the position information is taken out (S138).
  • The same process is performed for the remaining character sets in the search input character set string (S139, S140), and from the remaining character set position information, only the record numbers whose character set position number falls within the character position range P_a of keyword attribute a are taken out.
  • the following equation (9) is used to extract the character set position identification from the character set position information.
  • search processing for other phonetic characters can be performed in the same manner.
  • The fifth embodiment stands in the same relationship as the second embodiment does to the first embodiment.
  • A search file is created according to the JIS code table using character sets of two characters.
  • the search file creation processing and search processing procedure of the fifth embodiment are the same as those of the fourth embodiment except that the number of keyword characters and the setting of the keyword attribute area are different.
  • It is effective to use a two-character set search file in the search processing of Japanese documents, which use kana characters and kanji and therefore have more character types than European characters.
  • Kana characters may be handled with the character set search file according to the fifth embodiment.
  • Kanji may be handled with the character type group search file in units of one character according to the sixth embodiment.
  • the sixth embodiment has the same relationship as the first embodiment and the third embodiment with respect to the second embodiment.
  • character position information is stored in units of one character.
  • a search file composed of character type groups is used.
  • The sixth embodiment creates character position information in units of one character. Therefore, the character position information is represented by $\text{character position information code} = \text{record number} \times H + (P_a - 1) + p$.
  • the character position information is configured as shown in FIG. Fig. 25 shows an example in which the character position information of the book name "Correspondence of communication document" is registered in the search file.
  • FIGS. 27a and 27b show a flowchart of the search process.
  • the procedure of the search file creation process and the search process is basically the same as in the fourth embodiment.
  • The difference is that the search file is composed of character type groups in units of one character and is constructed on the basis of the JIS code for Japanese processing.
  • [INDUSTRIAL APPLICABILITY]
  • The present invention uses, for each character set type of the character string to be searched, character set position information consisting of a search unit identification symbol, a character set position order code, and an attribute number for the search unit to which the character set belongs.
  • A search file that stores this position information is created; this search file is searched, the character set position information for each character set type that composes the search input character string is extracted, and a character string that matches the search input is searched for.
  • A search file that stores character position information for each character type is created, the character position information for each character type that constitutes the search input character string is extracted, and a character string that matches the search input is searched for.
  • the present invention has the following excellent effects.
  • Any character string can be searched for because the search processing focuses on character sets and character positions, and it is not necessary to extract character strings at the time of registration as in the index method or pre-search method of full-text search processing.
  • a high-speed search can be realized only by software without using dedicated hardware, so that a full-text search can be efficiently performed by a general-purpose information processing device, and the versatility is high.
  • a character string consisting of characters with few character types, such as European characters, can also be searched by creating a search file that stores character set position information in the character set type group that composes the character string.
  • the frequency of occurrence of the same character string is low, so that the frequency of appearance of each character set can be kept low, and search matching can be performed in a character set with a low frequency of appearance, thus enabling high-speed search.
  • Since the search process only needs to extract the character position information or character set position information of the characters or character sets in the search input character string, even when the search file resides in an external storage device, the time required to transfer the contents of the search file to the main memory is reduced, and the search process can be sped up.
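
The search procedure described in the excerpts above (splitting the search input into three-character sets, collating the rarest set first, keeping only position information whose position differences match, then checking the keyword attribute range) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names, the sample records, and the field width `H = 1000` are assumptions, equation (9) is not reproduced in this section so the position is recovered here by plain division with remainder, and aligning the last character set to the end of the query is inferred from the “Electro” example (positions 1, 4, and 5).

```python
from collections import defaultdict

H = 1000  # assumed size of the position field (not stated in this section)

# Position ranges per keyword attribute, as quoted in the text. The
# open-ended "169 or more" range is capped at H - 1 for this sketch.
ATTRIBUTE_RANGES = {
    "book name": (1, 64),
    "author": (66, 97),
    "issuer name": (99, 162),
    "year of publication": (164, 167),
    "abstract": (169, H - 1),
}


def build_search_file(records, n=3):
    """Group position codes by character set type.

    records maps record number -> {attribute: text}. Each code is
    record * H + (P_a - 1) + p, where P_a is the first position of the
    attribute range and p is the 1-based offset inside the attribute text.
    """
    search_file = defaultdict(list)
    for record, fields in records.items():
        for attribute, text in fields.items():
            first, last = ATTRIBUTE_RANGES[attribute]
            if len(text) > last - first + 1:
                raise ValueError(f"{attribute!r} text does not fit its range")
            for p in range(1, len(text) - n + 2):
                code = record * H + (first - 1) + p
                search_file[text[p - 1:p - 1 + n]].append(code)
    return search_file


def split_into_sets(query, n=3):
    """Cut the query into n-character sets from its start. The last set is
    aligned to the end of the query so that every set has n characters."""
    if len(query) < n:
        raise ValueError("query is shorter than one character set")
    starts = list(range(0, len(query) - n + 1, n))
    if starts[-1] != len(query) - n:
        starts.append(len(query) - n)
    return [(s + 1, query[s:s + n]) for s in starts]  # (1-based offset, set)


def search(search_file, query, attribute, n=3):
    """Return the sorted record numbers whose `attribute` contains `query`."""
    first, last = ATTRIBUTE_RANGES[attribute]
    sets = split_into_sets(query, n)
    # Collate the rarest character set first to keep the candidate list short.
    sets.sort(key=lambda item: len(search_file.get(item[1], ())))
    offset, rarest = sets[0]
    # Each candidate is the code the first character of the query would have.
    candidates = {code - (offset - 1) for code in search_file.get(rarest, ())}
    for offset, char_set in sets[1:]:
        codes = set(search_file.get(char_set, ()))
        # Keep candidates whose position difference matches the query.
        candidates = {c for c in candidates if c + (offset - 1) in codes}
    hits = set()
    for c in candidates:
        record, position = divmod(c, H)
        # Attribute check: the whole match must lie inside the range.
        if first <= position and position + len(query) - 1 <= last:
            hits.add(record)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sample records, invented for illustration.
    records = {
        1169: {"book name": "Electronics Handbook", "author": "Flores"},
        1170: {"book name": "Selected Papers", "author": "Electra"},
    }
    search_file = build_search_file(records)
    print(search(search_file, "Electro", "book name"))
    print(search(search_file, "Elect", "author"))
```

Because every three-character window of each registered keyword is stored, a query can match at any offset inside an attribute, which is what makes partial-match search possible without scanning the text itself.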

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data retrieval device enabling fast retrieval and arbitrary collation of character strings from a database, either in a full-text search mode or in a mode using a plurality of keywords. A character string comprising keywords to be searched is divided into individual characters or into character sets each consisting of a plurality of characters. For each character or character set, character position information is generated comprising an identification code of the character string unit to be searched to which the character or character set belongs, a character position sequence code indicating the position of the character in the character string, and an attribute code indicating the logical division of the character string. A search file is thus prepared in advance in which the character position information is grouped by each type of character or character set. For a search request, the character position information of the characters or character sets making up the request is extracted from the search file and collated against the request, and the character string to be searched that is continuous and whose attribute code coincides with the request is extracted from the search file. The number of character string collations can thus be reduced, and fast partial-match search or fast full-text search can be achieved.
PCT/JP1991/000011 1990-11-30 1991-01-10 Dispositif d'extraction de donnees WO1992009960A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2/338546 1990-11-30
JP2338546A JPH0782504B2 (ja) 1990-11-30 1990-11-30 情報検索処理方式および検索ファイル作成装置
JP2/417609 1990-12-12
JP2417609A JPH07109603B2 (ja) 1990-12-12 1990-12-12 情報検索処理方式および検索ファイル作成装置

Publications (1)

Publication Number Publication Date
WO1992009960A1 true WO1992009960A1 (fr) 1992-06-11

Family

ID=26576122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP1991/000011 WO1992009960A1 (fr) 1990-11-30 1991-01-10 Dispositif d'extraction de donnees

Country Status (1)

Country Link
WO (1) WO1992009960A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4554631A (en) * 1983-07-13 1985-11-19 At&T Bell Laboratories Keyword search automatic limiting method
US4606002A (en) * 1983-05-02 1986-08-12 Wang Laboratories, Inc. B-tree structured data base using sparse array bit maps to store inverted lists
JPS6435627A (en) * 1987-07-31 1989-02-06 Fujitsu Ltd Data retrieving system
JPS6435626A (en) * 1987-07-31 1989-02-06 Fujitsu Ltd Word retrieving system
JPS6436329A (en) * 1987-07-31 1989-02-07 Nec Corp Character string registration retriever

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
I. FLORES, "Data Management", 10 August 1972, TAKEUCHI SCHOTEN (TOKYO), p. 201-220, (I. FLORES, "Data Structure and Management", 1970, PRENTICE-HALL). *
MASAYUKI TAKEDA, "High-speed Pattern Matching Algorithm for Total Text Processing", 1991, Treatises from Informatics Symposium Lecture, 8 January 1991. *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0595539A1 (fr) * 1992-10-30 1994-05-04 AT&T Corp. Un procédé séquentiel de recherche dans une mémoire à motifs et de gestion de stockage
US5913216A (en) * 1996-03-19 1999-06-15 Lucent Technologies, Inc. Sequential pattern memory searching and storage management technique
CN111369980A (zh) * 2020-02-27 2020-07-03 网易有道信息技术(北京)有限公司江苏分公司 语音检测方法、装置、电子设备及存储介质
CN111369980B (zh) * 2020-02-27 2023-06-02 网易有道信息技术(江苏)有限公司 语音检测方法、装置、电子设备及存储介质

Similar Documents

Publication Publication Date Title
Robertson et al. Applications of n‐grams in textual information systems
US4775956A (en) Method and system for information storing and retrieval using word stems and derivative pattern codes representing familes of affixes
JP3160201B2 (ja) 情報検索方法、情報検索装置
US5590317A (en) Document information compression and retrieval system and document information registration and retrieval method
US5995962A (en) Sort system for merging database entries
US5523946A (en) Compact encoding of multi-lingual translation dictionaries
US20090193005A1 (en) Processor for Fast Contextual Matching
JPH08249354A (ja) 単語索引および単語索引作成装置および文書検索装置
Keskustalo et al. Non-adjacent digrams improve matching of cross-lingual spelling variants
JP2833580B2 (ja) 全文インデックス作成装置および全文データベース検索装置
JP2669601B2 (ja) 情報検索方法及びシステム
JP3220865B2 (ja) フルテキストサーチ方法
JPH04205560A (ja) 情報検索処理方式および検索ファイル作成装置
JPH0740275B2 (ja) キーワード重要度自動評価装置
Hockey et al. The Oxford concordance program version 2
JP2519130B2 (ja) マルチキ―ワ―ド情報検索処理方式および検索ファイル作成装置
Robertson et al. A comparison of spelling-correction methods for the identification of word forms in historical text databases
JP2519129B2 (ja) マルチキ―ワ―ド情報検索処理方式および検索ファイル作成装置
WO1992009960A1 (fr) Dispositif d'extraction de donnees
JPH04326164A (ja) データベース検索システム
JP3081093B2 (ja) 索引作成方法およびその装置と文書検索装置
JPH04215181A (ja) 情報検索処理方式および検索ファイル作成装置
JPH03150668A (ja) 検索システムの入力文字列正規化方式
JPH10177575A (ja) 語句抽出装置および方法、情報記憶媒体
Williams et al. Document retrieval using a substring index

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): DE FR GB

NENP Non-entry into the national phase

Ref country code: CA