WO2017126057A1 - Information search method - Google Patents

Publication number
WO2017126057A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
word
excluded
character string
column
Application number
PCT/JP2016/051566
Inventor
岐勇 飯島
敦 畠山
翔太 葛西
裕介 水藤
Original Assignee
株式会社日立製作所
Application filed by 株式会社日立製作所
Priority to PCT/JP2016/051566
Publication of WO2017126057A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to an information search method.
  • A search target character string is referred to as a “search word”.
  • Conventionally, the mainstream method has been to create an index of the target documents in advance and, when a user makes a search request, to find the requested document by referring to that index.
  • With that method, an information search cannot be performed unless the documents have been indexed in advance. Full-text search technology, which matches the search target documents against the search word at the time a search request from a user is accepted, has therefore also appeared.
  • In full-text search, words that are irrelevant to the search word may be retrieved as well; such results are referred to as “search noise”.
  • For example, when the search word is “log” (ログ in Japanese), a word that includes the character string “log” but is not related to “log”, such as “program” (プログラム), is also retrieved. This is a particularly common problem when searching documents written in a language that has no spaces between words, such as Japanese.
  • Patent Literature 1 discloses a technique that uses a word specifying dictionary and an extended word dictionary, in which words that include the search word character string as a partial character string (hereinafter referred to as “extended words”) are registered, to remove extended words from the search result. For example, assume that “log” is designated as the search word character string and that an extended word including “log” as a partial character string (for example, “program”) is registered in the extended word dictionary. In this case, occurrences of “program” in the search target document are excluded from the reported search result, so search noise can be reduced.
  • In the technology disclosed in Patent Literature 1, an extended word dictionary must be prepared in advance in order to remove noise from the search result. Conversely, when a word that is not registered in the extended word dictionary is included in the search target document, that word cannot be removed as noise. In addition, maintaining an extended word dictionary is costly.
  • When the information search method of the present invention receives the specification of a search word, it identifies, from the search target data, the position at which the specified search word appears and the character string adjacent to that position (the adjacent character string). Then, based on the characteristics of the adjacent character string, it determines whether the adjacent character string is a character string constituting a word to be excluded from the search target (an excluded word). If so, the word obtained by connecting that character string and the search word is determined to be an excluded word candidate.
  • search noise can be reduced.
  • FIG. 12 is a flowchart of excluded word candidate sorting processing according to the second embodiment. It is an example of the result display screen in Example 2.
  • In the following description, processing executed by a computer such as a host may be described with “program” as the subject. Strictly speaking, the subject of the processing is the processor (CPU) that executes the program, but to avoid redundant explanation, the content of the processing may be explained with the program as the subject. Further, part or all of the program may be realized by dedicated hardware.
  • The various programs described below may be provided by a program distribution server or by a computer-readable storage medium, and may be installed in each device that executes the program.
  • the computer-readable storage medium is a non-transitory computer-readable medium such as a non-volatile storage medium such as an IC card, an SD card, or a DVD.
  • FIG. 1 is a diagram illustrating a hardware configuration of the information search system.
  • the information search system includes a search system server 1 (hereinafter abbreviated as “server 1”), a search client 2 (hereinafter abbreviated as “client 2”), and a storage device 3.
  • the server 1 and the client 2 are connected so as to be able to communicate with each other via a local area network (LAN) 4 configured using, for example, Ethernet.
  • The server 1 is connected to the storage device 3 via a network 5 (also referred to as SAN 5) configured using, for example, Fibre Channel.
  • the server 1 is a computer that performs a search process based on an information search request received from a user of an information search system (hereinafter referred to as “user”).
  • The server 1 includes a CPU 11, a main memory (hereinafter referred to as “memory”) 12, a network port 13 for connecting to the LAN 4, an input device 14, an output device 15, and a storage port 16.
  • The memory 12 is a storage device such as a DRAM, and stores the programs executed by the CPU 11 and the control information used when those programs are executed.
  • the CPU 11 is a component that executes a program for executing search processing.
  • the input device 14 is a device used when the user inputs information, such as a keyboard and a mouse.
  • the output device 15 is, for example, a display (output) device such as a display or a printer. In this embodiment, the output device 15 is a display.
  • the storage port 16 is an interface for connecting the server 1 and the storage device 3.
  • the client 2 is a computer that is used by the user to instruct the server 1 to search for information or to receive an output of a search result from the server 1.
  • the client 2 includes a CPU 21, a main memory 22, a network port 23 for connecting to the LAN 4, an input device 24, and an output device (display) 25.
  • the CPU 21, the memory 22, the network port 23, the input device 24, and the output device 25 are the same as the CPU 11, the memory 12, the network port 13, the input device 14, and the output device 15 of the server 1, respectively.
  • the client 2 may include an auxiliary storage device such as a magnetic disk.
  • the storage device 3 is a device having a nonvolatile storage device such as a magnetic disk, and is a device for storing information to be searched by the user. In the present embodiment, information to be searched by the user is referred to as “search target data” (“search target data 31” in the figure).
  • the storage device 3 may be a device having a plurality of nonvolatile storage devices such as a so-called disk array (or RAID).
  • the storage device 3 is connected to the storage port 16 of the server 1 via the SAN 5.
  • the memory 12 of the server 1 stores a program executed by the server 1 and control information used by the program.
  • Examples of programs executed by the server 1 include a search program 120, an excluded word candidate extraction program 121, and an excluded word candidate sort program 122.
  • these programs are executed by the CPU 11.
  • Control information used by these programs includes hit location information 500, an excluded word candidate list 550, and appearance frequency information 600. Details of these will be described later.
  • the server 1 may store in the memory 12 programs other than the programs described above and information other than the control information described above.
  • the client program 221 exists in the main memory 22 of the client 2, and the CPU 21 executes the client program 221.
  • the client program 221 is a program that provides a GUI (Graphical User Interface) for a user to issue an information search instruction.
  • The search target data 31 targeted in the information search process according to the present embodiment is tabular data composed of a plurality of rows and a plurality of columns, as shown in FIG. 2.
  • The search target data 31 has a plurality of columns 301 to 306.
  • The top row of the search target data 31 contains the column names of the columns 301 to 306.
  • the columns 301 to 305 are columns that store typical information (structured data) such as names and dates.
  • a column 306 stores text written in a natural language.
  • the information stored in the column 306 is atypical data (unstructured data).
  • Information to be searched in the information search system according to the present embodiment is unstructured data stored in the column 306.
  • The columns 301 to 305 may store information expressed as numerical values, such as dates, or information other than numerical values, such as names. In each embodiment described below, the information stored in the columns 301 to 305 is referred to as a “value” even when it is character information other than a numerical value (for example, a character string such as “search” or “registration” recorded in the column 304).
  • unstructured data may be stored in the columns 301 to 305.
  • In the present embodiment, the search target data is tabular data as shown in FIG. 2, but the search target data is not limited to tabular data.
  • For example, data described in XML or the like may be used, and even a single text file or a collection of text files may be a search target in the information search system according to the present embodiment.
  • FIG. 3 shows an example of a search word input screen provided to the user by the information search system.
  • An input screen 400 in FIG. 3 is a screen that the client program 221 outputs (displays) to the output device 25 of the client 2, and the user can input a search term in the input field 401.
  • a plurality of search terms may be entered in the input field 401.
  • When the user presses the button 402 using the mouse or keyboard that is the input device 24, the client program 221 notifies the server 1 of the search term entered in the input field 401 and causes the server 1 to perform a search.
  • the information search system obtains the number of cases in which the search term input in the input field 401 is included from the atypical text information stored in the column 306 of the search target data 31.
  • FIG. 4 shows an example of the result display screen 450.
  • A search term specified by the user on the input screen 400 is displayed in the viewpoint column 451, and the numerical value 452 displayed below the viewpoint column 451 (referred to as the number of cases 452) is the number of times the search word (the word displayed in the viewpoint column 451) appears in the search target data 31.
  • In the example of FIG. 4, the result display screen 450 displays the number of cases 452 for each of the three search terms.
  • a column 453 on the right side of the viewpoint column 451 is a column for displaying candidates for excluded words.
  • A word (character string) that includes a search word and that is found in the search target data 31 by the information search system (specifically, the server 1) is called an excluded word candidate.
  • “Log” is displayed in one of the viewpoint columns 451.
  • When the server 1 finds a word including “log” in the search target data 31 (“program”, “blog”, and “application log” in the example of FIG. 4), it extracts that word as an excluded word candidate.
  • the server 1 counts the number of excluded word candidates existing in the search target data 31.
  • The excluded word candidates and their counts are sent to the client program 221. The client program 221 displays each excluded word candidate in the excluded word candidate column 453 and displays its count in the excluded word count column 454 on the right side of the excluded word candidate column 453. The value displayed in the number of cases 452 at this time includes the counts of the excluded word candidates.
  • a check box 455 is displayed on the right side of the excluded word count column 454.
  • a check box 455 exists for each candidate for excluded word.
  • When a check box 455 is turned on, the word displayed in the excluded word candidate column 453 of that row is excluded from the search.
  • the number of excluded word candidates (the number displayed in the excluded word number column 454) in the row where the check box 455 is turned on is subtracted from the value displayed in the number of cases 452.
  • Among the excluded word candidates, those excluded from the search target, that is, those whose check box 455 has been turned on by the user, are referred to as “excluded words”.
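The count adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patent's implementation: the function name `adjusted_count` and the sample numbers are hypothetical.

```python
def adjusted_count(total_hits: int, candidates: dict[str, int], checked: set[str]) -> int:
    """Subtract the counts of the checked excluded word candidates from the hit total.

    total_hits -- value shown as the number of cases 452 (includes all candidates)
    candidates -- excluded word candidate -> count shown in the count column 454
    checked    -- candidates whose check box 455 is turned on (the excluded words)
    """
    return total_hits - sum(candidates[word] for word in checked)


# Hypothetical numbers for the search word "log".
candidates = {"program": 700, "blog": 200, "application log": 50}
remaining = adjusted_count(1000, candidates, {"program", "blog"})
print(remaining)  # 1000 - 700 - 200
```

Leaving every check box off returns the original total, which matches the statement that the number of cases 452 initially includes the excluded word candidates.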
  • The flow of processing executed by the information search system will be described with reference to FIG. 5.
  • A search instruction is transmitted from the client program 221 to the search program 120, and the processing of FIG. 5 is started accordingly.
  • In FIG. 5, the letter “S” prefixed to a reference number means “step”.
  • the information search system according to the present embodiment is configured to accept a plurality of search terms.
  • Step 10 The search program 120 reads the search target data 31 and determines whether the search target data 31 (column 306) includes a word that matches the search word.
  • the search program 120 does not need to read all the search target data 31, and only needs to read the text stored in the column 306.
  • a word that is included in the search target data 31 and matches the search word is referred to as a “hit word”.
  • The content of the search word and that of the hit word are naturally the same, but the expression “hit word” is used when referring not to the word specified by the user (the search word) but to the character string existing in the search target data 31 (the word that matches the search word) and its position.
  • An example of the hit location information stored in the memory 12 (hereinafter referred to as “hit location information 500”) is shown in FIG. 6. The hit location information 500 is created for each search term.
  • the search term specified from the client program 221 is stored in the search term 501, and the length (number of characters) of the search term is stored in the Length 502.
  • In addition, one or more pieces of information about the positions where the hit word exists (hit locations 503) are stored.
  • Each hit location 503 is composed of two pieces of information, a row number 503-1 and an offset 503-2.
  • The row number 503-1 is the number stored in the column 301 of the search target data 31, and the offset 503-2 represents the position from the head of the text stored in the column 306 of the search target data 31.
  • For example, when a hit word begins at the fourth character of the text in the row whose row number is 1, the search program 120 creates a hit location 503 in which “1” is stored in the row number 503-1 and “4” is stored in the offset 503-2, and stores it in the memory 12.
  • the hit location information 500 shown in FIG. 6 is an example, and the format of the hit location information may be arbitrary as long as the information that uniquely identifies the position of the hit word is recorded in the memory 12.
  • the frequency 504 stores the number of times that the search word 501 appears in the search target data 31. This is equal to the number of hit locations 503.
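One possible in-memory shape for the hit location information 500, together with a simple scan that fills it, is sketched below. The class and function names, the sample rows, and the use of 1-based offsets (chosen to agree with the offset “4” in the example above) are assumptions; the source notes that the actual format may be arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class HitLocationInfo:
    search_word: str                       # search word 501
    length: int                            # Length 502 (number of characters)
    # Each entry is (row number 503-1, offset 503-2).
    hit_locations: list[tuple[int, int]] = field(default_factory=list)

    @property
    def frequency(self) -> int:            # frequency 504
        return len(self.hit_locations)


def find_hits(rows: dict[int, str], search_word: str) -> HitLocationInfo:
    """Scan the free text of each row (column 306) and record every hit position."""
    info = HitLocationInfo(search_word, len(search_word))
    for row_number, text in rows.items():
        index = text.find(search_word)
        while index != -1:
            info.hit_locations.append((row_number, index + 1))  # 1-based offset
            index = text.find(search_word, index + 1)
    return info


# Hypothetical rows: row number -> text of column 306.
rows = {1: "このプログラムでログを", 2: "ブログ"}
info = find_hits(rows, "ログ")
print(info.hit_locations, info.frequency)
```

Deriving the frequency from the list length mirrors the statement that the frequency 504 equals the number of hit locations 503.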
  • Step 20 As a result of Step 10, when the hit location information 500 is not generated (that is, when the search target data 31 does not include a search word), the search program 120 ends. When the hit location information 500 is generated (step 20: Yes), the search program 120 next executes step 30.
  • Step 30 The search program 120 calls the excluded word candidate extraction program 121 to specify an excluded word candidate.
  • the excluded word candidate extraction program 121 called from the search program 120 identifies an excluded word candidate using the hit location information 500.
  • In addition, the excluded word candidate extraction program 121 counts the number of appearances of each excluded word candidate in the search target data 31, creates an excluded word candidate list 550, which is information recording each excluded word candidate and its number of appearances, and records the list in the memory 12.
  • the format of the excluded word candidate list 550 stored in the memory 12 is shown in FIG.
  • the excluded word candidate list 550 includes a search word 551, a Length 552, and a candidate 553.
  • The search word 551 and the Length 552 are the search word specified from the client program 221 and the length of that search word, respectively.
  • the candidate 553 is information composed of a candidate word 553-1 and the number of appearances 553-2. When a plurality of excluded word candidates are found, a plurality of candidates 553 exist in the excluded word candidate list 550.
  • The candidate word 553-1 is an excluded word candidate specified by the search program 120 (more precisely, by the excluded word candidate extraction program 121), and the number of appearances 553-2 is the number of times the candidate word 553-1 appears in the search target data 31. The contents of the processing of the excluded word candidate extraction program 121 will be described later.
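The excluded word candidate list 550 can be pictured as a small record holding the search word, its length, and one (candidate word, number of appearances) pair per candidate. The sketch below is illustrative only: the names are hypothetical, the input is assumed to be the word extracted around each hit, and treating a hit whose extracted word equals the bare search word as a non-candidate is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExcludedWordCandidateList:
    search_word: str                       # search word 551
    length: int                            # Length 552
    # Each entry is (candidate word 553-1, number of appearances 553-2).
    candidates: list[tuple[str, int]] = field(default_factory=list)


def build_candidate_list(search_word: str, extracted_words: list[str]) -> ExcludedWordCandidateList:
    """Count how often each word containing the search word was extracted."""
    # Assumption: a hit that is exactly the search word is not an exclusion candidate.
    counts = Counter(word for word in extracted_words if word != search_word)
    return ExcludedWordCandidateList(search_word, len(search_word), list(counts.items()))


# Hypothetical extraction results, one entry per hit location.
words = ["program", "blog", "program", "log", "application log"]
candidate_list = build_candidate_list("log", words)
print(candidate_list.candidates)
```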
  • Step 40 The search program 120 calls the excluded word candidate sorting program 122 to rearrange the excluded word candidates specified in step 30.
  • the called excluded word candidate sorting program 122 sorts each candidate 553 when there are a plurality of candidates 553 in the excluded word candidate list 550.
  • sorting is not an essential process. Therefore, in the information search system according to the first embodiment, the search program 120 may not execute step 40.
  • the sorting method is arbitrary.
  • For example, the candidates 553 may be sorted in descending order of the number of appearances 553-2. In that case, the candidates 553 are stored in the excluded word candidate list 550 in descending order of the number of appearances 553-2.
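As a small illustration of the optional sort in step 40, the candidates can be ordered by their number of appearances. The function name and the sample data are hypothetical; the source leaves the sorting method open.

```python
def sort_candidates(candidates: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return (candidate word, appearances) pairs in descending order of appearances."""
    # sorted() is stable, so candidates with equal counts keep their original order.
    return sorted(candidates, key=lambda candidate: candidate[1], reverse=True)


ordered = sort_candidates([("blog", 200), ("program", 700), ("application log", 50)])
print(ordered)
```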
  • Step 50 The search program 120 causes the client program 221 to display a result display screen 450.
  • the search program 120 transmits the search word 501 and the frequency 504 included in the hit location information 500 and the excluded word candidate list 550 to the client program 221.
  • When a plurality of search words (for example, n) are specified, n sets of the search word 501, the frequency 504, and the excluded word candidate list 550 are transmitted.
  • the client program 221 that has received the information from the search program 120 displays the result display screen 450 in FIG. 4 on the output device 25 and waits for an instruction from the user.
  • The client program 221 displays the excluded word candidates in the order of the candidates 553 stored in the excluded word candidate list 550. If, in step 40, the candidates 553 were sorted in descending order of the number of appearances 553-2, the client program 221 displays the excluded word candidates in order from the one with the largest number of appearances.
  • When the user turns on the check box 455 of an excluded word candidate and presses the Refresh button 460, the client program 221 notifies the search program 120 of that excluded word candidate.
  • Step 60 The search program 120 recalculates the number of cases by subtracting the number of appearances of the excluded word candidates notified from the client program 221, and returns the result (the search word 501 and the frequency 504 included in the hit location information 500, and the excluded word candidate list 550) to the client program 221. Upon receiving the notification, the client program 221 redisplays the result display screen 450. The redisplay is performed in the same manner as the processing performed in step 50.
  • Step 70 When the user turns on the check box 455 for an excluded word candidate and presses the Refresh button 460 on the result display screen 450 redisplayed in Step 60, the processing in Step 60 is performed again. If the user does not specify the check box 455 of any further excluded word candidate, the search process ends.
  • The excluded word candidate extraction program 121 determines, based on a predetermined rule, whether the characters (or character strings) adjacent to the hit word included in the search target data 31 are characters (or character strings) that are connected to the hit word and constitute an excluded word candidate.
  • characters (or character strings) adjacent to the hit word are referred to as “adjacent character strings”.
  • the adjacent character string may be one character, or may be two or more characters. Of the adjacent character strings, the characters (or character strings) connected to the hit word and constituting the excluded word candidate are referred to as excluded character strings.
  • Rule (1): Excluded word candidate identification rule based on character type
  • the character type here means katakana, hiragana, kanji, and the like.
  • this rule is referred to as “rule (1)”.
  • the excluded word candidate extraction program 121 can specify the character type of the adjacent character string or hit word from the character code of the adjacent character string or hit word.
  • In rule (1), each character is examined one at a time to determine whether it is an excluded character string. In the following description, the character being examined is referred to as the “determination target character”.
  • When the determination target character is a character indicating a delimiter (hereinafter referred to as a delimiter character), it is determined not to be an excluded character string. Delimiter characters include punctuation marks, spaces, line breaks, tabs, colons, semicolons, and parentheses.
  • When the determination target character is not a delimiter character, whether it is an excluded character string is determined by comparing it with the character adjacent to it. If the determination target character is the same character type as the adjacent character, the determination target character is determined to be an excluded character string. This is because, when the two character types are the same, the determination target character is highly likely to belong to the same word as the adjacent character. Specifically, when the determination target character and the adjacent character are both katakana (or both hiragana, or both kanji), the determination target character is determined to be an excluded character string. In addition, when the adjacent character is kanji, the determination target character is determined to be an excluded character string if it is katakana, hiragana, or kanji.
  • Otherwise, when the determination target character is a different character type from the adjacent character, it is determined not to be an excluded character string. Specifically, when the determination target character is hiragana and the adjacent character is katakana (or vice versa), the determination target character is not an excluded character string.
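Rule (1) can be sketched as a small predicate over two characters. The sketch below is a reading of the rule under stated assumptions, not the patent's code: the function names, the Unicode code point ranges used to detect each character type, the exact delimiter set, and the choice that characters outside the three listed types never match are all assumptions.

```python
HIRAGANA, KATAKANA, KANJI, OTHER = "hiragana", "katakana", "kanji", "other"

# Delimiter characters; this particular set is an assumption for the sketch.
DELIMITERS = set("、。,.:;()（）：；「」 \u3000\n\t")


def char_type(ch: str) -> str:
    """Classify one character from its character code (Unicode block ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return HIRAGANA
    if 0x30A0 <= code <= 0x30FF:
        return KATAKANA
    if 0x4E00 <= code <= 0x9FFF:
        return KANJI
    return OTHER


def is_excluded_char(target: str, neighbor: str) -> bool:
    """Decide whether `target` joins the word that its adjacent character belongs to."""
    if target in DELIMITERS:
        return False                      # a delimiter always ends the word
    target_type, neighbor_type = char_type(target), char_type(neighbor)
    if neighbor_type == KANJI:
        # Next to kanji, katakana, hiragana, and kanji are all accepted.
        return target_type in (KATAKANA, HIRAGANA, KANJI)
    # Otherwise only an identical, recognized character type connects.
    return target_type == neighbor_type and target_type != OTHER


print(is_excluded_char("プ", "ロ"), is_excluded_char("の", "プ"))
```

Here `neighbor` is the character on the side of the hit word, so scanning toward the front of the text passes the character behind the target, and scanning toward the back passes the character in front of it.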
  • An example of the excluded word candidate identification method based on rule (1) will be described with reference to FIG. 8.
  • The example in FIG. 8 assumes that the search word is “log” (ログ, written in katakana) and that the search target data 31 includes the character string “in this program”, in which “program” (プログラム) is written in katakana.
  • The characters in front of the hit word “log” are, in order of proximity, “P” (プ), “NO” (の), and so on. Since “P” is the same character type (katakana) as the hit word “log”, it is determined to be an excluded character string.
  • By contrast, “NO”, the character in front of “P”, is a different character type (hiragana) from the adjacent character “P”, and is therefore determined not to be an excluded character string.
  • Rule (2): Excluded word candidate identification rule using the 2-gram (bi-gram) method
  • In this rule, whether an adjacent character string is an excluded character string is determined based on the result of comparing the appearance frequency of the search word with the appearance frequency of the adjacent character string.
  • this rule is referred to as “rule (2)”.
  • An example of an excluded word candidate specifying method using rule (2) will be described with reference to FIG.
  • FIG. 9 shows an example in which the character string “log”, the hit word matching the search word, appears 12,000 times in the search target data 31 (its appearance frequency).
  • the appearance frequency in the search target data 31 of the character string adjacent to the hit word is first obtained.
  • the length of the adjacent character string may be arbitrary, but in this embodiment, an example in which the length of the adjacent character string is 2 will be described.
  • In rule (2), as in the case of creating a so-called bi-gram index, the appearance frequency is obtained for each two-character string in the search target data 31.
  • In the example of FIG. 9, the two-character strings “pro”, “gra”, and “ram” (each of which is two characters in the original Japanese) each appear 8,000 times in the search target data 31. That is, the numbers of appearances of “pro”, “gra”, and “ram” are all comparable to the number of appearances of the search word “log” (12,000 times).
  • In rule (2), when the number of appearances of a two-character string is approximately the same as that of the search word, that two-character string (“pro”, “gra”, and “ram” in the example of FIG. 9) is determined to be an excluded character string. This is because a character string (adjacent character string) that appears at a frequency close to the appearance frequency of the search word is highly likely to be a character string that is concatenated with the search word (hit word) to form one word.
  • On the other hand, the two-character string “NO” located in front of “pro” appears 800 times, and the two-character string “MU” located behind “ram” appears 100 times, both of which differ greatly from the number of appearances of the search word “log” (12,000 times). For this reason, under rule (2), the two-character strings “NO” and “MU” are determined not to be excluded character strings. Similarly, a character string located in front of “NO” (such as “this”) and a character string located behind “MU” (such as “in”) are also determined not to be excluded character strings.
  • Various methods can be used to determine whether the number of appearances of a two-character string and that of the search word are similar. For example, one method determines whether the absolute value of the difference between the two is 0 or more and less than a predetermined value, and another determines whether the ratio between the two is close to 1.
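The two ingredients of rule (2), counting every two-character string and testing whether a count is close to that of the search word, can be sketched as follows. The function names and both thresholds (`max_diff` and `ratio_tolerance`) are assumptions, since the source only says that the difference should be below a predetermined value or the ratio close to 1.

```python
from collections import Counter
from typing import Iterable, Optional


def bigram_counts(texts: Iterable[str]) -> Counter:
    """Count every two-character string, as when building a bi-gram index."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return counts


def is_similar(n_bigram: int, n_word: int,
               max_diff: Optional[int] = None,
               ratio_tolerance: float = 0.5) -> bool:
    """Return True when the two appearance counts are judged to be about the same."""
    if n_word == 0:
        return False
    if max_diff is not None:
        # Method 1: absolute difference below a predetermined value.
        return abs(n_bigram - n_word) < max_diff
    # Method 2: ratio of the two counts close to 1.
    return abs(n_bigram / n_word - 1.0) <= ratio_tolerance


# Counts taken from the FIG. 9 example: the search word appears 12,000 times.
print(is_similar(8000, 12000), is_similar(800, 12000), is_similar(100, 12000))
```

With the assumed ratio tolerance of 0.5, a count of 8,000 is accepted against 12,000 while 800 and 100 are rejected, which agrees with the outcome described for FIG. 9; a different tolerance would shift that boundary.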
  • the process flow of the excluded word candidate extraction program 121 performed in step 30 will be described with reference to FIG.
  • either an excluded word candidate specifying method based on the rule (1) or an excluded word candidate specifying method based on the rule (2) can be selected.
  • the storage device 3 stores a connection rule 32, and the connection rule 32 stores information on the rules (1) and (2) described above.
  • the excluded word candidate extraction program 121 reads either rule (1) or rule (2) from the connection rule 32 based on designation from the user.
  • Alternatively, an administrator of the information search system, as distinct from the user, may determine the rule (rule (1) or rule (2)) to be read by the excluded word candidate extraction program 121 and designate that rule to the search program 120 (or to the excluded word candidate extraction program 121).
  • Step 310 The excluded word candidate extraction program 121 reads the information of the rule (1) or the rule (2) from the connection rule 32 in the storage device 3.
  • In the following, a case where the information of rule (1) is read in step 310 (that is, a case where an excluded word candidate is specified based on rule (1)) will be described.
  • Step 320 The excluded word candidate extraction program 121 selects one hit location 503 from the hit location information 500, and specifies the appearance position of the hit word in the search target data 31. Further, the excluded word candidate extraction program 121 prepares two variables (a forward pointer H and a backward pointer T) for specifying the adjacent character string of the hit word. As described for the hit location information 500, two types of information, a row number and an offset, are used to specify a position in the search target data 31, but the forward pointer H and the backward pointer T store only an offset. As another embodiment, however, a set of a row number and an offset may be stored in the forward pointer H and the backward pointer T.
  • the excluded word candidate extraction program 121 sets initial values for the forward pointer H and the backward pointer T.
  • the forward pointer H and the backward pointer T will be described with reference to FIG.
  • the forward pointer H is a pointer indicating a character located before the appearance position of the hit word, and the initial value stores a position one character before the start position of the hit word.
  • the backward pointer T is a pointer that points to the character located behind the hit word, and the initial value stores the position of the character next to the end of the hit word.
  • In the example, the backward pointer T stores, as its initial value, 6, which is the position of “La” (ラ), the character immediately after the hit word “log” (ログ).
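The initial pointer values follow directly from the offset and the length of the hit word. In this short sketch the function name is hypothetical and offsets are taken to be 1-based, which is an assumption consistent with the example values (offset 4, backward pointer 6).

```python
def initial_pointers(offset: int, length: int) -> tuple[int, int]:
    """Return the initial (forward pointer H, backward pointer T) for one hit word."""
    forward_h = offset - 1         # one character before the start of the hit word
    backward_t = offset + length   # the character right after the end of the hit word
    return forward_h, backward_t


# Hit word of length 2 starting at offset 4, as in the example above.
print(initial_pointers(4, 2))
```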
  • Step 340 The excluded word candidate extraction program 121 determines the character string pointed to by the forward pointer H.
  • The determination method is as described above.
  • The excluded word candidate extraction program 121 determines whether the selected character string is an excluded character string based on the character type. To do so, it compares the character type of the character pointed to by the forward pointer H with that of the character adjacent behind it. If, as a result of the comparison, the character pointed to by the forward pointer H matches rule (1) described above, the excluded word candidate extraction program 121 determines that this character is an excluded character string. If the character pointed to by the forward pointer H is a delimiter, the excluded word candidate extraction program 121 determines that this character is not an excluded character string.
  • Step 350 The excluded word candidate extraction program 121 updates the value of the forward pointer H so that it points to the character immediately preceding the one currently pointed to. Specifically, the excluded word candidate extraction program 121 subtracts 1 from the value of the forward pointer H. It then executes step 340 for the character string pointed to by the updated forward pointer H. The excluded word candidate extraction program 121 repeats the processing in steps 340 and 350 until the character string pointed to by the forward pointer H is determined not to be an excluded character string (or until the forward pointer H becomes 0).
  • In the example of FIG. 11, the forward pointer H first points to the character “p”. In the determination in step 340, the character “p” is determined to be of the same character type (katakana) as the search word (log), so step 350 is executed next.
  • In step 350, the excluded word candidate extraction program 121 makes the forward pointer H point to the character adjacent in front of “p” (that is, “no”) by subtracting 1 from H, which gives 2, and the determination in step 340 is performed again.
  • The character “no” is of a different character type (hiragana) from the character “p” adjacent behind it. Therefore, when the excluded word candidate extraction program 121 performs the determination in step 340 for “no”, it determines that “no” is not an excluded character string. As a result, in the example of FIG. 11, the character “p” is determined to be an excluded character string, but the character strings positioned in front of it (“no” and “ko”) are determined not to be excluded character strings.
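  • The character-type comparison performed in steps 340 and 350 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the names char_type and is_excluded_char, the set of character types, and the use of Unicode character names to tell hiragana from katakana are all hypothetical, the sample sentence is made up, and offsets are 0-based rather than 1-based.

```python
import unicodedata


def char_type(ch: str) -> str:
    """Coarse character type of one character (hypothetical categories)."""
    if ch.isspace() or unicodedata.category(ch).startswith("P"):
        return "delimiter"  # whitespace and punctuation separate words
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def is_excluded_char(text: str, pos: int, neighbor: int) -> bool:
    """True if text[pos] is not a delimiter and has the same type as text[neighbor]."""
    kind = char_type(text[pos])
    return kind != "delimiter" and kind == char_type(text[neighbor])


text = "このプログラムでは"   # made-up sample; "ログ" is the hit word
h = text.find("ログ") - 1     # forward pointer H: one character before the hit
while h >= 0 and is_excluded_char(text, h, h + 1):
    h -= 1                    # step 350: move H one character toward the front
```

  • With this sample, the loop accepts the katakana character in front of the hit word and stops at the hiragana character before it, which mirrors the walk-through above.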
  • Step 370 The excluded word candidate extraction program 121 determines the character string pointed to by the backward pointer T.
  • The determination method is the same as in step 340. That is, the excluded word candidate extraction program 121 compares the character type of the character pointed to by the backward pointer T with that of the character adjacent in front of it and determines whether they are the same (if they are, the character pointed to by the backward pointer T is determined to be an excluded character string). If the character pointed to by the backward pointer T is a delimiter, that character is determined not to be an excluded character string. If the selected character string is determined to be an excluded character string (step 370: connected), step 380 is performed next; otherwise, step 390 is performed.
  • Step 380 The excluded word candidate extraction program 121 updates the value of the backward pointer T so that the backward pointer T points to the character immediately after the character currently pointed to.
  • This is the reverse of the method for updating the forward pointer H; that is, the excluded word candidate extraction program 121 adds 1 to the backward pointer T.
  • As a result, the one character adjacent behind the character string determined in step 370 is selected.
  • The excluded word candidate extraction program 121 repeats the processing of steps 370 and 380 until the character string pointed to by the backward pointer T is determined not to be an excluded character string (or until the backward pointer T points to the end of the search target data 31).
  • In the example of FIG. 11, the backward pointer T first points to the character “ra”. In the determination in step 370, “ra” is determined to be of the same character type (katakana) as the search word (log), so step 380 is executed next. In step 380, the excluded word candidate extraction program 121 updates the backward pointer T so that it points to the character adjacent behind “ra” (that is, “m”) by adding 1 to T, which gives 7. The determination in step 370 is then performed again.
  • The selected character string “m” is also determined to be an excluded character string in step 370, so the excluded word candidate extraction program 121 executes step 380 again. After step 380, the backward pointer T points to the character “de”, and the determination in step 370 is performed for “de”.
  • Because “de” is of a different character type (hiragana) from the character “m” adjacent in front of it, step 370 determines that “de” is not an excluded character string.
  • As a result, the characters “ra” and “m” are determined to be excluded character strings, but the character strings located behind them (“de” and “ha”) are determined not to be excluded character strings.
  • Step 390 The excluded word candidate extraction program 121 adds 1 to the forward pointer H and subtracts 1 from the backward pointer T. It then determines, as an excluded word candidate, the character string whose first character is the one pointed to by the forward pointer H and whose last character is the one pointed to by the backward pointer T, and records the determined excluded word candidate in the candidate 553 of the excluded word candidate list 550. For example, when the processing up to step 380 has been executed on the character string shown in FIG. 11, the forward pointer H points to the character “no” and the backward pointer T points to the character “de”.
  • In step 390, the excluded word candidate extraction program 121 adds 1 to the offset of the forward pointer H and subtracts 1 from the offset of the backward pointer T.
  • As a result, the forward pointer H points to the character “p” and the backward pointer T points to the character “m”. Therefore, the character string “program” is determined as an excluded word candidate.
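  • The pointer handling of steps 320 to 390 can be sketched as a single function. The following Python sketch is illustrative only: the function extract_candidate, its is_excluded callback (which stands in for whichever rule was read in step 310), the simplified katakana test, and the sample sentence are assumptions, and offsets are 0-based, so the forward scan stops when H would move before the first character rather than when H becomes 0.

```python
from typing import Callable


def extract_candidate(text: str, start: int, end: int,
                      is_excluded: Callable[[str, int, int], bool]) -> str:
    """Expand the hit text[start:end] to the surrounding excluded word candidate.

    is_excluded(text, pos, neighbor) decides whether text[pos] joins the
    candidate, given the adjacent character on the hit-word side.
    """
    h = start - 1                      # forward pointer H: one before the hit
    while h >= 0 and is_excluded(text, h, h + 1):
        h -= 1                         # step 350
    t = end                            # backward pointer T: one after the hit
    while t < len(text) and is_excluded(text, t, t - 1):
        t += 1                         # step 380
    # Step 390: H + 1 is the first character, T - 1 the last character.
    return text[h + 1:t]


def same_katakana(text: str, pos: int, neighbor: int) -> bool:
    """Simplified stand-in for rule (1): both characters are katakana."""
    def is_katakana(ch: str) -> bool:
        return "\u30a1" <= ch <= "\u30ff"
    return is_katakana(text[pos]) and is_katakana(text[neighbor])


text = "このプログラムでは"            # made-up sample sentence
hit = text.find("ログ")                # hit word found at offset 3
candidate = extract_candidate(text, hit, hit + 2, same_katakana)
print(candidate)                       # プログラム
```

  • Passing the rule as a callback keeps the pointer logic identical for rule (1) and rule (2); only the test applied to each adjacent character changes.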
  • Step 400 The excluded word candidate extraction program 121 determines whether or not the processing of Step 320 to Step 380 has been executed for all the hit locations 503 stored in the hit location information 500. If the processing of step 320 to step 380 has not been executed for all hit locations 503 (step 400: No), the excluded word candidate extraction program 121 performs the processing from step 320 again. If the processing of step 320 to step 380 has been executed for all hit locations 503 (step 400: Yes), then step 410 is performed.
  • Step 410 The excluded word candidate extraction program 121 counts the number of appearances in the search target data 31 for each excluded word candidate recorded in the excluded word candidate list 550. Then, the excluded word candidate extraction program 121 records the counted number of appearances in the excluded word candidate list 550 (number of appearances 553-2), and ends the process.
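  • Step 410 amounts to a frequency count over the search target data. A minimal Python sketch is given below; the function name count_appearances, the representation of the search target data as a list of strings, and the sample rows are assumptions made for illustration. Note that str.count counts non-overlapping occurrences.

```python
def count_appearances(rows: list[str], candidates: list[str]) -> dict[str, int]:
    """Number of times each excluded word candidate appears in all rows."""
    return {c: sum(row.count(c) for row in rows) for c in candidates}


rows = ["このプログラムではログを出力", "ブログとプログラム"]  # made-up sample rows
appearances = count_appearances(rows, ["プログラム", "ブログ"])
```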
  • When rule (1) is read in step 310, excluded word candidates are specified according to the flow described above.
  • Next, the case where rule (2) is read in step 310 will be described.
  • In this case as well, the processing follows the same flow as described above, so the description below focuses on the differences.
  • When rule (2) is read in step 310, the excluded word candidate extraction program 121 counts the appearance frequency of each character string in the search target data 31 and creates the appearance frequency information 600 before executing step 320. An example of the appearance frequency information 600 is shown in FIG.
  • The excluded word candidate extraction program 121 decomposes all data in the search target data 31 (in column 306) into two-character strings, in the same way as the bi-gram method.
  • For example, the excluded word candidate extraction program 121 extracts two-character strings such as “pro” and stores them in the character string 610 column of the appearance frequency information 600. It then counts the number of times each character string stored in the character string 610 column appears in the search target data 31 and stores the result in the appearance number column 620.
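  • The two-character decomposition and counting can be sketched as follows. This Python sketch is an assumption-laden illustration: the function name bigram_counts and the sample rows are hypothetical, and a collections.Counter stands in for the character string 610 and appearance number 620 columns.

```python
from collections import Counter


def bigram_counts(rows: list[str]) -> Counter:
    """Count every two-character string at every position of every row."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(row[i:i + 2] for i in range(len(row) - 1))
    return counts


frequency = bigram_counts(["プログラム", "ブログ"])  # made-up sample rows
```

  • Here the two-character string "ログ" is counted twice, once for each row in which it occurs.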
  • The processing performed in step 320 is the same as that described for the case where rule (1) is read, so its description is omitted here.
  • In step 340, the excluded word candidate extraction program 121 selects a two-character string composed of the character pointed to by the forward pointer H and the one character adjacent behind it. For example, as shown in FIG. 11, when the forward pointer H points to the character “p”, the two-character string consisting of “p” and the character “ro” adjacent behind it (that is, “pro”) is selected. The excluded word candidate extraction program 121 then refers to the appearance frequency information 600 created in step 310 and compares the number of appearances 620 of the selected character string with the number of appearances of the hit word to determine whether the two appear at comparable frequencies.
  • Various methods can be used to determine whether the two appear at comparable frequencies. For example, if “number of appearances of the selected character string ÷ number of appearances of the hit word” is within a predetermined range (for example, 0.5 to 2), it may be determined that the two appear at comparable frequencies and that the selected character string is an excluded character string.
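  • The ratio test given as an example can be written as a small predicate. In the Python sketch below, the function name, the parameter names, and the treatment of a hit word count of zero are assumptions; only the ratio and the example range of 0.5 to 2 come from the description.

```python
def appears_at_comparable_frequency(selected_count: int, hit_count: int,
                                    low: float = 0.5, high: float = 2.0) -> bool:
    """True if selected_count / hit_count lies within [low, high]."""
    if hit_count == 0:
        return False  # assumption: no hit word occurrences means no match
    return low <= selected_count / hit_count <= high
```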
  • If the selected character string is determined to be an excluded character string, step 350 is executed next.
  • Otherwise, step 360 is executed next.
  • In step 350, the excluded word candidate extraction program 121 subtracts 1 from the forward pointer H, in the same way as described for the case where rule (1) is read.
  • In step 370, the excluded word candidate extraction program 121 selects a two-character string composed of the character pointed to by the backward pointer T and the one character adjacent in front of it. For example, as shown in FIG. 11, when the backward pointer T points to the character “ra”, the two-character string consisting of “ra” and the character “g” adjacent in front of it (that is, “gra”) is selected. The excluded word candidate extraction program 121 then refers to the appearance frequency information 600 created in step 310 and compares the number of appearances 620 of the selected character string with the number of appearances of the hit word to determine whether the two appear at comparable frequencies. This determination is the same as the determination performed in step 340. If the two appear at comparable frequencies (step 370: connected), step 380 is executed next; if they do not (step 370: not connected), step 390 is executed next.
  • In step 380, the excluded word candidate extraction program 121 adds 1 to the backward pointer T, as described above. Thereafter, the process of step 370 is performed.
  • The excluded word candidate extraction program 121 repeats the processing in steps 370 and 380 until the selected character string is determined not to be an excluded character string (or until the backward pointer T indicates the end of the search target data 31).
  • The processing from step 390 to step 410 is the same as the processing described for the case where rule (1) is read.
  • The above is the method by which the excluded word candidate extraction program 121 identifies excluded word candidates.
  • In the above, an example in which the excluded word candidate extraction program 121 first reads the information on rule (1) or rule (2) has been described.
  • However, the information on rule (1) or rule (2) may be embedded in the excluded word candidate extraction program 121 in advance. In that case, step 310 need not be executed.
  • the above is the information search method according to the first embodiment.
  • The information search system according to the first embodiment specifies excluded word candidates based on the characteristics (character type and appearance frequency) of the character strings adjacent to hit words and presents the specified candidates to the user. Because only information about the adjacent character strings (characteristic information such as character type and appearance frequency) is used to specify excluded word candidates, the information search system according to the present embodiment needs no dictionary for finding them. The user can designate, from the presented candidates, a word to be excluded from the search result (an excluded word); when an excluded word is designated, the information search system presents search results that do not contain it. The information search system according to the present embodiment can thereby reduce search noise in the search results even when the user has little knowledge of the data to be searched and analyzed.
  • The method of providing a search result to the user is not limited to providing the number of appearances.
  • For example, the information search system may be configured to output the contents of the lines including the hit word to the output device 25.
  • In the above, an example in which the information search system outputs (displays) the number of search words (or excluded word candidates) included in the search target data 31 as the number of appearances has been described. Alternatively, the number of lines including the search word may be output (displayed).
  • In the second embodiment, the server 1 executes a search program 120', an excluded word candidate extraction program 121', and an excluded word candidate sort program 122'. These programs are almost the same as the search program 120, the excluded word candidate extraction program 121, and the excluded word candidate sort program 122 described in the first embodiment, so the description below focuses on the differences.
  • On the client 2, the client program 221' is executed. Like the client program 221 described in the first embodiment, the client program 221' provides a GUI (Graphical User Interface) with which the user issues information search instructions. However, the content of the information output to the output device 25 differs slightly from that in the first embodiment.
  • The information search system according to the first embodiment determines whether a word is an excluded word candidate based on the character type and appearance frequency of the character strings adjacent to the hit word.
  • The information search system according to the second embodiment makes the same determination.
  • In addition, it uses not only the atypical text information (column 306) but also the structured data in columns 301 to 305 to determine the display order of excluded word candidates.
  • FIG. 14 shows an example of the search target data 31, and the contents are the same as those shown in FIG.
  • In the second embodiment as well, excluded word candidates are determined by the same method as described in the first embodiment.
  • For example, when the search word is “log”, searching the text data in column 306 identifies words such as “blog” and “program” as excluded word candidates.
  • The information search system further determines the display priority of the excluded word candidates by using values stored in columns other than column 306 of the search target data 31. When columns other than column 306 are examined for each row in which the search word “log” appears, the same value may be found in those rows. In the example of FIG. 14, there are two rows in which the search word “log” appears, and in column 304 (the column named “component”) of each of them the word “register” is stored as the value. That is, in the example of FIG. 14, the value “register” in column 304 co-occurs with the search word “log”.
  • Likewise, columns other than column 306 can be examined for rows in which an excluded word candidate (for example, “blog”) appears. If the value that co-occurs with the excluded word candidate is the same as the value that co-occurs with the search word, the excluded word candidate is considered to be highly similar to the search word and highly important to the user.
  • Because it is better to display such words preferentially, the information search system according to this embodiment extracts a value having a high co-occurrence rate with the search word and a value having a high co-occurrence rate with each excluded word candidate, and determines the display priority of the excluded word candidates based on the extracted values.
  • The co-occurrence rate of a hit word and a certain value (called a column value) in a certain column (a column other than column 306) is determined as follows. Let A be the number of rows of the search target data 31 in which the hit word exists, and let B be the number of those rows in which the column value is stored in the column. The co-occurrence rate of the hit word and this column value is then defined as B / A.
  • The co-occurrence rate of an excluded word candidate and a column value is determined in the same way. That is, if A′ is the number of rows of the search target data 31 in which the excluded word candidate exists and B′ is the number of those rows in which the column value is stored in the column, the co-occurrence rate of the excluded word candidate and this column value is defined as B′ / A′.
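  • The definition B / A can be illustrated with a few lines of Python. The sketch below is hypothetical in its details: rows are modeled as dictionaries, the column names "component" and "text" and the sample values are invented, and a word that appears in no row is given a rate of 0.

```python
def cooccurrence_rate(rows: list[dict], word: str, column: str, value: str,
                      text_column: str = "text") -> float:
    """B / A: share of rows containing word whose column holds value."""
    containing = [r for r in rows if word in r[text_column]]   # A rows
    if not containing:
        return 0.0  # assumption: undefined rate is reported as 0
    matching = sum(1 for r in containing if r[column] == value)  # B rows
    return matching / len(containing)


rows = [  # made-up sample data
    {"component": "register", "text": "log output failed"},
    {"component": "register", "text": "log file is full"},
    {"component": "billing", "text": "log rotation skipped"},
    {"component": "billing", "text": "invoice sent"},
]
rate = cooccurrence_rate(rows, "log", "component", "register")  # 2 of 3 rows
```

  • The same function serves for an excluded word candidate (A′ and B′) by passing the candidate as the word.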
  • In the information search system according to the first embodiment, the hit location information 500, the excluded word candidate list 550, and the appearance frequency information 600 are created. Of these, the hit location information 500 and the appearance frequency information 600 are created in the same way in the information search system according to the second embodiment, so they are not described in detail here.
  • the information search system according to the second embodiment creates an excluded word candidate list 550 'instead of the excluded word candidate list 550 described in the first embodiment. Furthermore, the information search system according to the second embodiment creates hit word information 500 '. Hereinafter, these two pieces of information will be described.
  • the format of the excluded word candidate list 550 ' will be described with reference to FIG.
  • the excluded word candidate list 550 ' is created by the excluded word candidate extraction program 121' and the excluded word candidate sort program 122 '.
  • the excluded word candidate list 550 ' includes a search word 551, a Length 552, and a candidate 553'. Search words 551 and Length 552 are the same as those included in the excluded word candidate list 550 in the first embodiment.
  • One or more candidates 553 ' exist in the excluded word candidate list 550'.
  • The area where the candidates 553' are stored (the area immediately after the Length 552) is called the candidate area 553-0.
  • The number of rows 553-2' is the number of rows in the search target data 31 that include the candidate word 553-1'.
  • The hit location 553-3' holds one or more row numbers, namely those of the rows in the search target data 31 in which the candidate word 553-1' exists.
  • The number of row numbers stored in the hit location 553-3' is equal to the number of rows 553-2'.
  • The co-occurrence information 553-4 includes a column 553-41, a value 553-42, and a co-occurrence rate 553-43.
  • The information search system calculates the co-occurrence rate between the excluded word candidate and each column value of the columns (column 301 to column 305).
  • The column value with the highest co-occurrence rate is stored in the value 553-42, the name of the column to which that column value belongs is stored in the column 553-41, and the co-occurrence rate between the value 553-42 and the excluded word candidate (candidate word 553-1') is stored in the co-occurrence rate 553-43.
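  • Selecting the column value with the highest co-occurrence rate can be sketched as follows. This Python sketch rests on assumptions: rows are dictionaries, the column names and sample values are invented, the returned tuple merely mirrors the column, value, and co-occurrence rate fields, and ties are resolved in favor of the pair encountered first.

```python
from collections import Counter
from typing import Optional


def best_cooccurrence(rows: list[dict], word: str, structured_columns: list[str],
                      text_column: str = "text") -> Optional[tuple[str, str, float]]:
    """Return (column name, column value, rate) with the highest rate for word."""
    containing = [r for r in rows if word in r[text_column]]
    if not containing:
        return None
    # Count how many of the rows containing the word hold each (column, value).
    counts = Counter((col, r[col]) for r in containing for col in structured_columns)
    (column, value), n = max(counts.items(), key=lambda item: item[1])
    return column, value, n / len(containing)


rows = [  # made-up sample data
    {"component": "register", "severity": "high", "text": "log output failed"},
    {"component": "register", "severity": "low", "text": "log file is full"},
    {"component": "billing", "severity": "mid", "text": "log rotation skipped"},
]
best = best_cooccurrence(rows, "log", ["component", "severity"])
```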
  • the hit word information 500 ′ is information similar to the hit location information 500, but is information created in the process of step 40 (processing executed by the excluded word candidate sort program 122 ′).
  • Search word 501 and Length 502 are the same as the information included in hit location information 500 described in the first embodiment.
  • The number of rows 504' is the number of rows in the search target data 31 that contain the hit word. However, only rows containing a hit word that is not part of an excluded word candidate are counted. For example, suppose the search word is “log” and “program” has been specified as an excluded word candidate. A row of the search target data 31 that contains the character string “log” is not counted if every “log” in that row is part of “program”, that is, if the row contains no “log” outside the character string “program”.
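  • The counting rule for the number of rows 504' can be sketched as follows. The Python sketch below is one possible reading, with hypothetical function names and made-up sample rows: an occurrence of the search word is ignored when every character of it lies inside an occurrence of some excluded word candidate, and a row is counted if at least one occurrence remains.

```python
def has_standalone_hit(line: str, search_word: str, excluded_words: list[str]) -> bool:
    """True if line contains search_word somewhere outside every excluded word."""
    covered: set[int] = set()
    for word in excluded_words:
        pos = line.find(word)
        while pos != -1:
            covered.update(range(pos, pos + len(word)))  # offsets inside the word
            pos = line.find(word, pos + 1)
    pos = line.find(search_word)
    while pos != -1:
        if any(i not in covered for i in range(pos, pos + len(search_word))):
            return True                                  # this hit stands alone
        pos = line.find(search_word, pos + 1)
    return False


def count_rows_with_hit(rows: list[str], search_word: str,
                        excluded_words: list[str]) -> int:
    """Number of rows with at least one hit outside all excluded words."""
    return sum(has_standalone_hit(r, search_word, excluded_words) for r in rows)


rows = ["プログラムを起動", "ログを確認", "プログラムのログ"]  # made-up sample rows
row_count = count_rows_with_hit(rows, "ログ", ["プログラム"])
```

  • In this sample the first row is skipped because its only hit lies inside the excluded word candidate, while the third row is counted because it also contains a separate hit.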
  • the co-occurrence information 505 includes a column 505-1, a value 505-2, and a co-occurrence rate 505-3.
  • the information search system calculates the co-occurrence rate between the search word and each column value of the columns (column 301 to column 305), stores the column value with the largest co-occurrence rate in the value 505-2, and the column value Column name is stored in the column 505-1, and the co-occurrence rate of the value 505-2 and the search word is stored in the co-occurrence rate 505-3.
  • The search processing in the second embodiment differs from that described in the first embodiment in that the excluded word candidate extraction program 121' and the excluded word candidate sort program 122', which are called from the search program 120' in step 30 and step 40, create the excluded word candidate list 550' and the hit word information 500'.
  • It also differs in that the search program 120' transmits the hit word information 500' and the excluded word candidate list 550' to the client program 221', and the client program 221' displays them on the output device 25 using the result display screen 450'.
  • Steps 310 to 380 are the same as those described in the first embodiment, so their description is omitted.
  • In steps 390 to 410, the excluded word candidate extraction program 121' creates the excluded word candidate list 550'.
  • The excluded word candidate extraction program 121' determines an excluded word candidate in the same way as described in the first embodiment. When it records the excluded word candidate in the excluded word candidate list 550', it also records the row number of the row in which the excluded word candidate exists in the hit location 553-3'.
  • The excluded word candidate extraction program 121' also records, for each candidate 553' in the excluded word candidate list 550', the number of row numbers recorded in the hit location 553-3' in the number of rows 553-2'. In other respects, the processing is the same as that described in the first embodiment.
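  • The relationship between the hit location 553-3' and the number of rows 553-2' can be sketched as follows. In this Python sketch the function name candidate_record, the dictionary keys, the 1-based row numbering, and the sample rows are all assumptions made for illustration.

```python
def candidate_record(rows: list[str], candidate: str) -> dict:
    """Build one candidate entry: the word, its row numbers, and the row count."""
    hit_rows = [n for n, row in enumerate(rows, start=1) if candidate in row]
    return {"word": candidate, "hit_rows": hit_rows, "num_rows": len(hit_rows)}


rows = ["プログラムを起動", "ログを確認", "プログラムのログ"]  # made-up sample rows
record = candidate_record(rows, "プログラム")
```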
  • The information search system may accept a plurality of search words from the user. When a plurality of search words are designated by the user, the processing described below is executed for each excluded word candidate list 550' that is created.
  • Step 4010 The excluded word candidate sorting program 122 'refers to the excluded word candidate list 550' and selects one candidate 553 '.
  • Step 4020 The excluded word candidate sorting program 122' refers to the hit location 553-3' of the candidate 553' selected in step 4010 and reads, from the rows of the search target data 31, all rows whose row numbers are recorded in the hit location 553-3'. It then calculates, for each column value included in the rows read, the co-occurrence rate of the candidate word 553-1' and that column value. The definition (calculation method) of the co-occurrence rate is as described above.
  • The excluded word candidate sorting program 122' stores the column value having the maximum co-occurrence rate and the name of the column to which it belongs in the value 553-42 and the column 553-41 of the candidate 553', respectively, and stores the co-occurrence rate in the co-occurrence rate 553-43.
  • Step 4030 The excluded word candidate sorting program 122 'determines whether step 4020 has been executed for all candidates 553' in the excluded word candidate list 550 '. When Step 4020 is executed for all candidates 553 '(Step 4030: Yes), Step 4040 is performed next. When the unprocessed candidate 553 'remains (Step 4030: No), the excluded word candidate sort program 122' performs the process from Step 4010 again.
  • Step 4040 In steps 4040 and 4050, the hit word information 500' is created.
  • First, the excluded word candidate sorting program 122' copies the contents of the hit location information 500 to the hit word information 500'. Specifically, the search word 501 and Length 502 of the hit location information 500 are copied to the search word 501 and Length 502 of the hit word information 500'. The excluded word candidate sorting program 122' then reads into the memory 12 all rows of the search target data 31 specified by the row numbers 503-1 of the hit location information 500. Among the rows read, it keeps only those that include a hit word that is not part of an excluded word candidate, and records the number of remaining rows in the number of rows 504' of the hit word information 500'.
  • Step 4050 The excluded word candidate sorting program 122' calculates, for each column value in the rows kept in step 4040 (rows including a hit word that is not part of an excluded word candidate), the co-occurrence rate of the hit word (search word) and that column value. It then stores the column value having the maximum co-occurrence rate and the name of the column to which it belongs in the value 505-2 and the column 505-1 of the co-occurrence information 505, respectively, and stores the co-occurrence rate in the co-occurrence rate 505-3.
  • Step 4060 In steps 4060 to 4080, the plurality of candidates 553' in the excluded word candidate list 550' are rearranged.
  • The excluded word candidate sorting program 122' first reads all candidates 553' in the excluded word candidate list 550'.
  • From the candidates 553' read, the excluded word candidate sorting program 122' selects those whose column 553-41 and value 553-42 are the same as the column 505-1 and value 505-2 of the hit word information 500'.
  • The selected candidates 553' are sorted in descending order of the co-occurrence rate 553-43, and the sorted candidates 553' are stored from the top of the candidate area 553-0 in the excluded word candidate list 550'.
  • Step 4070 Next, the excluded word candidate sorting program 122' selects, from the candidates 553' read, those whose column 553-41 matches the column 505-1 of the hit word information 500' (but whose value 553-42 is not the same as the value 505-2).
  • The candidates 553' selected here are sorted in descending order of the co-occurrence rate 553-43, and the sorted candidates 553' are stored in order in the candidate area 553-0.
  • Step 4080 Finally, the excluded word candidate sorting program 122' sorts the candidates 553' that were not selected in step 4060 or step 4070 in descending order of the co-occurrence rate 553-43. It then stores the sorted candidates 553' in the candidate area 553-0 in order and ends the process. As a result, the candidate area 553-0 of the excluded word candidate list 550' holds the candidates 553' sorted in step 4060, followed by the candidates 553' sorted in step 4070, followed by the candidates 553' sorted in step 4080.
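  • The three-stage ordering of steps 4060 to 4080 can be expressed as a single sort with a compound key. The Python sketch below is illustrative: the function name sort_candidates, the dictionary keys "word", "column", "value", and "rate", and the sample candidates are assumptions, and a single sorted call replaces the three separate selection passes while producing the same order.

```python
def sort_candidates(candidates: list[dict], hit_column: str, hit_value: str) -> list[dict]:
    """Order candidates by closeness to the search word's co-occurrence information."""
    def tier(c: dict) -> int:
        if c["column"] == hit_column and c["value"] == hit_value:
            return 0          # step 4060: same column and same value
        if c["column"] == hit_column:
            return 1          # step 4070: same column, different value
        return 2              # step 4080: everything else
    # Within each tier, a higher co-occurrence rate comes first.
    return sorted(candidates, key=lambda c: (tier(c), -c["rate"]))


candidates = [  # made-up sample candidates
    {"word": "blog", "column": "severity", "value": "high", "rate": 0.9},
    {"word": "program", "column": "component", "value": "billing", "rate": 0.8},
    {"word": "application log", "column": "component", "value": "register", "rate": 1.0},
    {"word": "catalog", "column": "component", "value": "register", "rate": 0.6},
]
ordered = [c["word"] for c in sort_candidates(candidates, "component", "register")]
```

  • Because Python's sort is stable, candidates with the same tier and the same rate keep their original relative order.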
  • FIG. 18 shows an example of a result display screen 450 ′ displayed on the output device 25 (display) of the client 2 by the information search system according to the second embodiment.
  • The viewpoint column 451, the excluded word candidate column 453, and the excluded word number column 454 are the same as those in the result display screen 450 described in the first embodiment.
  • In addition, the result display screen 450' in the second embodiment has a co-occurrence column 456, a co-occurrence value column 457, and a co-occurrence rate column 458, in which the co-occurrence information 553-4 sent from the search program 120' is displayed.
  • For example, the column value “register”, which has a high co-occurrence rate with the search word “log”, is displayed in the co-occurrence value column 457, the column name “component” of the column to which this column value belongs is displayed in the co-occurrence column 456, and the co-occurrence rate (83%) of the search word “log” and the column value “register” is displayed in the co-occurrence rate column 458.
  • Among the information displayed in the co-occurrence column 456, the co-occurrence value column 457, and the co-occurrence rate column 458, the information displayed at the same height as each excluded word candidate (excluded word candidate column 453) is the information about that excluded word candidate.
  • For example, the column value “register”, which has a high co-occurrence rate with the excluded word “application log”, is displayed in the co-occurrence value column 457, the column name “component” of the column to which this column value belongs is displayed in the co-occurrence column 456, and the co-occurrence rate (100%) of the excluded word “application log” and the column value “register” is displayed in the co-occurrence rate column 458.
  • The client program 221' in the second embodiment also displays excluded word candidates in the order in which the candidates 553' are stored in the excluded word candidate list 550'.
  • In the excluded word candidate list 550', the excluded word candidates whose co-occurrence information 553-4 is close (highly similar) to the co-occurrence information 505 of the search word are stored first, in order, as a result of the processing described in FIG. 17. The result display screen 450' therefore displays the excluded word candidates in order of how close their co-occurrence information 553-4 is to the co-occurrence information 505 of the search word.
  • In this way, candidate words that resemble the search word and whose co-occurrence tendency with the column values of a specific column in the search target data is close to that of the search word are displayed preferentially.
  • Candidate words with a close co-occurrence tendency are presumed to be highly relevant to the search word and can be said to be important (requiring consideration) for the user. The information search system according to the present embodiment can display such words preferentially.
  • a search client is provided separately from the search system server, and the user uses an input device and an output device of the client.
  • the client program may be executed by the search system server.
  • the user may issue an information search request using the input device and output device of the search system server.
  • the storage device and the search system server are separate devices, but the storage device may be built in the search system server.
  • the excluded word candidate extraction program 121 identifies an excluded word candidate based on the rule (1) or the rule (2).
  • the excluded word candidate extraction program 121 does not have to perform the excluded word candidate specifying process based on only the rule (1) (or only the rule (2)).
  • the excluded word candidate extraction program 121 may perform both the excluded word candidate specifying process based on the rule (1) and the one based on the rule (2), and only the candidate words identified by both processes may be presented to the user.
  • both the word specified in the excluded word candidate specifying process based on the rule (1) and the word specified in the excluded word candidate specifying process based on the rule (2) may be presented to the user.
  • an example has been described in which a program executed on the server (the excluded word candidate sort program) sorts the excluded word candidates (that is, determines their display order).
  • the display order of the excluded word candidates may be determined by the client 2 sorting the excluded word candidates.

Abstract

In an information search system which is an embodiment of the present invention, upon receiving the specification of a search term, the position at which the specified search term is included, and the character strings appearing before and after said position (adjacent character strings), are identified from the data being searched. On the basis of the characteristics of the adjacent character strings, an assessment is made as to whether the adjacent character strings are character strings which constitute terms to be excluded from the subject of the search (excluded terms). If it is determined that the adjacent character strings constitute excluded terms, a term in which said character strings are conjoined with the search term is determined to be an excluded term candidate.

Description

Information search method
The present invention relates to an information search method.
Search technologies for finding a search target character string (hereinafter referred to as a “search word”) in search target documents have long existed. Most conventional search technologies create an index of the target documents in advance and, when a user issues a search request, refer to that index to find the documents the user wants. With this approach, however, information cannot be searched unless the documents have been indexed in advance. Full-text search technologies have therefore also appeared, which match the full text of the search target documents against the search word when a search request from a user is received.
One problem with full-text search of search target documents is that words unrelated to the search word (hereinafter referred to as “search noise”) are included in the search results. For example, when a full-text search is performed using “ログ” (“log”) as the search word, words such as “プログラム” (“program”), which contain the character string “ログ” but are unrelated to “log”, are also found. This problem is particularly common when searching documents written in a language that places no spaces between words, such as Japanese.
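The search noise described above can be reproduced with a naive substring match. The following Python sketch is illustrative only: the function name is hypothetical and the sample documents are made up, not taken from the patent.

```python
def naive_full_text_search(search_word, documents):
    """Return every document that contains search_word as a substring."""
    return [doc for doc in documents if search_word in doc]


# Made-up sample documents (not from the patent).
documents = [
    "ログを確認する",      # genuinely about "log"
    "このプログラムでは",  # "program": contains "ログ" only by accident
    "検索を実行する",      # unrelated, no match
]

hits = naive_full_text_search("ログ", documents)
for doc in hits:
    print(doc)
```

Both the first and the second document are returned, although only the first one is about logs; the second one is search noise.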
As a technique for removing search noise, Patent Literature 1 discloses removing extended words from search results by using a dictionary for identifying words and an extended-word dictionary for identifying words that contain the search word character string as a substring (hereinafter, such a word is referred to as an “extended word”). For example, suppose that “ログ” (“log”) is specified as the search word character string and that extended words containing “ログ” as a substring (for example, “プログラム”, “program”) are registered in the extended-word dictionary. In this case, occurrences of “プログラム” in the search target documents are excluded from the reported search results, so search noise is reduced.
Japanese Patent Laid-Open No. 11-73429
The technique disclosed in Patent Literature 1 requires an extended-word dictionary to be prepared in advance in order to remove noise from search results. Consequently, when the search target documents contain a word that is not registered in the extended-word dictionary, that word cannot be removed as noise. Maintaining an extended-word dictionary also incurs cost.
When the information search method of the present invention receives the specification of a search word, it identifies, from the search target data, each position at which the specified search word is included and the character strings adjacent to that position before and after it (adjacent character strings). Based on the characteristics of an adjacent character string, the method then determines whether the adjacent character string is a character string that constitutes a word to be excluded from the search target (an excluded word). When it is so determined, the word obtained by concatenating that character string with the search word is determined to be an excluded word candidate.
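The overall idea can be sketched in Python as follows. This is a minimal sketch under a loud assumption: the test applied to adjacent characters here ("the adjacent character is katakana") is only a stand-in chosen for illustration; it is not rule (1) or rule (2) of the patent, which are described elsewhere in the specification. The function names and the sample text are hypothetical.

```python
import unicodedata


def is_katakana(ch):
    """Stand-in test (an assumption, NOT the patent's rule (1) or (2))."""
    return "KATAKANA" in unicodedata.name(ch, "")


def extract_candidates(search_word, text, belongs=is_katakana):
    """Return {candidate word: count} for words that extend search_word."""
    candidates = {}
    start = text.find(search_word)
    while start != -1:
        # Extend to the left over adjacent characters that pass the test.
        left = start
        while left > 0 and belongs(text[left - 1]):
            left -= 1
        # Extend to the right in the same way.
        right = start + len(search_word)
        while right < len(text) and belongs(text[right]):
            right += 1
        word = text[left:right]
        # Only a word longer than the search word itself is a candidate.
        if word != search_word:
            candidates[word] = candidates.get(word, 0) + 1
        start = text.find(search_word, start + 1)
    return candidates


# Made-up sample text.
sample = "このプログラムではアプリケーションログを出力する"
found = extract_candidates("ログ", sample)
print(found)
```

The `belongs` parameter is kept pluggable so that a different characteristic of the adjacent character string can be substituted without changing the scanning logic.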
According to the present invention, search noise can be reduced even when the person searching or analyzing the information has no prior knowledge of the data to be searched or analyzed.
FIG. 1 is a block diagram of the search system.
FIG. 2 is an example of search target data.
FIG. 3 is an example of the input screen.
FIG. 4 is an example of the result display screen.
FIG. 5 is a flowchart of the search process.
FIG. 6 is a diagram showing the format of the hit location information.
FIG. 7 is a diagram showing the format of the excluded word candidate list.
FIG. 8 is a diagram explaining the concept of rule (1).
FIG. 9 is a diagram explaining the concept of rule (2).
FIG. 10 is a flowchart of the excluded word candidate extraction process.
FIG. 11 is an explanatory diagram of the front pointer and the back pointer.
FIG. 12 is a diagram explaining the concept of rule (1).
FIG. 13 is a diagram showing the format of the appearance frequency information.
FIG. 14 is an example of the search target data used in the second embodiment.
FIG. 15 is a diagram showing the format of the excluded word candidate list in the second embodiment.
FIG. 16 is a diagram showing the format of the hit word information in the second embodiment.
FIG. 17 is a flowchart of the excluded word candidate sort process in the second embodiment.
FIG. 18 is an example of the result display screen in the second embodiment.
Embodiments of the present invention are described below with reference to the drawings. The embodiments described below do not limit the invention according to the claims, and not all of the elements and combinations thereof described in the embodiments are necessarily essential to the solution provided by the invention.
In the following description, processing executed by a computer such as a host may be described with a “program” as the subject. In practice, the processing described in a program is performed when a CPU (Central Processing Unit) executes the program, so the entity performing the processing is the processor (CPU); to keep the description from becoming verbose, however, the processing may be described with the program as the subject. Part or all of a program may be implemented by dedicated hardware. The various programs described below may be provided by a program distribution server or a computer-readable storage medium and installed on each device that executes them. A computer-readable storage medium is a non-transitory computer-readable medium, for example a non-volatile storage medium such as an IC card, an SD card, or a DVD.
The configuration of the information search system according to the first embodiment of the present invention is described below. FIG. 1 shows the hardware configuration of the information search system. The information search system includes a search system server 1 (hereinafter abbreviated as “server 1”), a search client 2 (hereinafter abbreviated as “client 2”), and a storage device 3. The server 1 and the client 2 are connected so that they can communicate with each other via a local area network (LAN) 4 configured using, for example, Ethernet. The server 1 is connected to the storage device 3 via a network 5 (also called SAN 5) configured using, for example, Fibre Channel.
The server 1 is a computer that performs search processing based on information search requests received from users of the information search system (hereinafter referred to as “users”). It has a CPU 11, a main memory 12 (hereinafter also called “memory”), a network port 13 for connecting to the LAN 4, an input device 14, an output device 15, and a storage port 16. The memory 12 is a storage device such as a DRAM and is used, when the CPU 11 executes a program, to store the program and the control information used during its execution. The CPU 11 is the component that executes the programs for performing search processing. The input device 14 is a device the user uses to input information, such as a keyboard or a mouse. The output device 15 is a display (output) device such as a display or a printer; in this embodiment, the output device 15 is a display. The storage port 16 is an interface for connecting the server 1 to the storage device 3.
The client 2 is a computer used by the user to instruct the server 1 to search for information and to receive the search results output by the server 1. The client 2 has a CPU 21, a main memory 22, a network port 23 for connecting to the LAN 4, an input device 24, and an output device (display) 25. The CPU 21, memory 22, network port 23, input device 24, and output device 25 are the same as the CPU 11, memory 12, network port 13, input device 14, and output device 15 of the server 1, respectively. In addition to the main memory 22, the client 2 may include an auxiliary storage device such as a magnetic disk.
The storage device 3 is a device that has a non-volatile storage device such as a magnetic disk and stores the information the user searches. In this embodiment, the information to be searched by the user is called “search target data” (“search target data 31” in the figure). The storage device 3 may be a device having a plurality of non-volatile storage devices, such as a so-called disk array (or RAID). The storage device 3 is connected to the storage port 16 of the server 1 via the SAN 5.
The memory 12 of the server 1 stores the programs executed by the server 1 and the control information those programs use. The programs executed by the server 1 include, for example, a search program 120, an excluded word candidate extraction program 121, and an excluded word candidate sort program 122. When the server 1 receives an information search instruction from the user, these programs are executed by the CPU 11. These programs are stored in the storage device 3 while no search processing is being performed and are read from the storage device 3 into the memory 12 when search processing is performed. The control information used by these programs consists of hit location information 500, an excluded word candidate list 550, and appearance frequency information 600, which are described in detail later. The server 1 may also store programs and information other than those mentioned above in the memory 12.
A client program 221 resides in the main memory 22 of the client 2 and is executed by the CPU 21. The client program 221 provides a GUI (Graphical User Interface) through which the user issues information search instructions.
Next, the format of the search target data is described with reference to FIG. 2. As an example, the search target data 31 handled by the information search processing of this embodiment is tabular data composed of a plurality of rows and a plurality of columns, as shown in FIG. 2. The search target data 31 has a plurality of columns 301 to 306. The top row of the search target data 31 holds the column names of columns 301 to 306. In this embodiment, columns 301 to 305 store fixed-form information (structured data) such as names and dates, and column 306 stores text written in natural language. The information stored in column 306 is free-form data (unstructured data), and it is this unstructured data in column 306 that the information search system of this embodiment searches. Columns 301 to 305 may store information expressed as numerical values, such as dates, as well as non-numerical information, such as names. In each embodiment described below, the information stored in columns 301 to 305 is called a “value” even when it is character information rather than a number (for example, character strings such as “search” and “registration” are recorded in column 304).
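One way to picture this layout is the following Python sketch. It assumes each row is held as a record with a row number (column 301), a mapping of structured values, and the free-form text of column 306. The class, the function, the column names "date" and "operation", and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    row_number: int                             # column 301
    values: dict = field(default_factory=dict)  # structured columns (names assumed)
    text: str = ""                              # column 306, unstructured text


def searchable_texts(records):
    """Return only what the search needs: row number -> column 306 text."""
    return {record.row_number: record.text for record in records}


# Made-up sample rows; "date" and "operation" are assumed column names.
records = [
    Record(1, {"date": "2016-01-20", "operation": "検索"}, "このプログラムでは..."),
    Record(2, {"date": "2016-01-21", "operation": "登録"}, "ログを確認する"),
]

texts = searchable_texts(records)
print(texts)
```

Keeping the structured values separate from the text reflects the statement that only column 306 is searched.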
Although an example in which columns 301 to 305 store structured data is described here, columns 301 to 305 may store unstructured data. Also, although this embodiment describes an example in which the search target data is tabular data as in FIG. 2, the search target data is not limited to tabular data. For example, data described in XML or the like, or simply a collection of one or more text files, can also be searched by the information search system of this embodiment.
FIG. 3 shows an example of the search word input screen that the information search system provides to the user. The input screen 400 of FIG. 3 is output (displayed) by the client program 221 on the output device 25 of the client 2, and the user can enter a search word in the input field 401. As shown in FIG. 3, a plurality of search words may be entered in the input field 401. When the user presses the button 402 using the input device 24 (a mouse or keyboard), the client program 221 notifies the server 1 of the search words entered in the input field 401 and causes the server 1 to perform a search.
In the first embodiment, the information search system obtains the number of occurrences of the search word entered in the input field 401 within the free-form text stored in column 306 of the search target data 31. FIG. 4 shows an example of the result display screen 450. The viewpoint column 451 displays the search word the user specified on the input screen 400, and the numerical value 452 displayed below the viewpoint column 451 (called the “count 452”) is the number of occurrences, in the search target data 31, of the search word displayed in the viewpoint column 451. When three search words are entered, as in the example of FIG. 3, the result display screen 450 displays a count 452 for each of the three search words.
The column 453 to the right of the viewpoint column 451 displays excluded word candidates. In this embodiment, a word (character string) containing the search word that the information search system (specifically, the server 1) finds in the search target data 31 is called an excluded word candidate. In the example of FIG. 4, “ログ” (“log”) is displayed in one of the viewpoint columns 451. When the server 1 finds words containing “ログ” in the search target data 31 (in the example of FIG. 4, “プログラム” (“program”), “ブログ” (“blog”), and “アプリケーションログ” (“application log”)), it extracts them as excluded word candidates. At the same time, the server 1 counts the occurrences of each excluded word candidate in the search target data 31. The excluded word candidates and their counts are sent to the client program 221, which displays the candidates in the excluded word candidate column 453 and their counts in the excluded word count column 454 immediately to its right. At this point, the value displayed as the count 452 also includes the occurrences of the excluded word candidates.
A check box 455 is displayed to the right of the excluded word count column 454, one for each excluded word candidate. When the user turns ON the check box 455 of an excluded word candidate using a mouse or the like and presses the Refresh button 460, the word displayed in the excluded word candidate column 453 of that row is excluded from the search target. As a result, the occurrences of that excluded word candidate (the number displayed in the excluded word count column 454) are subtracted from the value displayed as the count 452. In this embodiment, an excluded word candidate that has been excluded from the search target, that is, one whose check box 455 has been turned ON by the user, is called an “excluded word”.
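The arithmetic behind the Refresh operation is a simple subtraction, sketched here in Python. The function name and all of the numbers are made up for illustration; only the subtraction itself comes from the description.

```python
def count_after_exclusion(total_hits, candidate_counts, checked_words):
    """Subtract the counts of the checked (excluded) words from the total."""
    return total_hits - sum(candidate_counts[word] for word in checked_words)


# Made-up figures: 40 raw hits for "ログ", including all candidates below.
total_hits = 40
candidate_counts = {"プログラム": 12, "ブログ": 3, "アプリケーションログ": 5}

# The user checks "プログラム" and "ブログ" and presses Refresh.
remaining = count_after_exclusion(total_hits, candidate_counts, {"プログラム", "ブログ"})
print(remaining)  # 40 - (12 + 3) = 25
```

In this example “アプリケーションログ” is left unchecked, so its occurrences remain part of the displayed count.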
Next, the flow of the processing executed by the information search system is described with reference to FIG. 5. When the user enters a search word on the input screen 400 and presses the button 402, a search instruction is sent from the client program 221 to the search program 120, and the processing of FIG. 5 starts in response. In FIG. 5 and the subsequent figures, the letter “S” preceding a reference number means “step”. Unless otherwise noted, the following description covers the case where only one search word (a word entered in the input field 401) is specified. As stated above, however, the information search system of this embodiment is configured to accept a plurality of search words.
Step 10: The search program 120 reads the search target data 31 and determines whether the search target data 31 (specifically, column 306) contains a word that matches the search word. In this embodiment, the search program 120 does not need to read all of the search target data 31; it only needs to read the text stored in column 306. A word in the search target data 31 that matches the search word is called a “hit word”. The search word and a hit word are, of course, identical in content; the term “hit word” is used when referring not to the word specified by the user (the search word) but to a character string present in the search target data 31 (a word matching the search word) or to its position.
When a hit word exists in the search target data 31 (column 306), the search program 120 records in the memory 12 information on the position of the hit word (called a “hit location”). FIG. 6 shows an example of the hit location information stored in the memory 12 (hereinafter called “hit location information 500”). Hit location information 500 is created for each search word.
The search word 501 stores the search word specified by the client program 221, and Length 502 stores the length of the search word (number of characters). Length 502 is followed by one or more pieces of information on the positions of hit words (hit locations 503). A hit location 503 consists of two pieces of information: a row number 503-1 and an offset 503-2. The row number 503-1 is the number stored in column 301 of the search target data 31, and the offset 503-2 indicates the position from the beginning of the text stored in column 306 of the search target data 31.
For example, if the search word is “ログ” (“log”) and the text “このプログラムでは...” (“In this program...”) is stored in column 306 of the first row of the search target data 31, the characters “ログ” are located at the fourth byte from the beginning of column 306 in that row. In that case, the search program 120 creates a hit location 503 with “1” stored in the row number 503-1 and “4” stored in the offset 503-2 and stores it in the memory 12. The hit location information 500 shown in FIG. 6 is only an example; the hit location information may take any form as long as information that uniquely identifies the position of each hit word is recorded in the memory 12. The frequency 504 stores the number of times the search word 501 appears in the search target data 31, which equals the number of hit locations 503.
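The hit location information can be sketched in Python as follows. The field comments refer to the reference numerals in the description, but the class and function names are hypothetical. The text describes the offset as a byte position; this sketch instead assumes a 1-based character position, which reproduces the value 4 of the example.

```python
from dataclasses import dataclass, field


@dataclass
class HitLocation:
    row_number: int  # 503-1: value of column 301
    offset: int      # 503-2: assumed here to be a 1-based character position


@dataclass
class HitLocationInfo:
    search_word: str                          # 501
    length: int                               # 502: number of characters
    hits: list = field(default_factory=list)  # 503 (one entry per hit word)

    @property
    def frequency(self):                      # 504: equals the number of hits
        return len(self.hits)


def build_hit_location_info(search_word, texts):
    """Scan row number -> column 306 text and record every hit word."""
    info = HitLocationInfo(search_word, len(search_word))
    for row_number, text in texts.items():
        start = text.find(search_word)
        while start != -1:
            info.hits.append(HitLocation(row_number, start + 1))
            start = text.find(search_word, start + 1)
    return info


info = build_hit_location_info("ログ", {1: "このプログラムでは..."})
print(info.hits, info.frequency)
```

Deriving the frequency from the list of hits, rather than storing it separately, keeps it equal to the number of hit locations by construction.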
Step 20: If no hit location information 500 was generated in step 10 (that is, the search target data 31 does not contain the search word), the search program 120 ends. If hit location information 500 was generated (step 20: Yes), the search program 120 proceeds to step 30.
Step 30: The search program 120 calls the excluded word candidate extraction program 121 to identify excluded word candidates. The excluded word candidate extraction program 121 identifies excluded word candidates using the hit location information 500. In doing so, it also counts the number of times each excluded word candidate appears in the search target data 31 (its number of appearances), creates an excluded word candidate list 550 recording the excluded word candidates and their numbers of appearances, and stores the list in the memory 12.
FIG. 7 shows the format of the excluded word candidate list 550 stored in the memory 12. The excluded word candidate list 550 contains a search word 551, Length 552, and candidates 553. The search word 551 and Length 552 are, respectively, the search word specified by the client program 221 and the length of that search word.
A candidate 553 consists of a candidate word 553-1 and a number of appearances 553-2. When a plurality of excluded word candidates are found, the excluded word candidate list 550 contains a plurality of candidates 553. The candidate word 553-1 is an excluded word candidate identified by the search program 120 (more precisely, by the excluded word candidate extraction program 121), and the number of appearances 553-2 is the number of times the candidate word 553-1 appears in the search target data 31. The processing of the excluded word candidate extraction program 121 is described later.
Step 40: The search program 120 calls the excluded word candidate sort program 122 to reorder the excluded word candidates identified in step 30. When the excluded word candidate list 550 contains a plurality of candidates 553, the excluded word candidate sort program 122 sorts them. In the first embodiment, however, sorting is not mandatory, so the search program 120 may skip step 40.
Any sorting method may be used; for example, the candidates 553 may be sorted in descending order of the number of appearances 553-2. In that case, after step 40 is executed, the candidates 553 are stored in the excluded word candidate list 550 in descending order of the number of appearances 553-2.
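The descending-order sort mentioned as an example could look like the following Python sketch. The class and function names are hypothetical and the counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str   # 553-1: candidate word
    count: int  # 553-2: number of appearances


def sort_candidates(candidates):
    """Return the candidates in descending order of number of appearances."""
    return sorted(candidates, key=lambda c: c.count, reverse=True)


# Made-up counts.
candidates = [
    Candidate("ブログ", 3),
    Candidate("プログラム", 12),
    Candidate("アプリケーションログ", 5),
]

ordered = sort_candidates(candidates)
print([c.word for c in ordered])
```

Because Python's `sorted` is stable, candidates with equal counts keep their original relative order.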
Step 50: The search program 120 causes the client program 221 to display the result display screen 450. To do so, the search program 120 sends the client program 221 the search word 501 and the frequency 504 contained in the hit location information 500, together with the excluded word candidate list 550. When there are a plurality of (for example, n) search words, n sets of search word 501, frequency 504, and excluded word candidate list 550 are sent.
Upon receiving this information from the search program 120, the client program 221 displays the result display screen 450 of FIG. 4 on the output device 25 and waits for an instruction from the user. When a plurality of excluded word candidates (candidates 553) have been sent for one search word, the client program 221 displays them in the order in which the candidates 553 are stored in the excluded word candidate list 550. If the candidates 553 were sorted in step 40 so that they are stored in descending order of the number of appearances 553-2, the client program 221 displays the excluded word candidates starting with the most frequent.
When the user turns ON the check box 455 for each word displayed in the excluded word candidate column 453 that should be excluded from the hit count and then presses the Refresh button 460, the client program 221 notifies the search program 120 of the excluded word candidates whose check boxes 455 are ON.
Step 60: The search program 120 calculates the hit count with the excluded word candidates notified by the client program 221 removed, and notifies the client program 221 of the result (the search word 501 and the frequency 504 included in the hit location information 500, and the excluded word candidate list 550). Upon receiving the notification, the client program 221 redisplays the result display screen 450. The redisplay is performed in the same way as the processing in step 50.
Step 70: When the user turns ON a check box 455 for an excluded word candidate on the result display screen 450 redisplayed in step 60 and presses the Refresh button 460, the processing of step 60 is performed again. If the user does not change the check boxes 455 of the excluded word candidates, the search process ends.
Next, the processing of the excluded word candidate extraction program 121 performed in step 30, that is, the method of identifying excluded word candidates, is described. Based on a predetermined rule, the excluded word candidate extraction program 121 according to this embodiment determines whether a character (or character string) adjacent to the front or back of a hit word in the search target data 31 is a character (or character string) that joins with the hit word to form an excluded word candidate. Hereinafter, a character (or character string) adjacent to the front or back of a hit word is called an "adjacent character string". An adjacent character string may be one character or may be two or more characters. Among adjacent character strings, a character (or character string) that joins with the hit word to form an excluded word candidate is called an "excluded character string".
This embodiment describes two kinds of rules as the predetermined rule, but other rules may also be used to identify excluded word candidates. The two rules are described below.
(1) Excluded word candidate identification rule based on character type
Under this rule, whether an adjacent character string is an excluded character string is determined from the character type of the adjacent character string. Here, character type means katakana, hiragana, kanji, and so on. Hereinafter, this rule is called "rule (1)". The excluded word candidate extraction program 121 can identify the character type of an adjacent character string or a hit word from its character codes. Under rule (1), the determination of whether a character belongs to an excluded character string is made one character at a time. In the following description, the character being judged is called the "determination target character".
First, when the determination target character is a character that marks a boundary (hereinafter, a delimiter), such as a punctuation mark ("。" or "、"), a space, or a line break, the determination target character is judged not to be an excluded character string. Delimiters include punctuation marks, spaces, line breaks, tabs, colons, semicolons, and parentheses.
When the determination target character is not a delimiter, it is compared with the character adjacent to it to decide whether it is an excluded character string. If the determination target character has the same character type as the adjacent character, the determination target character is judged to be an excluded character string, because two characters of the same type are likely to belong to the same word. Specifically, when the determination target character and the adjacent character are both katakana (or both hiragana, or both kanji), the determination target character is judged to be an excluded character string. In addition, when the character adjacent to the determination target character is kanji, the adjacent character string is judged to be an excluded character string if the determination target character is katakana, hiragana, or kanji.
Conversely, when the determination target character has a different character type from the character adjacent to it, the adjacent character string is judged not to be an excluded character string. Specifically, when the determination target character is hiragana and the adjacent character is katakana (or vice versa), the adjacent character string is not an excluded character string.
An example of identifying an excluded word candidate under rule (1) is described with reference to FIG. 8. The example of FIG. 8 assumes that the search word is "ログ" (log) and that the search target data 31 contains the character string "このプログラムでは" (in this program). In this example, the characters in front of the hit word "ログ" are, in order, "プ", "の", and so on. Because "プ" has the same character type (katakana) as the hit word "ログ", it is judged to be an excluded character string. In contrast, "の", the character in front of "プ", has a different character type (hiragana) from its neighbor "プ", so it is judged not to be an excluded character string.
The same determination is made for the characters following the hit word "ログ" ("ラ", "ム", "で", and so on). Of these, "ラ" and "ム" have the same character type (katakana) as the hit word, so they are judged to be excluded character strings. The next character, "で", has a different character type (hiragana) from its neighbor "ム", so it is judged not to be an excluded character string.
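The per-character judgment of rule (1) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `char_type` and `is_excluded_char`, the delimiter set, and the use of Unicode block ranges to classify characters are choices made for this sketch and are not taken from the specification, which says only that the character type is derived from the character code.

```python
# Assumed delimiter set: punctuation, spaces, line breaks, tabs,
# colons, semicolons, and parentheses (ASCII and full-width forms).
DELIMITERS = set("。、.,:;:;()() \u3000\n\t")

JAPANESE_TYPES = {"hiragana", "katakana", "kanji"}


def char_type(ch):
    """Classify one character by its code point (approximate block ranges)."""
    if ch in DELIMITERS:
        return "delimiter"
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def is_excluded_char(target, neighbor):
    """Rule (1): does `target` join the word that `neighbor` belongs to?

    `neighbor` is the adjacent character on the hit-word side.
    """
    t, n = char_type(target), char_type(neighbor)
    if t not in JAPANESE_TYPES:
        # Delimiters never join; other scripts are ignored in this sketch.
        return False
    if n == "kanji":
        # Next to a kanji, katakana, hiragana, and kanji all join.
        return True
    return t == n
```

With the FIG. 8 string, `is_excluded_char("プ", "ロ")` is true and `is_excluded_char("の", "プ")` is false, matching the walkthrough above. Latin letters and digits fall into "other" here and are never joined; a real implementation would likely need further character classes.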
(2) Excluded word identification rule using the 2-gram (bi-gram) method
Under this rule, whether an adjacent character string is an excluded character string is determined by comparing the appearance frequency of the search word with the appearance frequency of the adjacent character string. Hereinafter, this rule is called "rule (2)". An example of identifying an excluded word candidate under rule (2) is described with reference to FIG. 9.
As in FIG. 8, the example of FIG. 9 assumes that the search word is "ログ" and that the search target data 31 contains the character string "このプログラムでは". FIG. 9 also shows a case in which the character string "ログ", the hit word matching the search word, appears in the search target data 31 12000 times (its appearance frequency). Under rule (2), the appearance frequency in the search target data 31 of each character string adjacent to the hit word is obtained first. The adjacent character string may have any length, but this embodiment describes the case in which its length is 2.
Under rule (2), the appearance frequency is obtained for every two-character string in the search target data 31, in the same way as when a so-called bi-gram index is created. In the example of FIG. 9, "プロ", the two-character string located toward the front of the hit word "ログ", appears in the search target data 31 10000 times, and "グラ" and "ラム", the two-character strings located toward the back of "ログ", each appear 8000 times. In other words, the numbers of appearances of "プロ", "グラ", and "ラム" are all comparable to that of the search word "ログ" (12000). Under rule (2), when the number of appearances of a two-character string is comparable to that of the search word, the two-character string ("プロ", "グラ", and "ラム" in the example of FIG. 9) is judged to be an excluded character string. This is because a character string (adjacent character string) that appears about as often as the search word is likely to join with the search word (hit word) to form a single word.
On the other hand, the two-character string "のプ" located in front of "プロ" appears 800 times, and the two-character string "ムで" located behind "ラム" appears 100 times; both differ greatly from the number of appearances of the search word "ログ" (12000). Under rule (2), the two-character strings "のプ" and "ムで" are therefore judged not to be excluded character strings. Likewise, character strings located further forward than "のプ" (such as "この") and further back than "ムで" (such as "では") are judged not to be excluded character strings. Various methods can be used to decide whether the number of appearances of a two-character string is comparable to that of the search word. For example, one method checks whether the absolute value of the difference between the two is less than a predetermined value, and another checks whether the ratio of the two is close to 1.
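The frequency comparison of rule (2) can be sketched with the FIG. 9 counts. The function name `is_similar_frequency` is hypothetical, and the default range of 0.5 to 2 for the ratio is the example range the specification mentions for step 340, used here as one possible threshold rather than a required one.

```python
def is_similar_frequency(bigram_count, hit_count, low=0.5, high=2.0):
    """Judge whether a two-character string appears about as often as the hit word.

    Uses the ratio bigram_count / hit_count and an assumed inclusive range.
    """
    if hit_count == 0:
        return False
    ratio = bigram_count / hit_count
    return low <= ratio <= high


# Counts taken from the FIG. 9 example.
HIT_COUNT = 12000  # "ログ"
fig9_counts = {"のプ": 800, "プロ": 10000, "グラ": 8000, "ラム": 8000, "ムで": 100}

excluded = {s: is_similar_frequency(c, HIT_COUNT) for s, c in fig9_counts.items()}
```

Under these assumptions "プロ", "グラ", and "ラム" are judged to be excluded character strings, while "のプ" and "ムで" are not, which agrees with the outcome described for FIG. 9.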
The flow of the processing of the excluded word candidate extraction program 121 performed in step 30 is described with reference to FIG. 10. The information search system according to this embodiment is configured so that either the identification method based on rule (1) or the identification method based on rule (2) can be selected. The storage device 3 stores the connection rule 32, which holds the information for rule (1) and rule (2) described above. The excluded word candidate extraction program 121 reads either rule (1) or rule (2) from the connection rule 32 according to the user's designation. Alternatively, the system may be configured so that an administrator of the information search system, who is not the user, decides which rule (rule (1) or rule (2)) the excluded word candidate extraction program 121 should read and designates it to the search program 120 (or the excluded word candidate extraction program 121).
Step 310: The excluded word candidate extraction program 121 reads the information of rule (1) or rule (2) from the connection rule 32 in the storage device 3. The case in which the information of rule (1) is read in step 310 (that is, the case in which excluded word candidates are identified based on rule (1)) is described first.
Step 320: The excluded word candidate extraction program 121 selects one hit location 503 from the hit location information 500 and identifies the position at which the hit word appears in the search target data 31. The excluded word candidate extraction program 121 also prepares two variables (a forward pointer H and a backward pointer T) for identifying the character strings adjacent to the hit word. As stated in the description of the hit location information 500, two kinds of information, a row number and an offset, are used to identify a position in the search target data 31, but only the offset is stored in the forward pointer H and the backward pointer T. In another embodiment, however, a pair of row number and offset may be stored in the forward pointer H and the backward pointer T.
In step 320, the excluded word candidate extraction program 121 sets initial values in the forward pointer H and the backward pointer T. The two pointers are described with reference to FIG. 11. The forward pointer H points to a character located before the position at which the hit word appears, and its initial value is the position one character before the start of the hit word. In the example of FIG. 11, the position of the hit word "ログ" (more precisely, the position of "ロ", the first character of the string "ログ") is (row number = 22, offset = 4). The initial value of the forward pointer H is therefore set to 3 (= 4 - 1). The backward pointer T points to a character located after the hit word, and its initial value is the position of the character following the last character of the hit word. In the example of FIG. 11, the position of "ラ", the character immediately after the hit word "ログ", that is, 6, is stored as the initial value.
Step 340: The excluded word candidate extraction program 121 makes a determination for the character string pointed to by the forward pointer H. The determination method is as described above: when rule (1) is used, the excluded word candidate extraction program 121 determines from the character type whether the selected character string is an excluded character string. To do so, it compares the character type of the character pointed to by the forward pointer H with that of the character adjacent behind it. If the comparison shows that the character pointed to by the forward pointer H satisfies rule (1) described above, the excluded word candidate extraction program 121 judges that character to be an excluded character string. If the character pointed to by the forward pointer H is a delimiter, the excluded word candidate extraction program 121 judges that it is not an excluded character string.
Step 350: The excluded word candidate extraction program 121 updates the forward pointer H so that it points to the character one position before the character it currently points to. Specifically, the excluded word candidate extraction program 121 subtracts 1 from the value of the forward pointer H. It then executes step 340 for the character string pointed to by the updated forward pointer H. The excluded word candidate extraction program 121 repeats steps 340 and 350 until the character string pointed to by the forward pointer H is no longer judged to be an excluded character string (or until the forward pointer H becomes 0).
In the example of FIG. 11, the forward pointer H initially points to the character "プ". In the determination of step 340, the character "プ" is found to have the same character type (katakana) as the search word "ログ", so step 350 is executed next. In step 350, the excluded word candidate extraction program 121 makes the forward pointer H point to the character adjacent in front of "プ" (that is, "の") by subtracting 1 from H to give 2, and performs the determination of step 340 again.
The character "の" has a different character type (hiragana) from its neighbor "プ". When the excluded word candidate extraction program 121 performs the determination of step 340 for "の", the selected character "の" is therefore judged not to be an excluded character string. As a result, in the example of FIG. 11, the character "プ" is judged to be an excluded character string, but the characters located further forward ("の" and "こ") are judged not to be excluded character strings.
Step 370: The excluded word candidate extraction program 121 makes a determination for the character string pointed to by the backward pointer T. The determination method is the same as in step 340. That is, the excluded word candidate extraction program 121 compares the character type of the character pointed to by the backward pointer T with that of the character adjacent in front of it and determines whether the two are the same (if they are, the character pointed to by the backward pointer T is judged to be an excluded character string). If the character pointed to by the backward pointer T is a delimiter, it is judged not to be an excluded character string. If the selected character string is judged to be an excluded character string (step 370: connected), step 380 is performed next; otherwise, step 390 is performed next.
Step 380: The excluded word candidate extraction program 121 updates the backward pointer T so that it points to the character one position after the character it currently points to. This is the reverse of the update of the forward pointer H: the excluded word candidate extraction program 121 adds 1 to the backward pointer T, thereby selecting the one character adjacent behind the character string judged in step 370. The excluded word candidate extraction program 121 repeats steps 370 and 380 until the character string pointed to by the backward pointer T is no longer judged to be an excluded character string (or until the backward pointer T points to the end of the search target data 31).
In the example of FIG. 11, the backward pointer T initially points to the character "ラ". In the determination of step 370, the character "ラ" is found to have the same character type (katakana) as the search word "ログ", so step 380 is executed next. In step 380, the excluded word candidate extraction program 121 updates the backward pointer T so that it points to the character adjacent behind "ラ" (that is, "ム") by adding 1 to T to give 7, and performs the determination of step 370 again.
Because the character "ム" has the same character type as the character "ラ" adjacent in front of it, the determination of step 370 judges the selected character "ム" to be an excluded character string. The excluded word candidate extraction program 121 therefore executes step 380 again. After step 380, the backward pointer T points to the character "で", so the determination of step 370 is performed for "で".
Because the character "で" has a different character type (hiragana) from the character "ム" adjacent in front of it, the determination of step 370 judges that "で" is not an excluded character string. As a result, in the example of FIG. 11, the characters "ラ" and "ム" are judged to be excluded character strings, but the characters located further back ("で" and "は") are judged not to be excluded character strings.
Step 390: The excluded word candidate extraction program 121 adds 1 to the forward pointer H and subtracts 1 from the backward pointer T. It then takes the character string whose first character is the one pointed to by the forward pointer H and whose last character is the one pointed to by the backward pointer T as an excluded word candidate, and records that candidate in a candidate 553 of the excluded word candidate list 550. For example, when the processing up to step 380 has been executed on the character string shown in FIG. 11, the forward pointer H points to the character "の" and the backward pointer T points to the character "で". In step 390, the excluded word candidate extraction program 121 adds 1 to the offset of the forward pointer H and subtracts 1 from the offset of the backward pointer T, so that the forward pointer H points to the character "プ" and the backward pointer T points to the character "ム". The character string "プログラム" (program) is therefore determined to be an excluded word candidate.
Of course, there are cases in which no excluded word candidate is obtained. For example, as in the example shown in FIG. 12, when the search target data 31 contains the character string "検出の為にログが", the characters immediately before and after the hit word "ログ" are both hiragana, which differs from the character type (katakana) of the hit word. In this case, the characters before and after the hit word "ログ" ("に" and "が") are judged not to be excluded character strings, so no excluded word candidate is obtained and nothing is recorded in the excluded word candidate list 550. Likewise, when the excluded word candidate obtained in step 390 is already recorded in the excluded word candidate list 550, it is not recorded again.
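The pointer walk of steps 320 to 390 under rule (1) can be sketched as follows. This is a minimal illustration, not the implementation of the specification: it uses 0-based offsets within a single row (the FIG. 11 walkthrough appears to count from 1), the names `char_type`, `joins`, and `extract_candidate` are hypothetical, and delimiters and non-Japanese characters are lumped together as "other" for brevity.

```python
def char_type(ch):
    """Classify a character by approximate Unicode block."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"  # delimiters and everything else


def joins(target, neighbor):
    """Rule (1) judgment for one determination target character."""
    t, n = char_type(target), char_type(neighbor)
    if t == "other":
        return False
    return t == n or n == "kanji"


def extract_candidate(row, hit_start, hit_len):
    """Return the excluded word candidate around a hit, or None if there is none."""
    h = hit_start - 1        # step 320: forward pointer H
    t = hit_start + hit_len  # step 320: backward pointer T
    # Steps 340/350: move H toward the front while the character still joins.
    while h >= 0 and joins(row[h], row[h + 1]):
        h -= 1
    # Steps 370/380: move T toward the back while the character still joins.
    while t < len(row) and joins(row[t], row[t - 1]):
        t += 1
    # Step 390: step both pointers back onto the last joined characters.
    h += 1
    t -= 1
    if h == hit_start and t == hit_start + hit_len - 1:
        return None  # nothing joined; the hit word stands alone
    return row[h:t + 1]


fig11 = extract_candidate("このプログラムでは", 3, 2)
fig12 = extract_candidate("検出の為にログが", 5, 2)
```

For the FIG. 11 string the sketch yields "プログラム", and for the FIG. 12 string it yields no candidate. A caller would add each non-empty result to the candidate list only if it is not already present, as described above.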
Step 400: The excluded word candidate extraction program 121 determines whether the processing of steps 320 to 380 has been executed for all hit locations 503 stored in the hit location information 500. If it has not yet been executed for all hit locations 503 (step 400: No), the excluded word candidate extraction program 121 repeats the processing from step 320. If it has been executed for all hit locations 503 (step 400: Yes), step 410 is performed next.
Step 410: The excluded word candidate extraction program 121 counts, for each excluded word candidate recorded in the excluded word candidate list 550, the number of times it appears in the search target data 31. It then records the counted number of appearances in the excluded word candidate list 550 (number of appearances 553-2) and ends the processing.
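The counting in step 410 can be sketched as below. The function name `count_appearances` and the sample rows are hypothetical, and the search target data is assumed here to be a simple list of strings; the specification does not prescribe how the count is carried out.

```python
def count_appearances(rows, candidate_words):
    """Count how many times each candidate word appears across all rows."""
    return {
        word: sum(row.count(word) for row in rows)
        for word in candidate_words
    }


# Illustrative rows only; not data from the specification.
rows = [
    "このプログラムではログを出力する",
    "プログラムのログ",
    "カタログを参照",
]
counts = count_appearances(rows, ["プログラム", "カタログ"])
```

`str.count` counts non-overlapping occurrences, which is sufficient for this illustration because the sample candidates cannot overlap with themselves.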
When rule (1) is read in step 310, excluded word candidates are identified by the flow described above. The case in which rule (2) is read in step 310 is described next. Because the processing in this case follows the same flow, the description below focuses on the differences.
When rule (2) is read in step 310, the excluded word candidate extraction program 121 counts the appearance frequency of each character string in the search target data 31 and creates the appearance frequency information 600 before executing step 320. An example of the appearance frequency information 600 is shown in FIG. 13. The excluded word candidate extraction program 121 decomposes all data in the search target data 31 (in its column 306) into two-character strings by a method similar to the bi-gram method.
As a simple example, when the search target data 31 contains the character string “このプログラムでは” (“in this program”) as in FIG. 9, the excluded word candidate extraction program 121 extracts two-character strings such as “この”, “のプ”, and “プロ” from the search target data 31 and stores them in the character string 610 column of the appearance frequency information 600. The excluded word candidate extraction program 121 then counts the number of times each character string stored in the character string 610 column appears in the search target data 31 and stores the result in the number of appearances column 620.
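The bi-gram counting described above can be sketched as follows. This is a minimal illustration, not the program 121 itself: the function name `bigram_counts` and the use of a `Counter` to stand in for the appearance frequency information 600 are assumptions made for the example.

```python
from collections import Counter


def bigram_counts(texts):
    """Count every overlapping two-character string in the given texts.

    The returned Counter plays the role of the appearance frequency
    information 600 (character string 610 -> number of appearances 620).
    """
    counts = Counter()
    for text in texts:
        # Slide a two-character window over the text, one character at a time.
        for i in range(len(text) - 1):
            counts[text[i:i + 2]] += 1
    return counts


# Hypothetical contents of column 306 (free-text column).
counts = bigram_counts(["このプログラムでは", "ログ", "ブログ"])
```

With this input, `counts["プロ"]` is 1 and `counts["ログ"]` is 3, because “ログ” occurs once in each of the three strings.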
The processing performed in step 320 is the same as in the case where rule (1) is read, so its description is omitted here.
In step 340, the excluded word candidate extraction program 121 first selects the two-character string consisting of the character pointed to by the forward pointer H and the one character that immediately follows it. For example, when the forward pointer H points to the character “プ” as shown in FIG. 11, the string made up of “プ” and the following character “ロ” (that is, “プロ”) is selected. The excluded word candidate extraction program 121 then refers to the appearance frequency information 600 created in step 310 and compares the number of appearances 620 of the selected string with the number of appearances of the hit word to determine whether the two appear with comparable frequency. As noted earlier, various methods can be used for this determination. For example, when
“number of appearances of the selected string ÷ number of appearances of the hit word”
falls within a predetermined range (for example, 0.5 to 2), the two may be judged to appear with comparable frequency, and the selected string may be judged to be an excluded character string.
If the two appear with comparable frequency (step 340: connected), step 350 is executed next. If they do not (step 340: not connected), step 360 is executed next.
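The frequency comparison used in step 340 can be written as a small predicate. The sketch below assumes the example range of 0.5 to 2 mentioned in the text; the function name and the handling of a zero hit count are assumptions, since the text leaves the exact method open.

```python
def appears_with_similar_frequency(string_count, hit_count, low=0.5, high=2.0):
    """Return True when string_count / hit_count lies within [low, high].

    string_count: number of appearances 620 of the selected two-character string
    hit_count:    number of appearances of the hit word
    The bounds 0.5 and 2.0 are the example range given in the description.
    """
    if hit_count == 0:
        # Assumption: with no hits there is nothing to compare against.
        return False
    return low <= string_count / hit_count <= high


connected = appears_with_similar_frequency(9, 10)       # ratio 0.9
not_connected = appears_with_similar_frequency(100, 10)  # ratio 10.0
```

Here `connected` is `True` and `not_connected` is `False`, which corresponds to the “connected” and “not connected” branches of step 340.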
In step 350, the excluded word candidate extraction program 121 subtracts 1 from the forward pointer H, exactly as in the case where rule (1) is read.
In step 370, the excluded word candidate extraction program 121 selects the two-character string consisting of the character pointed to by the backward pointer T and the one character that immediately precedes it. For example, when the backward pointer T points to the character “ラ” as shown in FIG. 11, the string made up of “ラ” and the preceding character “グ” (that is, “グラ”) is selected. The excluded word candidate extraction program 121 then refers to the appearance frequency information 600 created in step 310 and compares the number of appearances 620 of the selected string with the number of appearances of the hit word to determine whether the two appear with comparable frequency. This determination is the same as the one performed in step 340. If the two appear with comparable frequency (step 370: connected), step 380 is executed next; if they do not (step 370: not connected), step 390 is executed next.
In step 380, as described above, the excluded word candidate extraction program 121 adds 1 to the backward pointer T. Step 370 is then performed again. The excluded word candidate extraction program 121 repeats steps 370 and 380 until the selected string is no longer judged to be an excluded character string (or until the backward pointer T reaches the end of the search target data 31).
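The pointer movement of steps 340 to 380 under rule (2) can be illustrated roughly as follows. This is only a sketch under several assumptions: H is taken to start on the character just before the hit word and T on the character just after it, the candidate is taken to be the text strictly between the final H and T, and the names `expand_hit` and `similar` are invented for the example. The initialization and the final extraction belong to steps described elsewhere in the specification.

```python
def similar(string_count, hit_count, low=0.5, high=2.0):
    """Example comparison: ratio of appearances within [low, high]."""
    return hit_count > 0 and low <= string_count / hit_count <= high


def expand_hit(text, start, length, counts, hit_count):
    """Widen a hit at text[start:start+length] using bi-gram frequencies.

    counts maps two-character strings to their number of appearances.
    """
    h = start - 1  # forward pointer H (assumed initial position)
    # Steps 340/350: while the bi-gram starting at H is "connected", move H back.
    while h >= 0 and similar(counts.get(text[h:h + 2], 0), hit_count):
        h -= 1
    t = start + length  # backward pointer T (assumed initial position)
    # Steps 370/380: while the bi-gram ending at T is "connected", move T forward.
    while t < len(text) and similar(counts.get(text[t - 1:t + 1], 0), hit_count):
        t += 1
    return text[h + 1:t]


# Hypothetical frequencies: bi-grams inside "プログラム" are about as frequent
# as the hit word, while the bi-grams at its edges are far more frequent.
counts = {"プロ": 9, "グラ": 9, "ラム": 9, "のプ": 100, "ムで": 100}
candidate = expand_hit("このプログラムでは", 3, 2, counts, hit_count=10)
```

With these made-up counts, `candidate` is “プログラム”. When neither pointer moves, the function returns the hit word itself, which a caller would presumably not record as an excluded word candidate.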
The processing of steps 390 to 410 is the same as in the case where rule (1) is read.
The above is how the excluded word candidate extraction program 121 identifies excluded word candidates. In the example described above, the excluded word candidate extraction program 121 first reads the information of rule (1) or rule (2). As another embodiment, the information of rule (1) or rule (2) may be embedded in the excluded word candidate extraction program 121 in advance, in which case step 310 need not be executed.
The above is the information search method according to the first embodiment. The information search system according to the first embodiment identifies excluded word candidates based on the characteristics (character type and appearance frequency) of the character strings adjacent to a hit word, and presents the identified candidates to the user. Because only information about the strings adjacent to the hit word (characteristic information such as character type and appearance frequency) is used to identify excluded word candidates, the information search system according to this embodiment needs no dictionary for finding them. The user can designate, from among the excluded word candidates presented by the information search system, the words to be excluded from the search result (excluded words); once excluded words are designated, the information search system presents a search result that does not contain them. The information search system according to this embodiment can thus reduce search noise in the search result even when the user has little knowledge of the data being searched and analyzed.
In the first embodiment, the number of appearances of the search word (or of an excluded word candidate) is output (displayed) as the search result, but the way the search result is provided to the user is not limited to the number of appearances. For example, the information search system may be configured to output the contents of the rows containing a hit word to the output device 25. Also, in the first embodiment the information search system outputs (displays), as the number of appearances, the number of occurrences of the search word (or excluded word candidate) in the search target data 31; the information search system may instead output (display) the number of rows containing the search word.
Next, the information search method according to the second embodiment is described. The hardware configuration of the information search system according to the second embodiment is the same as that described in the first embodiment.
In the information search system according to the second embodiment, the server 1 executes a search program 120′, an excluded word candidate extraction program 121′, and an excluded word candidate sort program 122′. These programs are almost the same as the search program 120, the excluded word candidate extraction program 121, and the excluded word candidate sort program 122 described in the first embodiment, so the description below focuses on the differences.
The client 2, on the other hand, executes a client program 221′. Like the client program 221 described in the first embodiment, the client program 221′ provides a GUI (Graphical User Interface) through which the user issues information search instructions. However, the information it outputs to the output device 25 differs slightly from that in the first embodiment.
The information search system according to the first embodiment determines whether a string is an excluded word candidate from the character type and appearance frequency of the strings adjacent to the hit word. The information search system according to the second embodiment makes the same determination. In addition, whereas the information search system according to the first embodiment uses only the unstructured text information (column 306) in the search target data 31, the information search system according to the second embodiment also uses the structured data in columns 301 to 305 to determine the display order of the excluded word candidates.
The concept of the method performed by the information search system according to the second embodiment is described with reference to FIG. 14. FIG. 14 shows an example of the search target data 31; its contents are the same as those shown in FIG. 2.
In the information search system according to the second embodiment, excluded word candidates are determined by the same method as in the first embodiment. When the search word is “ログ” (“log”), searching the text data in column 306 identifies, for example, “ブログ” (“blog”) and “プログラム” (“program”) as excluded word candidates.
The information search system according to the second embodiment further uses the values stored in the columns of the search target data 31 other than column 306 to determine the display priority of the excluded word candidates. For example, when the columns other than column 306 are examined for each row in which the search word “ログ” (“log”) appears, the same value may be found in each of those rows. In the example of FIG. 14, the search word “ログ” appears in two rows, and in both rows column 304 (the column named “コンポ”, “component”) holds the value “登録” (“registration”). In other words, in the example of FIG. 14, the value “登録” in column 304 co-occurs with the search word “ログ”.
Likewise, when the columns other than column 306 are examined for the rows in which an excluded word candidate (for example, “ブログ”, “blog”) appears, there may be a value that co-occurs with the excluded word candidate. When the value with a high co-occurrence rate with an excluded word candidate is the same as the value with a high co-occurrence rate with the search word, the excluded word candidate is considered to be closely related to the search word and of high importance to the user. Such words are better displayed preferentially, so the information search system according to this embodiment extracts the value with a high co-occurrence rate with the search word and the value with a high co-occurrence rate with each excluded word candidate, and determines the display priority of the excluded word candidates based on them.
The co-occurrence rate in this embodiment is defined as follows. First, the co-occurrence rate between a hit word and a given value (called a column value) in a given column (a column other than column 306) is determined as follows. Let A be the number of rows of the search target data 31 in which the hit word exists. Let B be the number of those rows in which the column in question (a column other than column 306) holds that column value. The co-occurrence rate between the hit word and the column value is then defined as B ÷ A.
The co-occurrence rate between an excluded word candidate and a column value is defined in the same way. That is, when A′ is the number of rows of the search target data 31 in which the excluded word candidate exists, and B′ is the number of those rows in which the column in question (a column other than column 306) holds the column value, the co-occurrence rate between the excluded word candidate and the column value is defined as B′ ÷ A′.
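The definition B ÷ A (and likewise B′ ÷ A′) can be checked with a short example. The row layout (a list of dictionaries), the column names `component` and `text`, and the function name are assumptions chosen for illustration; plain substring matching stands in for the search step.

```python
def co_occurrence_rate(rows, word, column, value, text_column="text"):
    """Return B / A for a word and a column value.

    A: number of rows whose free text contains the word
    B: number of those rows whose `column` holds `value`
    """
    rows_with_word = [row for row in rows if word in row[text_column]]
    if not rows_with_word:
        return 0.0  # assumption: define the rate as 0 when the word never appears
    b = sum(1 for row in rows_with_word if row[column] == value)
    return b / len(rows_with_word)


# Hypothetical search target data: "text" plays the role of column 306.
rows = [
    {"component": "登録", "text": "ログが出力されない"},
    {"component": "登録", "text": "ログを確認した"},
    {"component": "検索", "text": "ログが消えた"},
    {"component": "検索", "text": "応答が遅い"},
]
rate = co_occurrence_rate(rows, "ログ", "component", "登録")
```

Here A is 3 and B is 2, so `rate` is 2/3. The same function gives B′ ÷ A′ when an excluded word candidate is passed as `word`.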
Next, the information used by the information search system according to the second embodiment is described. The information search system according to the first embodiment creates the hit location information 500, the excluded word candidate list 550, and the appearance frequency information 600. Of these, the hit location information 500 and the appearance frequency information 600 are created in the same form by the information search system according to the second embodiment, so they are not described in detail again here.
The information search system according to the second embodiment creates an excluded word candidate list 550′ in place of the excluded word candidate list 550 described in the first embodiment. It also creates hit word information 500′. These two pieces of information are described below.
The format of the excluded word candidate list 550′ is described with reference to FIG. 15. The excluded word candidate list 550′ is created by the excluded word candidate extraction program 121′ and the excluded word candidate sort program 122′. It contains a search word 551, a Length 552, and candidates 553′. The search word 551 and the Length 552 are the same as those in the excluded word candidate list 550 of the first embodiment. The excluded word candidate list 550′ contains one or more candidates 553′. The area of the excluded word candidate list 550′ in which the candidates 553′ are stored (the area immediately after the Length 552) is called the candidate area 553-0.
Each candidate 553′ contains a candidate word 553-1′, a number of rows 553-2′, hit locations 553-3′, and co-occurrence information 553-4. The candidate word 553-1′ is the same as the candidate word 553-1 described in the first embodiment. The number of rows 553-2′ is the number of rows of the search target data 31 that contain the candidate word 553-1′. As shown in FIG. 15, the hit locations 553-3′ contain one or more row numbers of the rows of the search target data 31 in which the candidate word 553-1′ exists. The number of row numbers stored in the hit locations 553-3′ is equal to the number of rows 553-2′.
The co-occurrence information 553-4 contains a column 553-41, a value 553-42, and a co-occurrence rate 553-43. For each excluded word candidate (the word stored in the candidate word 553-1′), the information search system calculates the co-occurrence rate between the excluded word candidate and each column value in columns 301 to 305. It then stores the column value with the highest co-occurrence rate in the value 553-42, stores the name of the column to which that column value belongs in the column 553-41, and stores the co-occurrence rate between the value 553-42 and the excluded word candidate (candidate word 553-1′) in the co-occurrence rate 553-43.
Next, the format of the hit word information 500′ created by the information search system according to the second embodiment is described with reference to FIG. 16. The hit word information 500′ is similar to the hit location information 500, but it is created in the course of step 40 (the processing executed by the excluded word candidate sort program 122′).
The search word 501 and the Length 502 are the same as the corresponding items in the hit location information 500 described in the first embodiment. The number of rows 504′ is the number of rows of the search target data 31 that contain a hit word. When these rows are counted, however, only rows containing a hit word that is not part of an excluded word candidate are counted. For example, suppose the search word is “ログ” (“log”) and “プログラム” (“program”) has been identified as an excluded word candidate. A row of the search target data 31 that contains the string “ログ” is not counted if that “ログ” occurs only inside the excluded word candidate “プログラム”, that is, if the row contains no string including “ログ” other than “プログラム”.
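One way to realize this counting rule is to blank out every excluded word candidate in a row and then test whether the search word still appears. The sketch below is an assumed approach, not the method stated in the specification; the function name and the NUL-character masking are choices made for the example.

```python
def count_rows_with_genuine_hit(texts, search_word, candidates):
    """Count rows that contain the search word outside every excluded word candidate."""
    count = 0
    for text in texts:
        masked = text
        # Mask longer candidates first so that a shorter candidate contained
        # in a longer one cannot leave fragments behind.
        for cand in sorted(candidates, key=len, reverse=True):
            # Replace with same-length filler so neighbouring characters
            # are not joined into a new, spurious match.
            masked = masked.replace(cand, "\x00" * len(cand))
        if search_word in masked:
            count += 1
    return count


texts = [
    "プログラムのログを確認",  # contains a genuine "ログ"
    "プログラムを修正",        # "ログ" only inside "プログラム"
    "ログが出ない",            # genuine "ログ"
    "ブログを更新",            # "ログ" only inside "ブログ"
]
row_count = count_rows_with_genuine_hit(texts, "ログ", ["プログラム", "ブログ"])
```

In this example `row_count` is 2, which is the value that would be recorded as the number of rows 504′.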
The co-occurrence information 505 contains a column 505-1, a value 505-2, and a co-occurrence rate 505-3. The information search system calculates the co-occurrence rate between the search word and each column value in columns 301 to 305, stores the column value with the highest co-occurrence rate in the value 505-2, stores the name of the column to which that column value belongs in the column 505-1, and stores the co-occurrence rate between the value 505-2 and the search word in the co-occurrence rate 505-3.
Next, the flow of the search processing performed by the information search system according to the second embodiment is described. The processing executed by the search program 120′ in the second embodiment is almost the same as that described in the first embodiment (FIG. 5), so the differences between the two are described with reference to FIG. 5.
The search processing in the second embodiment differs from that of the first embodiment in that the excluded word candidate extraction program 121′ and the excluded word candidate sort program 122′, which the search program 120′ calls in steps 30 and 40, create the excluded word candidate list 550′ and the hit word information 500′. It also differs in that, in step 50, the search program 120′ sends the hit word information 500′ and the excluded word candidate list 550′ to the client program 221′, and the client program 221′ uses them to display the result display screen 450′ on the output device 25.
First, the processing of the excluded word candidate extraction program 121′ is described. In the second embodiment as well, when called from the search program 120′, the excluded word candidate extraction program 121′ executes the processing of FIG. 10 described in the first embodiment. Steps 310 to 380 are the same as in the first embodiment, so their description is omitted.
In steps 390 to 410, the excluded word candidate extraction program 121′ creates the excluded word candidate list 550′. In step 390, the excluded word candidate extraction program 121′ first determines the excluded word candidate, in the same way as in the first embodiment. When it records the excluded word candidate in the excluded word candidate list 550′, it also records, in the hit locations 553-3′, the row number of the row in which the excluded word candidate was found. In step 410, for each candidate 553′ in the excluded word candidate list 550′, the excluded word candidate extraction program 121′ records the number of row numbers stored in the hit locations 553-3′ in the number of rows 553-2′. In all other respects the processing is the same as in the first embodiment.
Next, the details of the processing performed when the search program 120′ calls the excluded word candidate sort program 122′ in step 40 are described with reference to FIG. 17. The description below covers the case where only one search word is specified (that is, the case where only one excluded word candidate list 550′ has been generated). In the information search system according to the second embodiment, however, the user may specify multiple search words. In that case, the processing described below is executed for each excluded word candidate list 550′ that has been created.
Step 4010: The excluded word candidate sort program 122′ refers to the excluded word candidate list 550′ and selects one candidate 553′.
Step 4020: The excluded word candidate sort program 122′ refers to the hit locations 553-3′ of the candidate 553′ selected in step 4010 and reads, from the rows of the search target data 31, every row whose row number is recorded in the hit locations 553-3′. It then calculates, for each column value contained in the rows read, the co-occurrence rate between the candidate word 553-1′ and that column value. The definition (calculation method) of the co-occurrence rate is as described above.
The excluded word candidate sort program 122′ then stores the column value with the highest co-occurrence rate and the name of the column to which it belongs in the value 553-42 and the column 553-41 of the candidate 553′, respectively, and stores the co-occurrence rate in the co-occurrence rate 553-43.
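Selecting the column value with the highest co-occurrence rate, as in step 4020, amounts to tallying (column, value) pairs over the rows in which the candidate word appears. The following sketch assumes the rows have already been read via the hit locations 553-3′; the function name, the dictionary-based rows, and the column names are illustrative assumptions.

```python
from collections import Counter


def best_co_occurrence(rows, columns):
    """Return (column name, column value, co-occurrence rate) with the highest rate.

    rows:    the rows in which the candidate word appears (A' = len(rows))
    columns: names of the structured columns to examine (columns 301 to 305)
    """
    if not rows:
        return None
    # B' for every (column, value) pair is simply how often the pair occurs.
    tally = Counter((col, row[col]) for row in rows for col in columns)
    (col, value), b = tally.most_common(1)[0]
    return col, value, b / len(rows)


# Hypothetical rows containing the candidate word "ブログ".
candidate_rows = [
    {"product": "A", "component": "登録"},
    {"product": "B", "component": "登録"},
    {"product": "C", "component": "検索"},
]
best = best_co_occurrence(candidate_rows, ["product", "component"])
```

Here `best` is `("component", "登録", 2/3)`; the three items correspond to the column 553-41, the value 553-42, and the co-occurrence rate 553-43. How ties between equally frequent values should be broken is not specified in the text, so the sketch leaves that to `Counter.most_common`.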
Step 4030: The excluded word candidate sort program 122′ determines whether step 4020 has been executed for every candidate 553′ in the excluded word candidate list 550′. If it has (step 4030: Yes), step 4040 is performed next. If unprocessed candidates 553′ remain (step 4030: No), the excluded word candidate sort program 122′ repeats the processing from step 4010.
Step 4040: In steps 4040 to 4050, the hit word information 500′ is created. In step 4040, the excluded word candidate sort program 122′ copies the contents of the hit location information 500 to the hit word information 500′. Specifically, it copies the search word 501 and the Length 502 of the hit location information 500 to the search word 501 and the Length 502 of the hit word information 500′. The excluded word candidate sort program 122′ then reads into the memory 12 every row of the search target data 31 that is identified by a row number 503-1 in the hit location information 500. Of the rows read, it keeps only those containing a hit word that is not part of an excluded word candidate, and records the number of remaining rows in the number of rows 504′ of the hit word information 500′.
Step 4050: For each column value in the rows kept in step 4040 (the rows containing a hit word that is not part of an excluded word candidate), the excluded word candidate sort program 122′ calculates the co-occurrence rate between the hit word (search word) and the column value. It then stores the column value with the highest co-occurrence rate and the name of the column to which it belongs in the value 505-2 and the column 505-1 of the co-occurrence information 505, respectively, and stores the co-occurrence rate in the co-occurrence rate 505-3.
Step 4060: In steps 4060 to 4080, the candidates 553′ in the excluded word candidate list 550′ are reordered. In step 4060, the excluded word candidate sort program 122′ first reads all the candidates 553′ in the excluded word candidate list 550′. From the candidates 553′ read, it then selects those whose column 553-41 and value 553-42 are the same as the column 505-1 and value 505-2 of the hit word information 500′. It sorts the selected candidates 553′ in descending order of co-occurrence rate 553-43 and stores them from the beginning of the candidate area 553-0 of the excluded word candidate list 550′.
Step 4070: Next, from the candidates 553′ read, the excluded word candidate sort program 122′ selects those whose column 553-41 matches the column 505-1 of the hit word information 500′ but whose value 553-42 differs from the value 505-2. It sorts the candidates 553′ selected here in descending order of co-occurrence rate 553-43 and stores them, in that order, in the candidate area 553-0 after the candidates already stored.
 Step 4080: Finally, the excluded word candidate sort program 122' takes the candidates 553' that were selected in neither step 4060 nor step 4070 and sorts them in descending order of the co-occurrence rate 553-43. It stores the sorted candidates 553', in that order, in the candidate area 553-0 and ends the process. As a result, the candidate area 553-0 of the excluded word candidate list 550' holds the candidates 553' in the following order: those sorted in step 4060, then those sorted in step 4070, then those sorted in step 4080.
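 The three-tier ordering produced by steps 4060 to 4080 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the excluded word candidate sort program 122': the `Candidate` class, the `sort_candidates` function, and the sample words and values are hypothetical, and the fields merely mirror the column 553-41, the value 553-42, and the co-occurrence rate 553-43.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str    # excluded word candidate
    column: str  # mirrors column 553-41
    value: str   # mirrors value 553-42
    rate: float  # mirrors co-occurrence rate 553-43


def sort_candidates(candidates, hit_column, hit_value):
    """Order candidates the way steps 4060 to 4080 do (hypothetical helper)."""
    def tier(c):
        if c.column == hit_column and c.value == hit_value:
            return 0  # step 4060: same column and same value
        if c.column == hit_column:
            return 1  # step 4070: same column, different value
        return 2      # step 4080: all remaining candidates

    # Within each tier, a higher co-occurrence rate comes first.
    return sorted(candidates, key=lambda c: (tier(c), -c.rate))


# Hypothetical sample: the search word co-occurs with Component = Registration.
candidates = [
    Candidate("program", "Category", "Bug", 0.95),
    Candidate("login", "Component", "Authentication", 0.90),
    Candidate("logout", "Component", "Registration", 0.60),
    Candidate("application log", "Component", "Registration", 1.00),
]
ordered = sort_candidates(candidates, "Component", "Registration")
print([c.word for c in ordered])
# ['application log', 'logout', 'login', 'program']
```

 A single composite sort key yields the same order as three separate select, sort, and store passes, because the tier number dominates the key and the co-occurrence rate only breaks ties inside a tier.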
 FIG. 18 shows an example of the result display screen 450' that the information search system according to the second embodiment displays on the output device 25 (display) of the client 2. The viewpoint column 451, the excluded word candidate column 453, and the excluded word number column 454 are the same as those of the result display screen 450 described in the first embodiment. The result display screen 450' of the second embodiment additionally provides a co-occurrence column column 456, a co-occurrence value column 457, and a co-occurrence rate column 458, in which the co-occurrence information 553-4 sent from the search program 120' is displayed.
 Of the information displayed in the co-occurrence column column 456, the co-occurrence value column 457, and the co-occurrence rate column 458, the information displayed at the same height as the viewpoint column 451 pertains to the search word displayed in the viewpoint column 451. In the example of FIG. 18, the column value "Registration", which had a high co-occurrence rate with the search word "log", is displayed in the co-occurrence value column 457; the name of the column to which this value belongs, "Component", is displayed in the co-occurrence column column 456; and the co-occurrence rate between the search word "log" and the column value "Registration" (83%) is displayed in the co-occurrence rate column 458.
 Similarly, of the information displayed in the co-occurrence column column 456, the co-occurrence value column 457, and the co-occurrence rate column 458, the information displayed at the same height as each excluded word candidate (excluded word candidate column 453) pertains to that excluded word candidate. In the example of FIG. 18, the column value "Registration", which had a high co-occurrence rate with the excluded word "application log", is displayed in the co-occurrence value column 457; the name of the column to which this value belongs, "Component", is displayed in the co-occurrence column column 456; and the co-occurrence rate between the excluded word "application log" and the column value "Registration" (100%) is displayed in the co-occurrence rate column 458.
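 As a rough illustration of how a figure such as the 83% above could arise, the following Python sketch computes a co-occurrence rate over tabular rows. It rests on an assumption that is not confirmed by this passage: the rate is taken to be the share of rows containing the word whose column holds the given value. The function name `top_cooccurrence`, the column names, and the sample rows are hypothetical, and the rows were invented so that the result matches the 83% of the example. Unlike step 4050, the sketch does not filter out hits that belong to excluded word candidates.

```python
from collections import Counter


def top_cooccurrence(rows, text_column, word):
    """Return (column, value, rate) for the value that co-occurs most often.

    Assumed definition: rate = (rows containing `word` that hold the value)
    / (all rows containing `word`).
    """
    hit_rows = [r for r in rows if word in r[text_column]]
    if not hit_rows:
        return None
    counts = Counter(
        (col, val)
        for r in hit_rows
        for col, val in r.items()
        if col != text_column
    )
    (col, val), n = counts.most_common(1)[0]
    return col, val, n / len(hit_rows)


# Hypothetical rows: six contain "log", five of those have Component = Registration.
rows = [
    {"Text": "check the log", "Component": "Registration"},
    {"Text": "log is empty", "Component": "Registration"},
    {"Text": "log not written", "Component": "Registration"},
    {"Text": "save the log", "Component": "Registration"},
    {"Text": "log rotation fails", "Component": "Registration"},
    {"Text": "log shows an error", "Component": "Search"},
    {"Text": "screen layout is broken", "Component": "Search"},
]
col, val, rate = top_cooccurrence(rows, "Text", "log")
print(f"{col}={val}: {rate:.0%}")
# Component=Registration: 83%
```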
 Like the client program 221 of the first embodiment, the client program 221' of the second embodiment displays the excluded word candidates in the order in which the candidates 553' are stored in the excluded word candidate list 550'. Through the process described with reference to FIG. 17, the excluded word candidate list 550' stores the excluded word candidates in order, starting with those whose co-occurrence information 553-4 is closest (most similar) to the co-occurrence information 505 of the search word. The result display screen 450' therefore lists the excluded word candidates in that same order.
 This concludes the description of the information search system according to the second embodiment. The information search system according to this embodiment preferentially displays candidate words that resemble the search word and whose tendency to co-occur with the column values of a specific column in the search target data is close to that of the search word. Candidate words with a similar co-occurrence tendency are presumed to be highly related to the search word, so they can be regarded as words that are important to the user (words that need to be considered). The information search system according to this embodiment can display such words preferentially.
 Embodiments of the present invention have been described above, but they are examples given to explain the present invention and are not intended to limit the scope of the present invention to these embodiments alone. That is, the present invention can also be implemented in various other forms.
 For example, the embodiments described above present an information search system in which a search client is provided separately from the search system server and the user uses the input device and output device of the client. However, providing a search client is not essential; the client program may instead be executed on the search system server. In that case, the user issues information search requests using the input device and output device of the search system server. Also, although the storage device and the search system server are separate devices in the embodiments described above, the storage device may be built into the search system server.
 In the embodiments described above, the excluded word candidate extraction program 121 identifies excluded word candidates based on either rule (1) or rule (2). However, the program is not limited to identifying candidates based on rule (1) alone (or rule (2) alone). For example, the excluded word candidate extraction program 121 may perform both the identification process based on rule (1) and the identification process based on rule (2) and present to the user only the candidate words identified by both processes. Alternatively, both the words identified by the process based on rule (1) and the words identified by the process based on rule (2) may be presented to the user.
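 The two ways of combining the rules can be sketched with plain set operations. The function `combine_candidates`, its `mode` parameter, and the sample words are hypothetical; the sketch assumes only that each rule has already produced a set of candidate words.

```python
def combine_candidates(rule1_hits, rule2_hits, mode="both"):
    """Combine the candidate sets found by rule (1) and rule (2).

    mode="both":   keep only words identified by both rules (intersection).
    mode="either": keep words identified by at least one rule (union).
    """
    if mode == "both":
        return rule1_hits & rule2_hits
    if mode == "either":
        return rule1_hits | rule2_hits
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical results of the two identification processes.
rule1_hits = {"application log", "program"}
rule2_hits = {"program", "login"}

print(sorted(combine_candidates(rule1_hits, rule2_hits, "both")))
# ['program']
print(sorted(combine_candidates(rule1_hits, rule2_hits, "either")))
# ['application log', 'login', 'program']
```

 The intersection gives a shorter, more conservative list, while the union surfaces every word that either rule flagged.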
 When the search system server finds many excluded word candidates, all of them may be displayed on the result display screen, or only some of them may be displayed. For example, of the excluded word candidates (candidates 553 or candidates 553') contained in the excluded word candidate list, only the first one may be displayed, or only the first n (for example, n = 3). In the embodiments described above, a program executed on the server (the excluded word candidate sort program) sorts the excluded word candidates, that is, determines their display order; as another embodiment, the client 2 may sort the excluded word candidates and thereby determine their display order.
1: Search system server, 2: Search client, 3: Storage device, 4: LAN

Claims (15)

  1.  An information search method in which a computer executes:
     a step of accepting designation of a search word;
     a step of identifying a position in search target data at which the search word is included;
     a step of identifying an adjacent character string, which is a character string adjacent to, before or after, the position at which the search word is included;
     a step of determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word that should be excluded from the search target; and
     a step of determining, when the adjacent character string has been determined to be a character string constituting an excluded word candidate, a word obtained by concatenating the adjacent character string and the search word to be an excluded word candidate.
  2.  The information search method according to claim 1, wherein
     the characteristic is the character type of the adjacent character string, and
     the determining step determines that the adjacent character string is a character string constituting an excluded word candidate when the adjacent character string is of the same character type as the search word.
  3.  The information search method according to claim 1, wherein
     the characteristic is the appearance frequency of the adjacent character string in the search target data, and
     the determining step determines that the adjacent character string is a character string constituting an excluded word candidate when the ratio of, or the difference between, the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range.
  4.  The information search method according to claim 1, wherein the computer further executes:
     a step of presenting to a user the number of occurrences of the search word included in the search target data and the one or more excluded word candidates determined in the determining step;
     a step of having the user select an excluded word from among the one or more excluded word candidates that were output; and
     a step of presenting to the user a number obtained by subtracting the number of occurrences of the selected excluded word from the number of occurrences of the search word included in the search target data.
  5.  The information search method according to claim 4, wherein
     the search target data is tabular data having a plurality of rows and a plurality of columns, and
     the step of identifying the position at which the search word is included identifies that position by searching text stored in a first column among the plurality of columns.
  6.  The information search method according to claim 5, wherein the step of presenting the excluded word candidates to the user includes:
     a step of calculating, for each column value included in a column other than the first column, a degree of co-occurrence between the search word and the column value;
     a step of calculating a degree of co-occurrence between the excluded word candidate and the column value; and
     a step of determining a priority order of the excluded word candidates to be presented to the user, based on the degree of co-occurrence between the search word and the column value and the degree of co-occurrence between the excluded word candidate and the column value.
  7.  A computer-readable storage medium storing a program that causes a computer to execute:
     a step of accepting designation of a search word;
     a step of identifying a position in search target data at which the search word is included;
     a step of identifying an adjacent character string, which is a character string adjacent to, before or after, the position at which the search word is included;
     a step of determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word that should be excluded from the search target; and
     a step of determining, when the adjacent character string has been determined to be a character string constituting an excluded word candidate, a word obtained by concatenating the adjacent character string and the search word to be an excluded word candidate.
  8.  The computer-readable storage medium according to claim 7, wherein
     the characteristic is the character type of the adjacent character string, and
     the determining step is a step of determining that the adjacent character string is a character string constituting an excluded word candidate when the adjacent character string is of the same character type as the search word.
  9.  The computer-readable storage medium according to claim 7, wherein
     the characteristic is the appearance frequency of the adjacent character string in the search target data, and
     the determining step is a step of determining that the adjacent character string is a character string constituting an excluded word candidate when the ratio of, or the difference between, the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range.
  10.  The computer-readable storage medium according to claim 7, wherein
     the search target data is tabular data having a plurality of rows and a plurality of columns, and
     the step of identifying the position at which the search word is included identifies that position by searching text stored in a first column among the plurality of columns.
  11.  The computer-readable storage medium according to claim 10, wherein the computer further executes:
     a step of calculating, for each column value included in a column other than the first column, a degree of co-occurrence between the search word and the column value;
     a step of calculating a degree of co-occurrence between the excluded word candidate and the column value;
     a step of determining a priority order of the excluded word candidates to be presented to a user, based on the degree of co-occurrence between the search word and the column value and the degree of co-occurrence between the excluded word candidate and the column value; and
     a step of presenting to the user the number of occurrences of the search word included in the search target data and the excluded word candidates to be presented to the user.
  12.  An information search system that executes:
     a step of accepting designation of a search word;
     a step of identifying a position in search target data at which the search word is included;
     a step of identifying an adjacent character string, which is a character string adjacent to, before or after, the position at which the search word is included;
     a step of determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word that should be excluded from the search target; and
     a step of determining, when the adjacent character string has been determined to be a character string constituting an excluded word candidate, a word obtained by concatenating the adjacent character string and the search word to be an excluded word candidate.
  13.  The information search system according to claim 12, wherein
     the characteristic is the character type of the adjacent character string, and
     the determining step is a step of determining that the adjacent character string is a character string constituting an excluded word candidate when the adjacent character string is of the same character type as the search word.
  14.  The information search system according to claim 12, wherein
     the characteristic is the appearance frequency of the adjacent character string in the search target data, and
     the determining step is a step of determining that the adjacent character string is a character string constituting an excluded word candidate when the ratio of, or the difference between, the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range.
  15.  The information search system according to claim 12, wherein
     the search target data is tabular data having a plurality of rows and a plurality of columns,
     the step of identifying the position at which the search word is included is a step of identifying that position by searching text stored in a first column among the plurality of columns, and
     the information search system further executes:
     a step of calculating, for each column value included in a column other than the first column, a degree of co-occurrence between the search word and the column value;
     a step of calculating a degree of co-occurrence between the excluded word candidate and the column value;
     a step of determining a priority order of the excluded word candidates to be presented to a user, based on the degree of co-occurrence between the search word and the column value and the degree of co-occurrence between the excluded word candidate and the column value; and
     a step of presenting to the user the number of occurrences of the search word included in the search target data and the excluded word candidates to be presented to the user.
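
The flow recited in claims 1, 2, and 4 can be illustrated with a short Python sketch. It is one possible reading under stated assumptions, not the claimed implementation: character types are approximated from Unicode character names, the search word is assumed to consist of a single character type, the helper names `char_type` and `excluded_word_candidates` are hypothetical, and the sample text is invented.

```python
import unicodedata
from collections import Counter


def char_type(ch):
    """Rough character type of `ch`, taken from its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix in ("KATAKANA", "HIRAGANA", "CJK", "LATIN"):
        if name.startswith(prefix):
            return prefix
    return "OTHER"


def excluded_word_candidates(text, term):
    """Count words formed by joining `term` with same-type neighbours."""
    term_type = char_type(term[0])  # assumes a single character type
    found = Counter()
    start = text.find(term)
    while start != -1:
        left, right = start, start + len(term)
        # Extend over adjacent characters of the same type as the term.
        while left > 0 and char_type(text[left - 1]) == term_type:
            left -= 1
        while right < len(text) and char_type(text[right]) == term_type:
            right += 1
        if text[left:right] != term:
            found[text[left:right]] += 1
        start = text.find(term, right)
    return found


text = "アプリケーションログを確認し、プログラムのログを保存する。"
candidates = excluded_word_candidates(text, "ログ")
total_hits = text.count("ログ")
# Hit count left after the user excludes every candidate (claim 4).
remaining = total_hits - sum(candidates.values())
print(dict(candidates), total_hits, remaining)
```

In this sample, the katakana runs "アプリケーションログ" and "プログラム" become excluded word candidates, and only the standalone "ログ" remains once both are excluded.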
PCT/JP2016/051566 2016-01-20 2016-01-20 Information search method WO2017126057A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/051566 WO2017126057A1 (en) 2016-01-20 2016-01-20 Information search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/051566 WO2017126057A1 (en) 2016-01-20 2016-01-20 Information search method

Publications (1)

Publication Number Publication Date
WO2017126057A1 true WO2017126057A1 (en) 2017-07-27

Family

ID=59362178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/051566 WO2017126057A1 (en) 2016-01-20 2016-01-20 Information search method

Country Status (1)

Country Link
WO (1) WO2017126057A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06259481A (en) * 1993-03-03 1994-09-16 Hitachi Ltd Character string collating method and device equipped with same character classification longest matching collating function
JP2001034623A (en) * 1999-07-19 2001-02-09 Matsushita Electric Ind Co Ltd Information retrievel method and information reteraval device
JP2002269139A (en) * 2001-03-08 2002-09-20 Ricoh Co Ltd Method for retrieving document
JP2006172372A (en) * 2004-12-20 2006-06-29 Dainippon Printing Co Ltd Retrieval device and method
JP2010250389A (en) * 2009-04-10 2010-11-04 Internatl Business Mach Corp <Ibm> Information retrieval system, method and program, and index generation system, method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16886299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16886299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP