WO2017126057A1

WO2017126057A1 - Information search method

Info

Publication number: WO2017126057A1
Application number: PCT/JP2016/051566
Authority: WO
Inventors: 岐勇飯島; 敦畠山; 翔太葛西; 裕介水藤
Original assignee: 株式会社日立製作所
Priority date: 2016-01-20
Filing date: 2016-01-20
Publication date: 2017-07-27

Abstract

In an information search system according to an embodiment of the present invention, upon receiving the specification of a search term, the position at which the specified search term appears and the character strings appearing before and after that position (adjacent character strings) are identified in the data being searched. On the basis of the characteristics of the adjacent character strings, an assessment is made as to whether the adjacent character strings constitute terms to be excluded from the subject of the search (excluded terms). If it is determined that the adjacent character strings constitute excluded terms, a term in which those character strings are conjoined with the search term is determined to be an excluded term candidate.

Description

Information retrieval method

The present invention relates to an information search method.

Conventionally, there are search technologies for finding a search target character string (hereinafter referred to as “search word”) in search target documents. The mainstream approach in many conventional search techniques has been to create an index of the target documents in advance and to find the documents requested by a user by referring to that index when a search request is made. With this approach, however, information cannot be searched unless the documents have been indexed in advance. For this reason, full-text search technologies, which match the search word against the search target documents at the time a search request from a user is accepted, have also appeared.

One problem with full-text search of search target documents is that words irrelevant to the search word (hereinafter referred to as “search noise”) are included in the search results. For example, when a full-text search is performed using “log” (ログ in Japanese) as the search word, a word such as “program” (プログラム), which contains the character string “log” but is unrelated to it, is also returned. This problem is particularly common when searching documents written in a language such as Japanese, in which words are not separated by spaces.

As a technique for removing search noise, Patent Literature 1 discloses a technique that uses a word identification dictionary together with a dictionary of words that contain the search word character string as a partial character string (hereinafter, such a word is referred to as an “extended word”), and removes extended words from the search result by using this extended word dictionary. For example, assume that “log” is designated as the search word character string and that an extended word containing “log” as a partial character string (for example, “program”) is registered in the extended word dictionary. In this case, occurrences of “program” in the search target document are excluded from the reported search result, so the search noise can be reduced.

Patent Literature 1: Japanese Patent Laid-Open No. 11-73429

In the technology disclosed in Patent Literature 1, an extended word dictionary must be prepared in advance in order to remove noise from the search result. Consequently, when the search target document contains a word that is not registered in the extended word dictionary, that word cannot be removed as noise. In addition, maintaining an extended word dictionary is costly.

In the information search method of the present invention, when the specification of a search word is received, the position at which the specified search word appears and the character string adjacent to that position (adjacent character string) are identified in the search target data. Then, based on the characteristics of the adjacent character string, it is determined whether the adjacent character string is a character string constituting a word to be excluded from the search target (excluded word). When it is so determined, the word obtained by connecting the character string and the search word is determined to be an excluded word candidate.

According to the present invention, search noise can be reduced even when the person searching or analyzing information has no prior knowledge about the data to be searched or analyzed.

A block diagram of the search system.
An example of search target data.
An example of the input screen.
An example of the result display screen.
A flowchart of the search process.
A diagram showing the format of the hit location information.
A diagram showing the format of the excluded word candidate list.
A diagram explaining the concept of rule (1).
A diagram explaining the concept of rule (2).
A flowchart of the excluded word candidate extraction process.
An explanatory diagram of the front pointer and the back pointer.
A diagram explaining the concept of rule (1).
A diagram showing the format of the appearance frequency information.
An example of the search target data used in Example 2.
A diagram showing the format of the excluded word candidate list in Example 2.
A diagram showing the format of the hit word information in Example 2.
FIG. 12 is a flowchart of the excluded word candidate sorting process according to the second embodiment.
An example of the result display screen in Example 2.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The embodiments described below do not limit the invention according to the claims, and not all of the elements and combinations described in the embodiments are necessarily essential to the solution provided by the invention.

In the following description, processing executed by a computer such as a host may be described using a “program” as the subject. In reality, the processing described in a program is carried out when the CPU (Central Processing Unit) executes the program, so the actual subject of the processing is the processor (CPU); however, to avoid redundant explanation, the processing may be described with the program as the subject. Further, part or all of a program may be realized by dedicated hardware. The various programs described below may be provided by a program distribution server or on a computer-readable storage medium, and may be installed in each device that executes them. The computer-readable storage medium is a non-transitory computer-readable medium, for example a non-volatile storage medium such as an IC card, an SD card, or a DVD.

The configuration of the information search system according to the first embodiment of the present invention will be described below. FIG. 1 is a diagram illustrating a hardware configuration of the information search system. The information search system includes a search system server 1 (hereinafter abbreviated as “server 1”), a search client 2 (hereinafter abbreviated as “client 2”), and a storage device 3. The server 1 and the client 2 are connected so as to be able to communicate with each other via a local area network (LAN) 4 configured using, for example, Ethernet. The server 1 is connected to the storage device 3 via a network 5 (also referred to as SAN 5) configured using, for example, Fibre Channel.

The server 1 is a computer that performs search processing based on an information search request received from a user of the information search system (hereinafter referred to as “user”). The server 1 includes a CPU 11, a main memory (hereinafter referred to as “memory”) 12, a network port 13 for connecting to the LAN 4, an input device 14, an output device 15, and a storage port 16. The memory 12 is a storage device such as a DRAM, and is used to store the programs executed by the CPU 11 and the control information used when those programs are executed. The CPU 11 is a component that executes the programs for search processing. The input device 14 is a device used when the user inputs information, such as a keyboard or a mouse. The output device 15 is a display or printing device such as a display or a printer. In this embodiment, the output device 15 is a display. The storage port 16 is an interface for connecting the server 1 and the storage device 3.

The client 2 is a computer that is used by the user to instruct the server 1 to search for information or to receive an output of a search result from the server 1. The client 2 includes a CPU 21, a main memory 22, a network port 23 for connecting to the LAN 4, an input device 24, and an output device (display) 25. The CPU 21, the memory 22, the network port 23, the input device 24, and the output device 25 are the same as the CPU 11, the memory 12, the network port 13, the input device 14, and the output device 15 of the server 1, respectively. In addition to the main memory 22, the client 2 may include an auxiliary storage device such as a magnetic disk.

The storage device 3 is a device having a nonvolatile storage device such as a magnetic disk, and is a device for storing information to be searched by the user. In the present embodiment, information to be searched by the user is referred to as “search target data” (“search target data 31” in the figure). The storage device 3 may be a device having a plurality of nonvolatile storage devices such as a so-called disk array (or RAID). The storage device 3 is connected to the storage port 16 of the server 1 via the SAN 5.

The memory 12 of the server 1 stores a program executed by the server 1 and control information used by the program. Examples of programs executed by the server 1 include a search program 120, an excluded word candidate extraction program 121, and an excluded word candidate sort program 122. When the server 1 receives an information search instruction from the user, these programs are executed by the CPU 11. Note that these programs are stored in the storage device 3 when the search process is not performed, and are read from the storage device 3 onto the memory 12 when the search process is performed. Control information used by these programs includes hit location information 500, an excluded word candidate list 550, and appearance frequency information 600. Details of these will be described later. The server 1 may store in the memory 12 programs other than the programs described above and information other than the control information described above.

The client program 221 exists in the main memory 22 of the client 2, and the CPU 21 executes the client program 221. The client program 221 is a program that provides a GUI (Graphical User Interface) for a user to issue an information search instruction.

Next, the format of the search target data will be described with reference to FIG. As an example, the search target data 31 targeted in the information search process according to the present embodiment is tabular data composed of a plurality of rows and a plurality of columns, as shown in FIG. The search target data 31 has a plurality of columns 301 to 306. The top row of the search target data 31 holds the column name of each of the columns 301 to 306. In this embodiment, the columns 301 to 305 store fixed-form information (structured data) such as names and dates. The column 306 stores text written in a natural language; the information stored in the column 306 is free-form data (unstructured data). The information to be searched in the information search system according to the present embodiment is the unstructured data stored in the column 306. The columns 301 to 305 may store information expressed as numerical values, such as dates, or information other than numerical values, such as names. In each embodiment described below, however, the information stored in the columns 301 to 305 is referred to as a “value” even when it is character information other than a numerical value (for example, when a character string such as “search” or “registration” is recorded in the column 304).

Although an example in which structured data is stored in the columns 301 to 305 will be described here, unstructured data may be stored in the columns 301 to 305. In this embodiment, an example in which the search target data is tabular data as shown in FIG. 2 will be described. However, the search target data is not limited to tabular data. For example, data described in XML or the like may be used, or even a single text file or a collection of a plurality of text files may be a search target in the information search system according to the present embodiment.

FIG. 3 shows an example of a search word input screen provided to the user by the information search system. An input screen 400 in FIG. 3 is a screen that the client program 221 outputs (displays) to the output device 25 of the client 2, and the user can input a search term in the input field 401. As shown in FIG. 3, a plurality of search terms may be entered in the input field 401. When the user presses the button 402 using the mouse or keyboard that is the input device 24, the client program 221 notifies the server 1 of the search term input in the input field 401 and causes the server 1 to perform a search.

In the first embodiment, the information search system counts how many times each search term input in the input field 401 appears in the unstructured text stored in the column 306 of the search target data 31. FIG. 4 shows an example of the result display screen 450. A search term specified by the user on the input screen 400 is displayed in the viewpoint column 451, and the numerical value 452 displayed below the viewpoint column 451 (referred to as the number of cases 452) is the number of occurrences in the search target data 31 of that search term (the word displayed in the viewpoint column 451). When three search terms are input as in the example of FIG. 3, the result display screen 450 displays the number of cases 452 for each of the three search terms.

In addition, a column 453 on the right side of the viewpoint column 451 is a column for displaying excluded word candidates. In this embodiment, a word (character string) that contains the search word and has been found in the search target data 31 by the information search system (specifically, the server 1) is called an excluded word candidate. In the example of FIG. 4, “log” is displayed in one of the viewpoint columns 451. When the server 1 finds a word containing “log” in the search target data 31 (“program”, “blog”, and “application log” in the example of FIG. 4), it extracts that word as an excluded word candidate. At the same time, the server 1 counts the number of occurrences of each excluded word candidate in the search target data 31. The excluded word candidates and their counts are sent to the client program 221, which displays the excluded word candidates in the excluded word candidate column 453 and their counts in the excluded word count column 454 on the right side of the excluded word candidate column 453. The value displayed as the number of cases 452 at this time includes the counts of the excluded word candidates.

A check box 455 is displayed on the right side of the excluded word count column 454. One check box 455 exists for each excluded word candidate. When the user turns on the check box 455 of a certain excluded word candidate using the mouse or the like and presses the Refresh button 460, the word displayed in the excluded word candidate column 453 of the row whose check box 455 is turned on is excluded from the search. As a result, the count of that excluded word candidate (the number displayed in the excluded word count column 454 of the row whose check box 455 is turned on) is subtracted from the value displayed as the number of cases 452. In this embodiment, among the excluded word candidates, those excluded from the search target, that is, those whose check box 455 has been turned on by the user, are referred to as “excluded words”.

Subsequently, the flow of processing executed by the information search system will be described with reference to FIG. When the user inputs a search word using the input screen 400 and presses the button 402, a search instruction is transmitted from the client program 221 to the search program 120, and the processing of FIG. 5 is started accordingly. In FIG. 5 and the subsequent figures, the letter “S” prefixed to a reference number means “step”. In the following description, an example in which only one search word (a word input in the input field 401) is designated will be described unless otherwise specified. However, as described above, the information search system according to the present embodiment is configured to accept a plurality of search terms.

Step 10: The search program 120 reads the search target data 31 and determines whether the search target data 31 (column 306) includes a word that matches the search word. In the present embodiment, the search program 120 does not need to read all of the search target data 31; it only needs to read the text stored in the column 306. In the present embodiment, a word that is included in the search target data 31 and matches the search word is referred to as a “hit word”. The search word and the hit word are naturally the same string; the expression “hit word” is used to refer not to the word specified by the user (the search word) but to the matching character string that exists in the search target data 31, together with its position.

When there is a hit word in the search target data 31 (column 306), the search program 120 records in the memory 12 information on the position where the hit word exists (this is referred to as a “hit location”). An example of hit location information (hereinafter referred to as “hit location information 500”) stored in the memory 12 is shown in FIG. The hit location information 500 is created for each search term.

The search term specified from the client program 221 is stored in the search term 501, and the length (number of characters) of the search term is stored in the Length 502. After the Length 502, one or more pieces of information about the positions where the hit word exists (hit locations 503) are stored. Each hit location 503 is composed of two pieces of information, a row number 503-1 and an offset 503-2. The row number 503-1 is the number stored in the column 301 of the search target data 31, and the offset 503-2 represents the position from the head of the text stored in the column 306 of the search target data 31.

For example, if the search term is “log” and the text “in this program...” is stored in the column 306 of the first row of the search target data 31, “log” is present at the fourth byte from the beginning of the column 306 in this row. In this case, the search program 120 creates a hit location 503 in which “1” is stored in the row number 503-1 and “4” is stored in the offset 503-2, and stores it in the memory 12. The hit location information 500 shown in FIG. 6 is only an example; the format of the hit location information may be arbitrary as long as information that uniquely identifies the position of each hit word is recorded in the memory 12. The frequency 504 stores the number of times that the search word 501 appears in the search target data 31. This is equal to the number of hit locations 503.
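The hit location information 500 described above can be pictured with a minimal Python sketch. The class and function names (`HitLocation`, `HitLocationInfo`, `find_hits`) are hypothetical and do not come from the specification, the sample rows are invented for illustration, and the offset is counted here as a 1-based character position, whereas the description speaks of bytes.

```python
from dataclasses import dataclass, field


@dataclass
class HitLocation:
    row_number: int  # corresponds to row number 503-1
    offset: int      # corresponds to offset 503-2 (assumed: 1-based character position)


@dataclass
class HitLocationInfo:
    search_term: str                                   # search term 501
    length: int                                        # Length 502
    hit_locations: list = field(default_factory=list)  # hit locations 503

    @property
    def frequency(self) -> int:
        # frequency 504 equals the number of hit locations 503
        return len(self.hit_locations)


def find_hits(search_term, rows):
    """Scan (row_number, text) pairs taken from column 306 and record every match."""
    info = HitLocationInfo(search_term, len(search_term))
    for row_number, text in rows:
        start = text.find(search_term)
        while start != -1:
            info.hit_locations.append(HitLocation(row_number, start + 1))
            start = text.find(search_term, start + 1)
    return info


# Hypothetical sample data: row 1 holds "in this program..." in Japanese.
rows = [(1, "このプログラムで"), (2, "ブログとログ")]
info = find_hits("ログ", rows)
print(info.frequency, info.hit_locations[0])
```

Deriving the frequency from the list of hit locations keeps the two values consistent, which matches the statement that the frequency 504 equals the number of hit locations 503.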

Step 20: As a result of Step 10, when the hit location information 500 is not generated (that is, when the search target data 31 does not include a search word), the search program 120 ends. When the hit location information 500 is generated (step 20: Yes), the search program 120 next executes step 30.

Step 30: The search program 120 calls the excluded word candidate extraction program 121 to specify excluded word candidates. The excluded word candidate extraction program 121 called from the search program 120 identifies excluded word candidates using the hit location information 500. At this time, the excluded word candidate extraction program 121 counts the number of times each excluded word candidate appears in the search target data 31 (number of appearances), creates an excluded word candidate list 550, which is information recording each excluded word candidate and its number of appearances, and records the list in the memory 12.

The format of the excluded word candidate list 550 stored in the memory 12 is shown in FIG. The excluded word candidate list 550 includes a search word 551, a Length 552, and candidates 553. The search word 551 and the Length 552 are, respectively, the search term specified from the client program 221 and the length of that search term.

The candidate 553 is information composed of a candidate word 553-1 and a number of appearances 553-2. When a plurality of excluded word candidates are found, a plurality of candidates 553 exist in the excluded word candidate list 550. The candidate word 553-1 is an excluded word candidate specified by the search program 120 (more precisely, by the excluded word candidate extraction program 121), and the number of appearances 553-2 is the number of times that the candidate word 553-1 appears in the search target data 31. The contents of the processing of the excluded word candidate extraction program 121 will be described later.
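As a rough illustration of how the excluded word candidate list 550 might be assembled, the following Python sketch counts the words found around each hit location. The names (`Candidate`, `ExcludedWordCandidateList`, `build_candidate_list`) and the input list are hypothetical. The sketch also assumes that a hit whose surrounding word is exactly the search word is not a candidate, which the description does not state explicitly.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    candidate_word: str  # candidate word 553-1
    appearances: int     # number of appearances 553-2


@dataclass
class ExcludedWordCandidateList:
    search_term: str  # search word 551
    length: int       # Length 552
    candidates: list  # candidates 553


def build_candidate_list(search_term, words_at_hits):
    """words_at_hits: the word found around each hit location (assumed to be given)."""
    # Assumption: a word identical to the search term is a true hit, not a candidate.
    counts = Counter(word for word in words_at_hits if word != search_term)
    candidates = [Candidate(word, n) for word, n in counts.items()]
    return ExcludedWordCandidateList(search_term, len(search_term), candidates)


# Hypothetical words surrounding four hits of "log" (ログ).
result = build_candidate_list("ログ", ["プログラム", "ログ", "ブログ", "プログラム"])
for candidate in result.candidates:
    print(candidate.candidate_word, candidate.appearances)
```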

Step 40: The search program 120 calls the excluded word candidate sorting program 122 to rearrange the excluded word candidates specified in step 30. The called excluded word candidate sorting program 122 sorts each candidate 553 when there are a plurality of candidates 553 in the excluded word candidate list 550. However, in the first embodiment, sorting is not an essential process. Therefore, in the information search system according to the first embodiment, the search program 120 may not execute step 40.

The sorting method is arbitrary. For example, the candidates 553 may be sorted in descending order of the number of appearances 553-2. In this case, when step 40 is executed, candidates 553 are stored in the excluded word candidate list 550 in descending order of appearance count 553-2.
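The descending-order sort mentioned above could look like the following Python sketch. The function name `sort_candidates` and the counts are invented for illustration, and candidates are represented simply as (candidate word, number of appearances) pairs.

```python
def sort_candidates(candidates):
    """Return (candidate_word, appearances) pairs in descending order of appearances."""
    # sorted() is stable, so candidates with equal counts keep their original order.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts for the candidates shown on the result display screen.
candidates = [("blog", 30), ("program", 120), ("application log", 45)]
print(sort_candidates(candidates))
```

Other orderings are equally permissible, since the description leaves the sorting method open.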

Step 50: The search program 120 causes the client program 221 to display a result display screen 450. For this purpose, the search program 120 transmits the search word 501 and the frequency 504 included in the hit location information 500 and the excluded word candidate list 550 to the client program 221. When there are a plurality of search words (for example, n), a plurality (n sets) of sets of the search word 501, the frequency 504, and the excluded word candidate list 550 are transmitted.

The client program 221 that has received the information from the search program 120 displays the result display screen 450 of FIG. 4 on the output device 25 and waits for an instruction from the user. When a plurality of excluded word candidates (candidates 553) are sent for one search word, the client program 221 displays the excluded word candidates in the order in which the candidates 553 are stored in the excluded word candidate list 550. If the candidates 553 in the excluded word candidate list 550 were sorted in step 40 in descending order of the number of appearances (number of appearances 553-2), the client program 221 displays the excluded word candidates in order from the most frequent.

Here, when the user turns on the check box 455 for each word that is to be excluded from the search count among the words displayed in the excluded word candidate column 453 and presses the Refresh button 460, the client program 221 notifies the search program 120 of the excluded word candidates whose check boxes 455 are turned on.

Step 60: The search program 120 subtracts the number of appearances of the excluded word candidates notified from the client program 221 from the number of hits, and returns the result (the search word 501 and the frequency 504 included in the hit location information 500, and the excluded word candidate list 550) to the client program 221. Upon receiving the notification, the client program 221 redisplays the result display screen 450. The redisplay is performed in the same manner as the processing performed in step 50.
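The recount performed when the Refresh button 460 is pressed amounts to subtracting the counts of the checked candidates from the number of cases 452. A small Python sketch of that arithmetic follows; the function name `recount_hits` and all numbers are hypothetical.

```python
def recount_hits(frequency, candidate_counts, excluded_words):
    """Subtract the appearances of the checked (excluded) words from the hit count."""
    excluded_total = sum(candidate_counts[word] for word in excluded_words)
    return frequency - excluded_total


# Hypothetical figures: 200 hits for "log", of which some belong to other words.
candidate_counts = {"program": 120, "blog": 30, "application log": 45}
remaining = recount_hits(200, candidate_counts, ["program", "blog"])
print(remaining)  # 200 - (120 + 30) = 50
```

Leaving "application log" unchecked keeps its 45 occurrences in the count, which reflects that the user, not the system, makes the final decision on each candidate.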

Step 70: When the user turns on the check box 455 for the excluded word candidate and presses the Refresh button 460 on the result display screen 450 re-displayed in Step 60, the processing in Step 60 is performed again. If the user does not re-specify the check box 455 for the candidate exclusion word, the search process ends.

Next, the processing of the excluded word candidate extraction program 121 performed in step 30, that is, the method of specifying excluded word candidates, will be described. Based on a predetermined rule, the excluded word candidate extraction program 121 according to the present embodiment determines whether the characters (or character strings) adjacent to a hit word in the search target data 31 are characters (or character strings) that connect to the hit word and constitute an excluded word candidate. Hereinafter, characters (or character strings) adjacent to the hit word are referred to as “adjacent character strings”. An adjacent character string may be one character, or two or more characters. Of the adjacent character strings, the characters (or character strings) that connect to the hit word and constitute an excluded word candidate are referred to as “excluded character strings”.

In this embodiment, two types of rules will be described as the predetermined rule, but other rules may be used for specifying an excluded word candidate. Hereinafter, two types of rules will be described.

(1) Excluded word candidate identification rule based on character type

In this rule, whether or not an adjacent character string is an excluded character string is determined based on the character type of the adjacent character string. The character type here means katakana, hiragana, kanji, and the like. Hereinafter, this rule is referred to as “rule (1)”. The excluded word candidate extraction program 121 can identify the character type of an adjacent character string or a hit word from its character code. In rule (1), the determination of whether or not a character belongs to an excluded character string is made character by character. In the following description, the character being examined is referred to as the “determination target character”.

First, when the determination target character is a character indicating a delimiter (hereinafter referred to as a “delimiter character”), it is determined that the determination target character is not an excluded character string. Delimiter characters include punctuation marks, spaces, line feeds, tabs, colons, semicolons, and parentheses.

When the determination target character is not a delimiter character, whether or not it is an excluded character string is determined by comparing it with the character adjacent to it. If the determination target character is of the same character type as the adjacent character, the determination target character is determined to be an excluded character string. This is because, when both character types are the same, the determination target character is highly likely to belong to the same word as the adjacent character. Specifically, when both the determination target character and the adjacent character are katakana (or both hiragana, or both kanji), the determination target character is determined to be an excluded character string. In addition, when the adjacent character is a kanji character, the determination target character is determined to be an excluded character string if it is katakana, hiragana, or kanji.

Conversely, when the determination target character is a character type different from the character adjacent to the determination target character, it is determined that the adjacent character string is not an excluded character string. Specifically, when the determination target character is hiragana and the character adjacent to the determination target character is katakana (or vice versa), the adjacent character string is not an excluded character string.

Referring to FIG. 8, an example of the excluded word candidate identification method based on rule (1) will be described. The example in FIG. 8 assumes a case where the search term is “log” (ログ) and the search target data 31 includes the character string “in this program” (このプログラムで). In this example, the characters in front of the hit word “log” are, in order of proximity, “p” (プ), “no” (の), and so on. Since “p” has the same character type (katakana) as the hit word “log”, it is determined to be an excluded character string. On the other hand, “no”, the character in front of “p”, is of a different character type (hiragana) from “p”, the character adjacent to it, and is therefore determined not to be an excluded character string.

The same determination is performed for the characters connected behind the hit word “log” (“ra” (ラ), “m” (ム), “de” (で), and so on). Among these, “ra” and “m” are of the same character type (katakana) as the hit word, so they are determined to be excluded character strings. Since “de”, the character behind “m”, is of a different character type (hiragana) from “m”, the character adjacent to it, it is determined not to be an excluded character string.
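Rule (1) as described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`char_type`, `is_excluded_char`, `extend_hit`) are hypothetical, the character type is derived from Unicode code point ranges (the specification only says the type is identified from the character code), the delimiter set is a partial guess, and characters outside hiragana, katakana, and kanji are never treated as excluded characters.

```python
# Assumed delimiter set: punctuation, spaces, line feeds, tabs, colons, semicolons, parentheses.
DELIMITERS = set("。、.,:;()（）\u3000 \n\t")


def char_type(ch):
    """Classify a character by Unicode block (an assumption of this sketch)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def is_excluded_char(target, neighbor):
    """Decide whether `target` joins the word that contains `neighbor`."""
    if target in DELIMITERS:
        return False
    target_type, neighbor_type = char_type(target), char_type(neighbor)
    if target_type == "other":
        return False  # assumption: other scripts are not handled by this sketch
    if target_type == neighbor_type:
        return True
    # A kanji neighbor accepts katakana, hiragana, or kanji.
    return neighbor_type == "kanji"


def extend_hit(text, start, length):
    """Grow the hit word at text[start:start + length] forward and backward."""
    front = start
    while front > 0 and is_excluded_char(text[front - 1], text[front]):
        front -= 1
    back = start + length
    while back < len(text) and is_excluded_char(text[back], text[back - 1]):
        back += 1
    return text[front:back]


# "log" (ログ) starts at 0-based index 3 of "in this program" (このプログラムで).
print(extend_hit("このプログラムで", 3, 2))
```

If the returned string equals the hit word itself, no adjacent character was accepted and the hit yields no excluded word candidate.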

(2) Excluded word identification rule using the 2-gram (bi-gram) method

In this rule, whether or not an adjacent character string is an excluded character string is determined based on the result of comparing the appearance frequency of the search word with the appearance frequency of the adjacent character string. Hereinafter, this rule is referred to as “rule (2)”. An example of an excluded word candidate specifying method using rule (2) will be described with reference to FIG.

As in FIG. 8, the example of FIG. 9 also assumes that the search term is “log” and the search target data 31 includes the character string “in this program”. FIG. 9 shows an example in which the character string “log”, the hit word matching the search word, appears in the search target data 31 12,000 times (its appearance frequency). Under rule (2), the appearance frequency in the search target data 31 of each character string adjacent to the hit word is obtained first. The length of the adjacent character string may be arbitrary, but in this embodiment an example in which the length of the adjacent character string is 2 will be described.

In rule (2), as when creating a so-called bi-gram index, the appearance frequency in the search target data 31 is obtained for each two-character string. In the example of FIG. 9, “pro” (プロ), the two-character string located in front of the hit word “log”, appears in the search target data 31 10,000 times, and “gra” (グラ) and “ram” (ラム), the two-character strings located behind “log”, each appear 8,000 times. That is, the numbers of appearances of the character strings “pro”, “gra”, and “ram” are all roughly the same as the number of appearances of the search term “log” (12,000 times). In rule (2), when the number of appearances of a two-character string is approximately the same as the number of appearances of the search word, that two-character string (“pro”, “gra”, and “ram” in the example of FIG. 9) is determined to be an excluded character string. This is because a character string (adjacent character string) that appears at a frequency close to the appearance frequency of the search word is highly likely to be concatenated with the search word (hit word) to form one word.

On the other hand, the two-character string “no” located in front of “pro” appears 800 times, and the two-character string “mu” located behind “ram” appears 100 times; both differ greatly from the number of appearances of the search word “log” (12,000 times). For this reason, rule (2) determines that the two-character strings “no” and “mu” are not excluded character strings. Similarly, a character string positioned in front of “no” (such as “this”) and a character string positioned behind “mu” (such as “in”) are also determined not to be excluded character strings. There are various methods for determining whether the number of appearances of a two-character string and that of the search word are similar. For example, one method checks whether the absolute value of the difference between the two is less than a predetermined value, and another checks whether the ratio between the two is close to 1.
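The two similarity tests mentioned above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the ratio bounds 0.5 and 2 follow the range that the description later gives as an example, and the difference threshold of 5,000 is an arbitrary assumed value.

```python
def is_similar_by_ratio(ngram_count, hit_count, low=0.5, high=2.0):
    """True when ngram_count / hit_count lies within [low, high].

    low and high are assumed bounds; the description only requires
    that the ratio be "close to 1".
    """
    if hit_count <= 0:
        return False
    return low <= ngram_count / hit_count <= high


def is_similar_by_difference(ngram_count, hit_count, max_diff):
    """True when the absolute difference is below max_diff (assumed threshold)."""
    return abs(ngram_count - hit_count) < max_diff


# Counts taken from the FIG. 9 example in the text.
HIT_COUNT = 12000
for count in (10000, 8000, 800, 100):
    print(count,
          is_similar_by_ratio(count, HIT_COUNT),
          is_similar_by_difference(count, HIT_COUNT, max_diff=5000))
```

With the counts of FIG. 9, both tests accept 10,000 and 8,000 and reject 800 and 100, which agrees with the determinations described above.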

The process flow of the excluded word candidate extraction program 121 performed in step 30 will be described with reference to FIG. 10. In the information search system according to the present embodiment, either the excluded word candidate specifying method based on rule (1) or the one based on rule (2) can be selected. The storage device 3 stores a connection rule 32, and the connection rule 32 stores information on the rules (1) and (2) described above. The excluded word candidate extraction program 121 reads either rule (1) or rule (2) from the connection rule 32 based on designation from the user. Alternatively, an administrator of the information search system, as distinct from the user, may determine the rule (rule (1) or rule (2)) to be read by the excluded word candidate extraction program 121 and designate that rule to the search program 120 (or the excluded word candidate extraction program 121).

Step 310: The excluded word candidate extraction program 121 reads the information of the rule (1) or the rule (2) from the connection rule 32 in the storage device 3. In the following, a case where the information of rule (1) is read in step 310 (that is, a case where an excluded word candidate is specified based on rule (1)) will be described.

Step 320: The excluded word candidate extraction program 121 selects one hit location 503 from the hit location information 500, and specifies the appearance position of the hit word in the search target data 31. Further, the excluded word candidate extraction program 121 prepares two variables (a forward pointer H and a backward pointer T) for specifying the adjacent character strings of the hit word. As described for the hit location information 500, two types of information, a row number and an offset, are used to specify a position in the search target data 31, but only an offset is stored in the forward pointer H and the backward pointer T. As another embodiment, however, a set of row number and offset may be stored in the forward pointer H and the backward pointer T.

In step 320, the excluded word candidate extraction program 121 sets initial values for the forward pointer H and the backward pointer T. The forward pointer H and the backward pointer T will be described with reference to FIG. 11. The forward pointer H points to a character located before the appearance position of the hit word, and its initial value is the position one character before the start position of the hit word. In the example of FIG. 11, the position of the hit word “log” (more precisely, the position of “ro”, the first character of the character string “log”) is (line number = 22, offset = 4). Therefore, 3 (= 4 - 1) is set as the initial value of the forward pointer H. The backward pointer T points to a character located behind the hit word, and its initial value is the position of the character immediately after the end of the hit word. In the example of FIG. 11, the position of “ra”, the adjacent character immediately after the hit word “log”, that is, 6, is stored as the initial value.

Step 340: The excluded word candidate extraction program 121 determines whether the character string pointed to by the forward pointer H is an excluded character string. The determination method is as described above. When rule (1) is used, the determination is based on the character type. The excluded word candidate extraction program 121 therefore compares the character type of the character pointed to by the forward pointer H with that of the character adjacent behind it. If the comparison shows that the character pointed to by the forward pointer H matches rule (1) described above, the excluded word candidate extraction program 121 determines that this character is an excluded character string. If the character pointed to by the forward pointer H is a delimiter, the excluded word candidate extraction program 121 determines that it is not an excluded character string.

Step 350: The excluded word candidate extraction program 121 updates the value of the forward pointer H so that it points to the character immediately preceding the character currently pointed to. Specifically, the excluded word candidate extraction program 121 subtracts 1 from the value of the forward pointer H. Thereafter, the excluded word candidate extraction program 121 executes step 340 for the character string pointed to by the updated forward pointer H. The excluded word candidate extraction program 121 repeats the processing of steps 340 and 350 until the character string pointed to by the forward pointer H is determined not to be an excluded character string (or until the forward pointer H becomes 0).

Referring to the example of FIG. 11, the forward pointer H first points to the character “p”. In the determination in step 340, the character “p” is determined to be of the same character type (katakana) as the search word (“log”), so step 350 is executed next. In step 350, the excluded word candidate extraction program 121 makes the forward pointer H point to the character adjacent in front of the character “p” (that is, “no”) by subtracting 1 from H, which gives 2, and the determination in step 340 is performed again.

The character “no” is of a character type (hiragana) different from that of the adjacent character “p”. Therefore, when the excluded word candidate extraction program 121 performs the determination in step 340 for the character “no”, it determines that the selected character “no” is not an excluded character string. As a result, in the example of FIG. 11, the character “p” is determined to be an excluded character string, but the characters positioned in front of it (“no” and “ko”) are determined not to be excluded character strings.

Step 370: The excluded word candidate extraction program 121 determines whether the character string pointed to by the backward pointer T is an excluded character string. The determination method is the same as in step 340. That is, the excluded word candidate extraction program 121 compares the character type of the character pointed to by the backward pointer T with that of the character adjacent in front of it and determines whether they are the same character type (if they are, the character pointed to by the backward pointer T is determined to be an excluded character string). If the character pointed to by the backward pointer T is a delimiter, it is determined not to be an excluded character string. If the selected character string is determined to be an excluded character string (step 370: connected), step 380 is performed next; otherwise, step 390 is performed.

Step 380: The excluded word candidate extraction program 121 updates the value of the backward pointer T so that it points to the character immediately after the character currently pointed to. This is the reverse of the update of the forward pointer H; that is, the excluded word candidate extraction program 121 adds 1 to the backward pointer T, which selects the one character adjacent to the rear of the character string determined in step 370. The excluded word candidate extraction program 121 repeats the processing of steps 370 and 380 until the character string pointed to by the backward pointer T is determined not to be an excluded character string (or until the backward pointer T points to the end of the search target data 31).

Referring to the example of FIG. 11, the backward pointer T first points to the character “ra”. In the determination in step 370, the character “ra” is determined to be of the same character type (katakana) as the search word (“log”), so step 380 is executed next. In step 380, the excluded word candidate extraction program 121 updates the backward pointer T so that it points to the character adjacent behind “ra” (that is, “m”) by adding 1 to T, which gives 7. Then, the determination in step 370 is performed again.

Since the character “m” has the same character type as the character “ra” adjacent in front of it, the selected character string “m” is determined to be an excluded character string by the determination in step 370. Therefore, the excluded word candidate extraction program 121 executes step 380 again. By executing step 380, the backward pointer T points to the character “de”. Therefore, the determination in step 370 is performed for the character “de”.

Since the character “de” is of a character type (hiragana) different from that of the character “m” adjacent in front of it, it is determined in step 370 that the character “de” is not an excluded character string. As a result, in the example of FIG. 11, the characters “ra” and “m” are determined to be excluded character strings, but the characters located behind them (“de” and “ha”) are determined not to be excluded character strings.

Step 390: The excluded word candidate extraction program 121 adds 1 to the forward pointer H and subtracts 1 from the backward pointer T. Then, the excluded word candidate extraction program 121 determines, as an excluded word candidate, the character string whose first character is the character pointed to by the forward pointer H and whose last character is the character pointed to by the backward pointer T, and records the determined excluded word candidate in the candidate 553 of the excluded word candidate list 550. For example, when the processing up to step 380 has been executed on the character string shown in FIG. 11, the forward pointer H points to the character “no” and the backward pointer T points to the character “de”. After 1 is added to the forward pointer H and 1 is subtracted from the backward pointer T in step 390, the forward pointer H points to the character “p” and the backward pointer T points to the character “m”. Therefore, the character string “program” is determined as an excluded word candidate.

Of course, there are cases in which no excluded word candidate is obtained. For example, as in the example shown in FIG. 12, when the search target data 31 includes the character string “log for detection”, the characters positioned immediately before and after the hit word “log” are all hiragana, which differs from the character type (katakana) of the hit word “log”. In this case, the characters (“ni” and “ga”) located before and after the hit word “log” are determined not to be excluded character strings, so no excluded word candidate is obtained and nothing is recorded in the excluded word candidate list 550. Likewise, when the excluded word candidate obtained in step 390 is already recorded in the excluded word candidate list 550, it is not recorded again.
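Steps 320 to 390 under rule (1) can be sketched as below. This is an illustrative sketch under several assumptions: the function names are hypothetical, character types are approximated by Unicode block ranges, Python's 0-based indices replace the 1-based offsets of FIG. 11, and the sample strings are assumed Japanese renderings (このプログラムでは for “in this program”, with ログ for “log”), chosen because they reproduce the offsets given in the text.

```python
def char_type(ch):
    """Coarse character type by Unicode block (an assumed classification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isspace() or not ch.isalnum():
        return "delimiter"
    return "other"


def connected(ch, neighbour):
    """Rule (1): ch is an excluded character when it is not a delimiter
    and has the same character type as its neighbour on the hit-word side."""
    kind = char_type(ch)
    return kind != "delimiter" and kind == char_type(neighbour)


def expand_rule1(line, start, length):
    """Return the excluded word candidate around the hit word, or None.

    start is the 0-based index of the hit word in line.
    """
    h = start - 1            # forward pointer H (step 320)
    t = start + length       # backward pointer T (step 320)
    while h >= 0 and connected(line[h], line[h + 1]):         # steps 340/350
        h -= 1
    while t < len(line) and connected(line[t], line[t - 1]):  # steps 370/380
        t += 1
    h, t = h + 1, t - 1      # step 390: step back onto the last accepted characters
    candidate = line[h:t + 1]
    # No candidate when nothing was attached to the hit word.
    return candidate if len(candidate) > length else None


print(expand_rule1("このプログラムでは", 3, 2))
print(expand_rule1("ここにログがある", 3, 2))
```

The second call uses a made-up string in which the hit word is surrounded by hiragana, so no candidate is produced, which corresponds to the situation of FIG. 12.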

Step 400: The excluded word candidate extraction program 121 determines whether or not the processing of Step 320 to Step 380 has been executed for all the hit locations 503 stored in the hit location information 500. If the processing of step 320 to step 380 has not been executed for all hit locations 503 (step 400: No), the excluded word candidate extraction program 121 performs the processing from step 320 again. If the processing of step 320 to step 380 has been executed for all hit locations 503 (step 400: Yes), then step 410 is performed.

Step 410: The excluded word candidate extraction program 121 counts the number of appearances in the search target data 31 for each excluded word candidate recorded in the excluded word candidate list 550. Then, the excluded word candidate extraction program 121 records the counted number of appearances in the excluded word candidate list 550 (number of appearances 553-2), and ends the process.

In the case where the rule (1) is read in step 310, the exclusion word candidate is specified according to the flow described above. Next, a case where rule (2) is read in step 310 will be described. In this case as well, the processing is performed in the same flow as described above, and therefore, the description below will focus on differences from the above description.

When rule (2) is read in step 310, the excluded word candidate extraction program 121 counts the appearance frequency of each character string in the search target data 31 and creates the appearance frequency information 600 before executing step 320. An example of the appearance frequency information 600 is shown in FIG. The excluded word candidate extraction program 121 decomposes all data in the search target data 31 (in the column 306) into character strings of two characters by the same method as the bi-gram method.

As a simple example, when the search target data 31 includes the character string “in this program” as shown in FIG. 9, the excluded word candidate extraction program 121 extracts two-character strings such as “this” and “pro” and stores them in the character string 610 column of the appearance frequency information 600. Thereafter, the excluded word candidate extraction program 121 counts the number of times each character string stored in the character string 610 column appears in the search target data 31 and stores the count in the appearance number column 620.
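Building the appearance frequency information can be sketched with a counter keyed by two-character string. The function name and the list-of-strings input are assumptions made for illustration; the mapping itself plays the role of the character string 610 and appearance number 620 columns.

```python
from collections import Counter


def bigram_counts(lines):
    """Count every overlapping two-character string in the given lines.

    lines stands in for the text column of the search target data
    (an assumed representation).
    """
    counts = Counter()
    for line in lines:
        for i in range(len(line) - 1):
            counts[line[i:i + 2]] += 1
    return counts


counts = bigram_counts(["abab", "ba"])
print(counts["ab"], counts["ba"], counts["zz"])
```

A `Counter` returns 0 for a two-character string that never appears, which is convenient when the frequency of an arbitrary adjacent string is looked up later.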

Since the processing performed in step 320 is the same as that described in the case where the rule (1) is read, description thereof is omitted here.

In step 340, the excluded word candidate extraction program 121 first selects the two-character string composed of the character pointed to by the forward pointer H and the one character adjacent behind it. For example, when the forward pointer H points to the character “p” as shown in FIG. 11, the two-character string consisting of “p” and the character adjacent behind it (that is, “pro”) is selected. Then, the excluded word candidate extraction program 121 refers to the appearance frequency information 600 created in step 310, compares the number of appearances 620 of the selected character string with the number of appearances of the hit word, and determines whether both appear at comparable frequencies. As described above, various methods can be used for this determination. For example, if
“number of appearances of the selected character string ÷ number of appearances of the hit word”
is within a predetermined range (for example, 0.5 to 2), it may be determined that both appear at comparable frequencies and that the selected character string is an excluded character string.

If both appear at the same frequency (step 340: connected), then step 350 is executed. On the other hand, if both do not appear at the same frequency (step 340: no connection), then step 360 is executed.

In step 350, the excluded word candidate extraction program 121 subtracts 1 from the forward pointer H. That is, it is the same as that described in the case where the rule (1) is read.

In step 370, the excluded word candidate extraction program 121 selects the two-character string composed of the character pointed to by the backward pointer T and the one character adjacent in front of it. For example, when the backward pointer T points to the character “ra” as shown in FIG. 11, the two-character string consisting of “ra” and the character adjacent in front of it (that is, “gra”) is selected. Then, the excluded word candidate extraction program 121 refers to the appearance frequency information 600 created in step 310, compares the number of appearances 620 of the selected character string with the number of appearances of the hit word, and determines whether both appear at comparable frequencies. This determination is the same as the determination performed in step 340. If both appear at comparable frequencies (step 370: connected), step 380 is executed next; if they do not (step 370: not connected), step 390 is executed next.

In step 380, the excluded word candidate extraction program 121 adds 1 to the backward pointer T, as described above. Thereafter, the process of step 370 is performed. The excluded word candidate extraction program 121 repeats the processing of steps 370 and 380 until the selected character string is determined not to be an excluded character string (or until the backward pointer T indicates the end of the search target data 31).

The processing from step 390 to step 410 is the same as the processing described in the case where rule (1) is read.
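The rule (2) variant of the pointer walk can be sketched as follows. It is an illustration under assumptions: the function name is hypothetical, indices are 0-based, the ratio range 0.5 to 2 is the example range from the text, and the sample string and the assignment of the FIG. 9 counts to specific Japanese two-character strings (プロ, グラ, ラム, のプ, ムで) are assumed reconstructions.

```python
from collections import Counter


def expand_rule2(line, start, length, counts, hit_count, low=0.5, high=2.0):
    """Return the excluded word candidate around the hit word, or None.

    counts maps two-character strings to their number of appearances;
    hit_count is the number of appearances of the hit word (assumed > 0).
    """
    def similar(bigram):
        # Ratio test from the description; the bounds are example values.
        return low <= counts[bigram] / hit_count <= high

    h = start - 1        # forward pointer H
    t = start + length   # backward pointer T
    # Steps 340/350: character at H plus the character behind it.
    while h >= 0 and similar(line[h:h + 2]):
        h -= 1
    # Steps 370/380: character at T plus the character in front of it.
    while t < len(line) and similar(line[t - 1:t + 1]):
        t += 1
    candidate = line[h + 1:t]   # step 390
    return candidate if len(candidate) > length else None


# Assumed mapping of the FIG. 9 counts onto two-character strings.
counts = Counter({"プロ": 10000, "グラ": 8000, "ラム": 8000, "のプ": 800, "ムで": 100})
print(expand_rule2("このプログラムでは", 3, 2, counts, hit_count=12000))
```

Because the frequency test replaces the character-type test, this variant needs no knowledge of scripts and only depends on the counts gathered beforehand.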

The above is the method by which the excluded word candidate extraction program 121 identifies excluded word candidates. The description above covered an example in which the excluded word candidate extraction program 121 first reads the information on rule (1) or rule (2). As another embodiment, however, the information on rule (1) or rule (2) may be embedded in the excluded word candidate extraction program 121 in advance. In that case, step 310 need not be executed.

The above is the information search method according to the first embodiment. The information search system according to the first embodiment specifies excluded word candidates based on the characteristics (character type and appearance frequency) of the character strings adjacent to hit words, and presents the specified excluded word candidates to the user. Since only information related to the adjacent character strings of the hit word (characteristic information such as character type and appearance frequency) is used to specify the excluded word candidates, the information search system according to the present embodiment needs no dictionary to find them. The user can then specify a word to be excluded from the search result (an excluded word) from among the excluded word candidates presented by the information search system, and when an excluded word is specified, the information search system presents search results that do not contain the specified excluded word. Thereby, the information search system according to the present embodiment can reduce search noise in the search result even when the user does not have much knowledge about the data to be searched and analyzed.

In the first embodiment, an example has been described in which the number of appearances of a search word (or an excluded word candidate) is output (displayed) as a search result. However, the method of providing a search result to a user is not limited to providing the number of appearances. For example, the information search system may be configured to output the contents of the lines including the hit word to the output device 25. Also, although the first embodiment described an example in which the information search system outputs (displays) the number of search words (or excluded word candidates) included in the search target data 31 as the number of appearances, the number of lines including the search word may be output (displayed) instead.

Subsequently, an information search method according to the second embodiment will be described. The hardware configuration of the information search system according to the second embodiment is the same as that described in the first embodiment.

In the information search system according to the second embodiment, the server 1 executes a search program 120 ′, an excluded word candidate extraction program 121 ′, and an excluded word candidate sort program 122 ′. Since these programs are almost the same as the search program 120, the excluded word candidate extraction program 121, and the excluded word candidate sort program 122 described in the first embodiment, the differences will be mainly described below.

On the other hand, in the client 2, the client program 221 'is executed. Similar to the client program 221 described in the first embodiment, the client program 221 ′ provides a GUI (Graphical User Interface) for a user to issue an information search instruction. However, the content of the information output to the output device 25 is slightly different from that in the first embodiment.

The information search system according to the first embodiment determines whether a word is an excluded word candidate based on the character type and appearance frequency of the character strings adjacent to the hit word. The information search system according to the second embodiment makes the same determination. The information search system according to the first embodiment uses only the unstructured text information (column 306) in the search target data 31, whereas the information search system according to the second embodiment also uses the structured data in columns 301 to 305 to determine the display order of the excluded word candidates.

The concept of the method performed in the information search system according to the second embodiment will be described with reference to FIG. 14. FIG. 14 shows an example of the search target data 31, and the contents are the same as those shown in FIG.

Also in the information search system according to the second embodiment, excluded word candidates are determined by the same method as described in the first embodiment. When the search word is “log”, searching the text data in the column 306 identifies, for example, “blog” and “program” as excluded word candidates.

The information search system according to the second embodiment further determines the display priority of the excluded word candidate by using values stored in columns other than the column 306 of the search target data 31. For example, referring to a column other than the column 306 for each row in which the same word as the search word “log” appears, the same value may be stored. For example, in the example of FIG. 14, there are two lines in which the search term “log” appears. In each column 304 (column whose column name is “component”), the word “register” is stored as a value. That is, in the example of FIG. 14, the value “register” in the column 304 is a value co-occurring with the search term “log”.

Also, when columns other than the column 306 are referred to for the lines in which an excluded word candidate (for example, “blog”) appears, there may be a value that co-occurs with the excluded word candidate. When a value with a high co-occurrence rate with an excluded word candidate is the same as a value with a high co-occurrence rate with the search word, the excluded word candidate is considered to be highly similar to the search word and highly important for the user. Since it is better to display such words preferentially, the information search system according to this embodiment extracts a value having a high co-occurrence rate with the search word and a value having a high co-occurrence rate with each excluded word candidate, and determines the display priority of the excluded word candidates based on the extracted values.

The definition of the co-occurrence rate in this example will be described. First, the co-occurrence rate of a hit word and a certain value (called a column value) in a certain column (a column other than the column 306) is determined as follows. Let A be the number of rows of the search target data 31 in which the hit word exists. Let B be the number of those rows in which the column value is stored in the column concerned. In this case, the co-occurrence rate of the hit word and this column value is defined as B ÷ A.

The co-occurrence rate of an excluded word candidate and a column value is determined in the same way. That is, when the number of rows in the search target data 31 in which the excluded word candidate exists is A′, and the number of those rows in which the column value is stored in a column other than the column 306 is B′, the co-occurrence rate of the excluded word candidate and this column value is defined as B′ ÷ A′.
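The definition B ÷ A can be illustrated with a small sketch. The row representation (a list of dictionaries), the column names `component` and `text`, and the sample rows are all assumptions made for the example. Word presence is tested by naive substring matching, so a row containing only “blog” also counts as containing “log” here.

```python
def co_occurrence_rate(rows, word, column, value, text_column="text"):
    """Return B / A for the given word and column value.

    A: number of rows whose text contains word.
    B: number of those rows whose column holds value.
    """
    hit_rows = [row for row in rows if word in row[text_column]]
    if not hit_rows:
        return 0.0
    matching = sum(1 for row in hit_rows if row[column] == value)
    return matching / len(hit_rows)


# Hypothetical sample rows; "text" stands in for the column 306.
rows = [
    {"component": "register", "text": "log output failed"},
    {"component": "register", "text": "log rotated"},
    {"component": "ui", "text": "blog page updated"},
    {"component": "register", "text": "blog entry saved"},
]
print(co_occurrence_rate(rows, "blog", "component", "register"))
print(co_occurrence_rate(rows, "log", "component", "register"))
```

For “blog”, A′ is 2 and B′ is 1, giving 0.5; for “log” under substring matching, A is 4 and B is 3, giving 0.75.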

Subsequently, information used in the information search system according to the second embodiment will be described. In the information search system according to the first embodiment, the hit location information 500, the excluded word candidate list 550, and the appearance frequency information 600 are created. Among these, the same information is created for the hit location information 500 and the appearance frequency information 600 even in the information search system according to the second embodiment. Therefore, detailed description of these pieces of information is not performed in the second embodiment.

Also, the information search system according to the second embodiment creates an excluded word candidate list 550 'instead of the excluded word candidate list 550 described in the first embodiment. Furthermore, the information search system according to the second embodiment creates hit word information 500 '. Hereinafter, these two pieces of information will be described.

The format of the excluded word candidate list 550' will be described with reference to FIG. 15. The excluded word candidate list 550' is created by the excluded word candidate extraction program 121' and the excluded word candidate sort program 122'. The excluded word candidate list 550' includes a search word 551, a Length 552, and candidates 553'. The search word 551 and the Length 552 are the same as those included in the excluded word candidate list 550 in the first embodiment. One or more candidates 553' exist in the excluded word candidate list 550'. In the excluded word candidate list 550', the area where the candidates 553' are stored (the area immediately after the Length 552) is called a candidate area 553-0.

Each candidate 553' includes a candidate word 553-1', a number of rows 553-2', a hit location 553-3', and co-occurrence information 553-4. The candidate word 553-1' is the same as the candidate word 553-1 described in the first embodiment. The number of rows 553-2' is the number of rows in the search target data 31 that include the candidate word 553-1'. As shown in FIG. 15, the hit location 553-3' includes one or more row numbers of the rows in the search target data 31 in which the candidate word 553-1' exists. The number of row numbers stored in the hit location 553-3' is equal to the number of rows 553-2'.

The co-occurrence information 553-4 includes a column 553-41, a value 553-42, and a co-occurrence rate 553-43. For each excluded word candidate (word stored in candidate word 553-1 '), the information search system calculates a co-occurrence rate between the excluded word candidate and each column value of columns (column 301 to column 305). Of these, the column value with the highest co-occurrence rate is stored in the value 553-42, the column information (column name) to which the column value belongs is stored in the column 553-41, the value 553-42 and the excluded word candidate ( The co-occurrence rate of the candidate word 553-1 ′) is stored in the co-occurrence rate 553-43.
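Selecting the column value with the highest co-occurrence rate can be sketched as below. The function name, the dictionary row representation, and the sample data are assumptions; the returned triple corresponds to the column 553-41, the value 553-42, and the co-occurrence rate 553-43. Ties are resolved by first occurrence, which the description does not specify.

```python
from collections import Counter


def top_co_occurrence(rows, word, text_column="text"):
    """Return (column, value, rate) with the highest co-occurrence rate, or None."""
    hit_rows = [row for row in rows if word in row[text_column]]
    if not hit_rows:
        return None
    # Count each (column, value) pair over the rows that contain the word.
    pair_counts = Counter(
        (column, value)
        for row in hit_rows
        for column, value in row.items()
        if column != text_column
    )
    (column, value), count = pair_counts.most_common(1)[0]
    return column, value, count / len(hit_rows)


# Hypothetical sample rows; "text" stands in for the column 306.
rows = [
    {"component": "register", "severity": "info", "text": "log output failed"},
    {"component": "register", "severity": "warn", "text": "log rotated"},
    {"component": "ui", "severity": "info", "text": "blog page updated"},
    {"component": "register", "severity": "info", "text": "blog entry saved"},
]
print(top_co_occurrence(rows, "blog"))
```

In this sample, both rows containing “blog” have `severity` equal to `info`, so that pair is returned with a rate of 1.0.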

Subsequently, the format of the hit word information 500 'created by the information search system according to the second embodiment will be described with reference to FIG. The hit word information 500 ′ is information similar to the hit location information 500, but is information created in the process of step 40 (processing executed by the excluded word candidate sort program 122 ′).

Search word 501 and Length 502 are the same as the information included in hit location information 500 described in the first embodiment. The number of rows 504 ′ is the number of rows containing hit words among the rows in the search target data 31. However, when counting lines containing hit words, only the number of lines containing hit words that are not candidates for exclusion words is counted. For example, when the search term is “log” and “program” is specified as a candidate for the exclusion word, even if there is a line containing the character string “log” in the search target data 31, it is included in that line. If the character string “log” is “log” in “program” which is a candidate for excluded word, and if there is no character string including “log” other than the character string “program”, the line is not counted.

The co-occurrence information 505 includes a column 505-1, a value 505-2, and a co-occurrence rate 505-3. The information search system calculates the co-occurrence rate between the search word and each column value of the columns (column 301 to column 305), stores the column value with the largest co-occurrence rate in the value 505-2, stores the column name of that column value in the column 505-1, and stores the co-occurrence rate of the value 505-2 and the search word in the co-occurrence rate 505-3.

Subsequently, the flow of search processing by the information search system according to the second embodiment will be described. The processing executed by the search program 120' in the second embodiment is almost the same as that described in the first embodiment (FIG. 5). Therefore, the following describes how the search processing according to the second embodiment differs from the search processing described in the first embodiment.

In the search processing according to the second embodiment, the excluded word candidate extraction program 121' and the excluded word candidate sort program 122', called from the search program 120' in step 30 and step 40, create the excluded word candidate list 550' and the hit word information 500', which differs from the search processing described in the first embodiment. In addition, in step 50, the search program 120' transmits the hit word information 500' and the excluded word candidate list 550' to the client program 221', and the client program 221' uses them to display the result display screen 450' on the output device 25; this also differs from the first embodiment.

First, the processing of the excluded word candidate extraction program 121 'will be described. Also in the second embodiment, when the excluded word candidate extraction program 121 ′ is called from the search program 120 ′, the processing of FIG. 10 described in the first embodiment is executed. Steps 310 to 380 are the same as those described in the first embodiment, and a description thereof will be omitted.

In step 390 to step 410, the excluded word candidate extraction program 121′ creates the excluded word candidate list 550′. In step 390, the excluded word candidate extraction program 121′ first determines an excluded word candidate, in the same manner as described in the first embodiment. Then, when recording the excluded word candidate in the excluded word candidate list 550′, the excluded word candidate extraction program 121′ records the line number of the line in which the excluded word exists in the hit location 553-3′. In step 410, for each candidate 553′ in the excluded word candidate list 550′, the excluded word candidate extraction program 121′ records the number of line numbers recorded in the hit location 553-3′ in the number of lines 553-2′. In all other respects, the processing is the same as that described in the first embodiment.
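The bookkeeping of step 390 and step 410 can be sketched in Python as follows. The function name `build_candidate_list`, the dictionary layout, and the choice to record each line number only once per candidate are assumptions made for illustration, not details given in the specification.

```python
def build_candidate_list(hits):
    """Build a structure resembling the excluded word candidate list 550'.

    hits: iterable of (line_number, candidate_word) pairs, assumed to be
          the excluded word candidates already determined in step 390.
    """
    candidates = {}
    for line_number, word in hits:
        entry = candidates.setdefault(word, {"hit_lines": [], "num_lines": 0})
        # Assumption: a line is recorded once even if the word occurs twice in it.
        if line_number not in entry["hit_lines"]:
            entry["hit_lines"].append(line_number)
    # Analogue of step 410: store the count of recorded line numbers.
    for entry in candidates.values():
        entry["num_lines"] = len(entry["hit_lines"])
    return candidates

candidate_list = build_candidate_list(
    [(3, "application log"), (7, "application log"),
     (7, "application log"), (9, "catalog")]
)
```

In this sketch, `hit_lines` corresponds to the hit location 553-3′ and `num_lines` to the number of lines 553-2′.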

Subsequently, details of the processing performed when the search program 120′ calls the excluded word candidate sorting program 122′ in step 40 will be described with reference to FIG. Hereinafter, a case where only one search word is designated (that is, a case where only one excluded word candidate list 550′ is generated) will be described. However, the information search system according to the second embodiment may also accept a plurality of search words from the user. When a plurality of search words are designated by the user, the processing described below is executed for each created excluded word candidate list 550′.

Step 4010: The excluded word candidate sorting program 122 'refers to the excluded word candidate list 550' and selects one candidate 553 '.

Step 4020: The excluded word candidate sorting program 122′ refers to the hit location 553-3′ of the candidate 553′ selected in Step 4010 and reads, from among the rows of the search target data 31, all rows whose line numbers are recorded in the hit location 553-3′. Subsequently, for each column value included in the read rows, the excluded word candidate sorting program 122′ calculates the co-occurrence rate of the candidate word 553-1′ and that column value. The definition (calculation method) of the co-occurrence rate is as described above.

Then, the excluded word candidate sorting program 122 ′ stores the column value having the maximum co-occurrence rate and the column name of the column to which the column value belongs in the value 553-42 and the column 553-41 in the candidate 553 ′, respectively. The co-occurrence rate is stored in the co-occurrence rate 553-43.

Step 4030: The excluded word candidate sorting program 122 'determines whether step 4020 has been executed for all candidates 553' in the excluded word candidate list 550 '. When Step 4020 is executed for all candidates 553 '(Step 4030: Yes), Step 4040 is performed next. When the unprocessed candidate 553 'remains (Step 4030: No), the excluded word candidate sort program 122' performs the process from Step 4010 again.

Step 4040: In step 4040 to step 4050, the hit word information 500′ is created. In step 4040, the excluded word candidate sorting program 122′ copies the contents of the hit location information 500 to the hit word information 500′. Specifically, the search words 501 and 502 of the hit location information 500 are copied to the search words 501 and 502 of the hit word information 500′. Subsequently, the excluded word candidate sorting program 122′ reads, onto the memory 12, all rows of the search target data 31 that are specified by the row number 503-1 of the hit location information 500. Further, the excluded word candidate sorting program 122′ keeps, among the read rows, only the rows including hit words that are not excluded word candidates. The number of remaining rows is recorded in the number of lines 504′ of the hit word information 500′.

Step 4050: For each column value in the rows read out in Step 4040 (rows including hit words that are not excluded word candidates), the excluded word candidate sorting program 122′ calculates the co-occurrence rate of the hit word (search word) and that column value. Then, the excluded word candidate sorting program 122′ stores the column value having the maximum co-occurrence rate and the column name of the column to which the column value belongs in the value 505-2 and the column 505-1 of the co-occurrence information 505, respectively. The co-occurrence rate is stored in the co-occurrence rate 505-3.
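The row filtering of Step 4040 could look roughly like the Python sketch below. The specification does not say how a row is judged to contain a hit word that is not an excluded word candidate; this sketch assumes one possible approach, which is to mask every candidate occurrence and then check whether the search word still appears. The function name `rows_with_true_hits`, the row layout, and the sample data are hypothetical.

```python
def rows_with_true_hits(rows, hit_row_numbers, search_word, candidates,
                        text_column="text"):
    """Return row numbers whose text contains the search word outside any candidate."""
    kept = []
    # Mask longer candidates first so nested candidates are handled consistently.
    ordered = sorted(candidates, key=len, reverse=True)
    for n in hit_row_numbers:
        text = rows[n][text_column]
        for cand in ordered:
            # A NUL placeholder keeps the neighbouring text from joining up
            # into a new, accidental occurrence of the search word.
            text = text.replace(cand, "\0")
        if search_word in text:
            kept.append(n)
    return kept

rows = [
    {"text": "application log saved"},
    {"text": "log rotated; application log kept"},
    {"text": "catalog updated"},
    {"text": "no match here"},
]
kept = rows_with_true_hits(rows, [0, 1, 2], "log", ["application log", "catalog"])
num_lines = len(kept)  # analogue of the number of lines 504'
```

The rows listed in `kept` are the ones on which the co-occurrence rate of Step 4050 would then be computed.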

Step 4060: In Step 4060 to Step 4080, the plurality of candidates 553′ in the excluded word candidate list 550′ are rearranged. In step 4060, the excluded word candidate sorting program 122′ first reads all candidates 553′ in the excluded word candidate list 550′. Subsequently, from among the read candidates 553′, the excluded word candidate sorting program 122′ selects the candidates 553′ whose column 553-41 and value 553-42 are the same as the column 505-1 and the value 505-2 of the hit word information 500′. The selected candidates 553′ are sorted in descending order of the co-occurrence rate 553-43, and the sorted candidates 553′ are stored from the top of the candidate area 553-0 in the excluded word candidate list 550′.

Step 4070: Subsequently, from among the read candidates 553′, the excluded word candidate sorting program 122′ selects the candidates 553′ whose column 553-41 matches the column 505-1 of the hit word information 500′ but whose value 553-42 is not the same as the value 505-2. The candidates 553′ selected here are sorted in descending order of the co-occurrence rate 553-43, and the sorted candidates 553′ are sequentially stored in the candidate area 553-0.

Step 4080: Finally, the excluded word candidate sorting program 122′ sorts the candidates 553′ that were not selected in Step 4060 or Step 4070, from among the read candidates 553′, in descending order of the co-occurrence rate 553-43. Then, the excluded word candidate sorting program 122′ stores the sorted candidates 553′ in the candidate area 553-0 in order and ends the process. As a result, the candidates 553′ are stored in the candidate area 553-0 of the excluded word candidate list 550′ in the following order: the candidates sorted in step 4060, the candidates sorted in step 4070, and the candidates sorted in step 4080.
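The three-stage ordering of Step 4060 to Step 4080 can be expressed compactly as a single sort with a composite key. The following Python sketch is one way to do this; the dictionary keys (`column`, `value`, `rate`), the function name `sort_candidates`, and the sample candidates are assumptions made for illustration.

```python
def sort_candidates(candidates, hit_column, hit_value):
    """Order candidates as in Steps 4060-4080.

    Tier 0: same column and same value as the search word's co-occurrence info.
    Tier 1: same column, different value.
    Tier 2: everything else.
    Within each tier, higher co-occurrence rate comes first.
    """
    def tier(c):
        if c["column"] == hit_column and c["value"] == hit_value:
            return 0
        if c["column"] == hit_column:
            return 1
        return 2

    return sorted(candidates, key=lambda c: (tier(c), -c["rate"]))

candidates = [
    {"word": "catalog", "column": "severity", "value": "low", "rate": 0.9},
    {"word": "application log", "column": "component", "value": "registration", "rate": 1.0},
    {"word": "logic", "column": "component", "value": "billing", "rate": 0.7},
    {"word": "login", "column": "component", "value": "registration", "rate": 0.6},
]
ordered = sort_candidates(candidates, "component", "registration")
order = [c["word"] for c in ordered]
```

A single composite key gives the same result as three separate select-and-sort passes, because the tier number dominates the comparison and the rate only breaks ties within a tier.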

FIG. 18 shows an example of a result display screen 450′ displayed on the output device 25 (display) of the client 2 by the information search system according to the second embodiment. The viewpoint column 451, the excluded word candidate column 453, and the excluded word number column 454 are the same as those in the result display screen 450 described in the first embodiment. In addition to these, the result display screen 450′ in the second embodiment provides a co-occurrence column column 456, a co-occurrence value column 457, and a co-occurrence rate column 458, in which the co-occurrence information 553-4 sent from the search program 120′ is displayed.

Of the information displayed in the co-occurrence column column 456, the co-occurrence value column 457, and the co-occurrence rate column 458, the information displayed at the same height as the viewpoint column 451 is information about the search word displayed in the viewpoint column 451. In the example of FIG. 18, the column value “registration”, which has a high co-occurrence rate with the search word “log”, is displayed in the co-occurrence value column 457; the column name “component” of the column to which this column value belongs is displayed in the co-occurrence column column 456; and the co-occurrence rate (83%) of the search word “log” and the column value “registration” is displayed in the co-occurrence rate column 458.

Similarly, of the information displayed in the co-occurrence column column 456, the co-occurrence value column 457, and the co-occurrence rate column 458, the information displayed at the same height as each excluded word candidate (excluded word candidate column 453) is information about that excluded word candidate. In the example of FIG. 18, the column value “registration”, which has a high co-occurrence rate with the excluded word candidate “application log”, is displayed in the co-occurrence value column 457; the column name “component” of the column to which this column value belongs is displayed in the co-occurrence column column 456; and the co-occurrence rate (100%) of the excluded word candidate “application log” and the column value “registration” is displayed in the co-occurrence rate column 458.

Like the client program 221 in the first embodiment, the client program 221 ′ in the second embodiment also displays excluded word candidates in the order of candidates 553 ′ stored in the excluded word candidate list 550 ′. In the excluded word candidate list 550 ′, the excluded word candidates having the co-occurrence information 553-4 that is close (high in similarity) to the co-occurrence information 505 of the search word are sequentially stored according to the processing described in FIG. 17. Therefore, the excluded word candidates having the co-occurrence information 553-4 close to the search word co-occurrence information 505 are displayed in order on the result display screen 450 '.

The above is the description of the information search system according to the second embodiment. In the information search system according to the present embodiment, candidate words that are similar to the search word and have a close co-occurrence tendency with the column value of the specific column in the search target data are preferentially displayed. Candidate words having a close co-occurrence tendency are presumed to be highly relevant to the search word, and can be said to be important (consideration required) for the user. In the information search system according to the present embodiment, such words can be preferentially displayed.

The embodiments of the present invention have been described above, but they are examples for explaining the present invention and are not intended to limit the scope of the present invention to these embodiments. That is, the present invention can be implemented in various other forms.

For example, in the information search system according to the embodiment described above, a search client is provided separately from the search system server, and the user uses an input device and an output device of the client. However, it is not essential to provide a search client, and the client program may be executed by the search system server. In this case, the user may issue an information search request using the input device and output device of the search system server. In the embodiment described above, the storage device and the search system server are separate devices, but the storage device may be built in the search system server.

Further, in the embodiments described above, the excluded word candidate extraction program 121 identifies excluded word candidates based on the rule (1) or the rule (2). However, the excluded word candidate extraction program 121 need not base the identification on only the rule (1) (or only the rule (2)). For example, the excluded word candidate extraction program 121 may perform both the identification based on the rule (1) and the identification based on the rule (2), and present to the user only the candidate words identified by both processes. Alternatively, both the words identified based on the rule (1) and the words identified based on the rule (2) may be presented to the user.
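The two ways of combining the rules can be read as set operations. The Python sketch below assumes that each rule has already produced a set of candidate words, and it treats the first variant as an intersection and the second as a union; that reading, the function name `combine_candidates`, the `mode` parameter, and the sample sets are assumptions made for illustration.

```python
def combine_candidates(by_rule1, by_rule2, mode="both"):
    """Combine the candidate sets produced by rule (1) and rule (2).

    mode="both":   keep only words identified by both rules (intersection).
    mode="either": keep words identified by at least one rule (union).
    """
    if mode == "both":
        return by_rule1 & by_rule2
    if mode == "either":
        return by_rule1 | by_rule2
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical outputs of the two rules.
rule1 = {"application log", "catalog", "login"}
rule2 = {"application log", "login", "logic"}

strict = combine_candidates(rule1, rule2, mode="both")
broad = combine_candidates(rule1, rule2, mode="either")
```

The intersection presents fewer candidates that both rules agree on, while the union presents more candidates at the cost of a longer list for the user to review.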

Also, when the search system server finds many excluded word candidates, all of them may be displayed on the result display screen, or only some of them may be displayed. For example, among the plurality of excluded word candidates (candidates 553 or candidates 553′) included in the excluded word candidate list, only the first one, or only the first n (for example, n = 3), may be displayed. In the embodiments described above, an example was described in which a program executed on the server (the excluded word candidate sorting program) sorts the excluded word candidates (that is, determines their display order). As another embodiment, the client 2 may sort the excluded word candidates and thereby determine their display order.

1: Search system server, 2: Search client, 3: Storage device, 4: LAN

Claims

1. An information search method executed by a computer, the method comprising:
receiving a specification of a search word;
identifying a position where the search word is included in search target data;
identifying adjacent character strings, which are the character strings before and after the position where the search word is included;
determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word to be excluded from the search target; and
when the adjacent character string is determined to be a character string constituting an excluded word candidate, determining a word obtained by connecting the adjacent character string and the search word to be an excluded word candidate.

2. The information search method according to claim 1, wherein
the characteristic is a character type of the adjacent character string, and
in the determining step, when the adjacent character string is of the same character type as the search word, the adjacent character string is determined to be a character string constituting an excluded word candidate.

3. The information search method according to claim 1, wherein
the characteristic is an appearance frequency of the adjacent character string in the search target data, and
in the determining step, when a ratio or difference between the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range, the adjacent character string is determined to be a character string constituting an excluded word candidate.

4. The information search method according to claim 1, further comprising:
presenting to the user the number of search words included in the search target data and the one or more excluded word candidates determined in the determining step;
allowing the user to select an excluded word from the one or more presented excluded word candidates; and
presenting to the user a number obtained by subtracting the number of the selected excluded words from the number of search words included in the search target data.

5. The information search method according to claim 4, wherein
the search target data is tabular data having a plurality of rows and a plurality of columns, and
the step of identifying the position where the search word is included identifies the position by searching the text stored in a first column of the plurality of columns.

6. The information search method according to claim 5, wherein the step of presenting the excluded word candidates to the user includes:
calculating, for each column value included in a column other than the first column, a co-occurrence degree of the search word and the column value;
calculating a co-occurrence degree of the excluded word candidate and the column value; and
determining a priority order of the excluded word candidates to be presented to the user based on the co-occurrence degree of the search word and the column value and the co-occurrence degree of the excluded word candidate and the column value.

7. A computer-readable storage medium storing a program for causing a computer to execute:
receiving a specification of a search word;
identifying a position where the search word is included in search target data;
identifying adjacent character strings, which are the character strings before and after the position where the search word is included;
determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word to be excluded from the search target; and
when the adjacent character string is determined to be a character string constituting an excluded word candidate, determining a word obtained by connecting the adjacent character string and the search word to be an excluded word candidate.

8. The computer-readable storage medium according to claim 7, wherein
the characteristic is a character type of the adjacent character string, and
in the determining step, when the adjacent character string is of the same character type as the search word, the adjacent character string is determined to be a character string constituting an excluded word candidate.

9. The computer-readable storage medium according to claim 7, wherein
the characteristic is an appearance frequency of the adjacent character string in the search target data, and
in the determining step, when a ratio or difference between the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range, the adjacent character string is determined to be a character string constituting an excluded word candidate.

10. The computer-readable storage medium according to claim 7, wherein
the search target data is tabular data having a plurality of rows and a plurality of columns, and
the step of identifying the position where the search word is included identifies the position by searching the text stored in a first column of the plurality of columns.

11. The computer-readable storage medium according to claim 10, wherein the program further causes the computer to execute:
calculating, for each column value included in a column other than the first column, a co-occurrence degree of the search word and the column value;
calculating a co-occurrence degree of the excluded word candidate and the column value;
determining a priority order of the excluded word candidates to be presented to the user based on the co-occurrence degree of the search word and the column value and the co-occurrence degree of the excluded word candidate and the column value; and
presenting to the user the number of search words included in the search target data and the excluded word candidates to be presented to the user.

12. An information search system that executes:
receiving a specification of a search word;
identifying a position where the search word is included in search target data;
identifying adjacent character strings, which are the character strings before and after the position where the search word is included;
determining, based on a characteristic of the adjacent character string, whether the adjacent character string is a character string constituting an excluded word candidate, which is a word to be excluded from the search target; and
when the adjacent character string is determined to be a character string constituting an excluded word candidate, determining a word obtained by connecting the adjacent character string and the search word to be an excluded word candidate.

13. The information search system according to claim 12, wherein
the characteristic is a character type of the adjacent character string, and
in the determining step, when the adjacent character string is of the same character type as the search word, the adjacent character string is determined to be a character string constituting an excluded word candidate.

14. The information search system according to claim 12, wherein
the characteristic is an appearance frequency of the adjacent character string in the search target data, and
in the determining step, when a ratio or difference between the appearance frequency of the adjacent character string and the appearance frequency of the search word is within a predetermined range, the adjacent character string is determined to be a character string constituting an excluded word candidate.

15. The information search system according to claim 12, wherein
the search target data is tabular data having a plurality of rows and a plurality of columns,
the step of identifying the position where the search word is included identifies the position by searching the text stored in a first column of the plurality of columns, and
the information search system further executes:
calculating, for each column value included in a column other than the first column, a co-occurrence degree of the search word and the column value;
calculating a co-occurrence degree of the excluded word candidate and the column value;
determining a priority order of the excluded word candidates to be presented to the user based on the co-occurrence degree of the search word and the column value and the co-occurrence degree of the excluded word candidate and the column value; and
presenting to the user the number of search words included in the search target data and the excluded word candidates to be presented to the user.