WO2007105642A1 - Information retrieval device by means of ambiguous word and program

Publication number: WO2007105642A1
Application number: PCT/JP2007/054692
Authority: WO
Inventors: Masaki Murata; Kouichi Doi; Tomohiro Mitsumori; Yasushi Fukuda
Original assignee: National Institute Of Information And Communications Technology; National University Corporation NARA Institute of Science and Technology
Priority date: 2006-03-10
Filing date: 2007-03-09
Publication date: 2007-09-20
Also published as: CN101405725A; JP4857448B2; JP2007241794A

Abstract

An information retrieval device reliably retrieves articles in a specified field when given a polysemous (ambiguous) keyword. The information retrieval device comprises an input means (1) for inputting a keyword and a field, a database (4) for storing articles in each field, and a retrieval extracting means (2) that extracts articles including the input keyword and field from the database (4), extracts a word group (A) that appears disproportionately in the extracted articles, and outputs the articles including the input keyword in descending order of the number of words of the word group (A) they contain.

Description

Specification

Information retrieval apparatus and program using polysemous words

Technical field

[0001] The present invention relates to an information retrieval apparatus and program using polysemous words, which perform a search that takes word ambiguity into account. For example, the word “WINS” has two senses: it is both a computer term and a horse racing term. If a search is run with only “WINS” as input, results related to the computer term and results related to the horse racing term are output mixed together. A user who wants only articles related to the computer term finds such results inconvenient, and this problem needs to be resolved.

Background art

Conventionally, there has been a technique for performing information retrieval by providing a search keyword (see Non-Patent Document 1). However, at the search stage, it was not possible to give input that takes word ambiguity into account.

Non-Patent Document 1: Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hiromi Ozaku, Masao Uchiyama, Hitoshi Isahara, “Information Retrieval Using Location Information and Field Information”, Natural Language Processing (Journal of the Association for Natural Language Processing), April 2000, Vol. 7, No. 2, pp. 141-160

Disclosure of the invention

Problems to be solved by the invention

[0003] With the above-described conventional technology for retrieving information from given keywords, it was difficult at the search stage to give input that takes word ambiguity into account, and therefore unnecessary information was sometimes retrieved and output.

[0004] An object of the present invention is to solve the above problems, perform a search in consideration of the ambiguity of words, and search (output) only necessary information.

Means for solving the problem

[0005] FIG. 1 is an explanatory diagram of the information retrieval apparatus using polysemous words according to the present invention. In FIG. 1, 1 is an input unit (input means), 2 is a search extraction unit (search extraction means), 4 is a database (storage means), and 5 is an output unit (output means). The present invention has the following means in order to solve the above conventional problems.

[0007] (1): The apparatus comprises an input means 1 for inputting a keyword and a field, a database 4 for storing articles in each field, and a search extraction means 2 that extracts articles including the input keyword and field from the database 4, extracts a word group A that appears biased toward the extracted article group, and outputs the articles including the input keyword in descending order of how much of the word group A they contain. Articles in the input field can therefore be retrieved using a polysemous keyword.

[0008] (2): The apparatus comprises an input means 1 for inputting a keyword and a field, a database 4 for storing articles in each field, and a search extraction means 2 that extracts articles including both the input keyword and field from the database 4, extracts articles similar to the extracted article group B, and extracts and outputs only those similar articles that include the input keyword. Articles in the input field can therefore be retrieved using a polysemous keyword.

[0009] (3): In the information retrieval apparatus using polysemous words of (2), when extracting and outputting only those similar articles that include the input keyword, the search extraction means 2 outputs them in descending order of similarity to the article group B. Articles in the input field can therefore be reliably retrieved using a polysemous keyword.

[0010] (4): The apparatus comprises an input means 1 for inputting a keyword, a database 4 for storing articles in each field, a search extraction means 2 that extracts articles including the input keyword from the database 4, clusters the extracted article group, and extracts expressions that appear biased toward each cluster, and an inquiry means for having one of the expressions that appear biased toward each cluster selected. The search extraction means 2 outputs the articles of the cluster of the expression selected by the inquiry means. Articles in the desired field can therefore be retrieved easily by inputting only a keyword.

[0011] (5): In the information retrieval apparatus using polysemous words of (1) to (3), the search extraction means 2 extracts articles including the keyword input by the input means 1 from the database 4, clusters the extracted article group, and extracts expressions that appear biased toward each cluster. An inquiry means has one of the expressions that appear biased toward each cluster selected, and the expression selected by the inquiry means is used as the field input to the input means 1. Articles in the desired field can therefore be retrieved easily by inputting a keyword.

[0012] (6): A program for causing a computer to function as an input means 1 for inputting a keyword and a field, a database 4 for storing articles in each field, and a search extraction means 2 that extracts articles including the input keyword and field from the database 4, extracts a word group A that appears biased toward the extracted article group, and outputs the articles including the input keyword in descending order of how much of the word group A they contain. Installing this program on a computer therefore makes it easy to provide an information retrieval apparatus using polysemous words that can retrieve articles in the input field using a polysemous keyword.

[0013] (7): A program for causing a computer to function as an input means 1 for inputting a keyword and a field, a database 4 for storing articles in each field, and a search extraction means 2 that extracts articles including both the input keyword and field from the database 4, extracts articles similar to the extracted article group B, and extracts and outputs only those similar articles that include the input keyword. Installing this program on a computer therefore makes it easy to provide an information retrieval apparatus using polysemous words that can retrieve articles in the input field using a polysemous keyword.

[0014] (8): A program for causing a computer to function as an input means 1 for inputting a keyword, a database 4 for storing articles in each field, a search extraction means 2 that extracts articles including the input keyword from the database 4, clusters the extracted article group, and extracts expressions that appear biased toward each cluster, an inquiry means for having one of the expressions that appear biased toward each cluster selected, and the search extraction means 2 that outputs the articles of the cluster of the expression selected by the inquiry means. Installing this program on a computer therefore makes it easy to provide an information retrieval apparatus using polysemous words that can easily retrieve articles in the desired field from a keyword alone.

The invention's effect

[0015] The present invention has the following effects.

[0016] (1): The search extraction means extracts articles including the input keyword and field from the database, extracts a word group A that appears biased toward the extracted article group, and outputs the articles including the input keyword in descending order of how much of the word group A they contain. Articles in the input field can therefore be retrieved using a polysemous keyword.

[0017] (2): The search extraction means extracts articles including both the input keyword and field from the database 4, extracts articles similar to the extracted article group B, and extracts and outputs only those similar articles that include the input keyword. Articles in the input field can therefore be retrieved using a polysemous keyword.

[0018] (3): When extracting and outputting only those similar articles that include the input keyword, the search extraction means outputs them in descending order of similarity to the article group B. Articles in the input field can therefore be reliably retrieved using a polysemous keyword.

[0019] (4): The search extraction means extracts articles including the input keyword from the database, clusters the extracted article group, and extracts expressions that appear biased toward each cluster; the inquiry means has one of those expressions selected; and the search extraction means outputs the articles of the cluster of the expression selected by the inquiry means. Articles in the desired field can therefore be retrieved easily by inputting only a keyword.

[0020] (5): The search extraction means extracts articles including the input keyword, clusters the extracted article group, and extracts expressions that appear biased toward each cluster; the inquiry means has one of those expressions selected; and the expression selected by the inquiry means is used as the field input to the input means. Articles in the desired field can therefore be retrieved easily by inputting a keyword.

Brief Description of Drawings

[0021] FIG. 1 is an explanatory diagram of the information retrieval apparatus using polysemous words according to the present invention.

FIG. 2 is a flowchart (1) of information retrieval using polysemous words according to the present invention.

FIG. 3 is a flowchart (2) of information retrieval using polysemous words according to the present invention.

FIG. 4 is an explanatory diagram of the information retrieval apparatus using polysemous words provided with the inquiry unit according to the present invention.

FIG. 5 is a flowchart (3) of information retrieval using polysemous words according to the present invention.

Explanation of symbols

[0022] 1 Input section (input means)

2 Search extraction unit (Search extraction means)

4 Database (storage means)

5 Output section (output means)

BEST MODE FOR CARRYING OUT THE INVENTION

[0023] The information retrieval apparatus using polysemous words according to the present invention performs retrieval that takes word ambiguity into account. For example, the word “WINS” has two senses: a computer term and a horse racing term. If a search is run with only “WINS” as input, results related to the computer term and results related to the horse racing term are output mixed together. When the user wants only articles related to the computer term, the solutions described below (Solutions 1 to 3) can be used.

[0024] (1): Description of the information retrieval apparatus using polysemous words

FIG. 1 is an explanatory diagram of the information retrieval apparatus using polysemous words. In FIG. 1, the information retrieval apparatus (system) using polysemous words is provided with an input unit (input means) 1, a search extraction unit (search extraction means) 2, a database (storage means) 4, and an output unit (output means) 5.

The input unit 1 is an input means for inputting information such as keywords. The search extraction unit 2 is a search extraction unit that performs word extraction, search processing, and the like. Database 4 is a storage means for storing information (including information such as Web). The output unit 5 is output means for outputting information by displaying and printing.

[0026] (2): Explanation of information retrieval using polysemous words 1 (Solution 1)

The user enters input in the form “keyword (field)”, specifying a field. In the earlier example, the user enters “WINS (computer)”.

When this input is made, first, articles including “WINS” are extracted. Next, articles including “computer” are extracted from that article group. From among the articles including “WINS”, the word group A that appears biased toward the article group including “computer” is extracted. The articles including “WINS” are then output in descending order of how much of word group A they contain. Word group A consists of expressions that appear frequently in computer-related articles, so an article in which many of these expressions appear is expected to be a computer-related article. Outputting such articles solves the problem.

[0028] (Description by flowchart)

FIG. 2 is a flowchart (1) of information retrieval using polysemous words. In the following, information retrieval using polysemous words (Solution 1) is described according to processes S1 to S5 in FIG. 2.

[0029] S1: The user inputs a keyword by designating a field using the input unit 1, and proceeds to processing S2.

[0030] S2: The search extraction unit 2 extracts an article including the keyword input from the database 4, and proceeds to processing S3.

[0031] S3: The search and extraction unit 2 extracts an article including the specified field from the extracted article group, and proceeds to processing S4.

[0032] S4: The search extraction unit 2 extracts a word group A that appears biased to the article group including the specified field from the article group including the input keyword, and proceeds to processing S5.

[0033] S5: The search extraction unit 2 outputs the articles including the input keyword to the output unit 5 in descending order of how much of word group A they contain.
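Processes S1 to S5 can be sketched as follows. This is a minimal illustration, not the specification's implementation: it assumes whitespace-separated lowercase tokens (Japanese text would need a morphological analyzer), it picks word group A with the simple appearance-rate ratio, and the function name `solution1` and the cutoff `top_k` are hypothetical.

```python
from collections import Counter


def solution1(articles, keyword, field, top_k=5):
    """Rank keyword articles by how many tokens of word group A they contain."""
    keyword, field = keyword.lower(), field.lower()
    docs = [a.lower().split() for a in articles]

    # S2: articles containing the keyword (article group C)
    c_idx = [i for i, d in enumerate(docs) if keyword in d]
    # S3: of those, articles that also contain the field word (article group B)
    b_idx = [i for i in c_idx if field in docs[i]]

    # S4: word group A = words whose rate in B is highest relative to C
    c_counts = Counter(w for i in c_idx for w in docs[i])
    b_counts = Counter(w for i in b_idx for w in docs[i])
    c_total, b_total = sum(c_counts.values()), sum(b_counts.values())
    if b_total == 0:
        return []
    ratio = {
        w: (b_counts[w] / b_total) / (c_counts[w] / c_total)
        for w in b_counts
        if w != keyword
    }
    group_a = set(sorted(ratio, key=ratio.get, reverse=True)[:top_k])

    # S5: output keyword articles, most word-group-A tokens first
    ranked = sorted(
        c_idx, key=lambda i: sum(w in group_a for w in docs[i]), reverse=True
    )
    return [articles[i] for i in ranked]
```

With a toy corpus of two computer-related and two horse-racing articles that all contain "wins", calling `solution1(corpus, "WINS", "computer")` places the computer-related articles first.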

[0034] a) Explanation 1 of the method for extracting word group A that appears biased toward a certain article group B (Solution 1)

This method can be used, for example, to extract the word group A that appears biased toward articles including “computer”. Let C be a larger article group that contains article group B. Article group C may be the whole database or a part of it. In Solution 1 above, C is the group of articles including “WINS”.

[0035] Solution 1 may also be carried out in another way. Instead of extracting, from among the articles including “WINS”, the word group A that appears biased toward the articles including “computer”, the word group A that appears biased toward the articles including “computer” may be extracted from the entire database and then used in processing. In that case, C is the entire database.

[0036] First, the appearance rate of A in C and the appearance rate of A in B are obtained.

[0037] The two appearance rates are

$$\text{appearance rate of A in C} = \frac{\text{number of occurrences of A in C}}{\text{total number of words in C}}$$

$$\text{appearance rate of A in B} = \frac{\text{number of occurrences of A in B}}{\text{total number of words in B}}$$

Next, the ratio

$$\frac{\text{appearance rate of A in B}}{\text{appearance rate of A in C}}$$

is obtained.

The higher this value is, the more biased toward article group B the word's appearances are.
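As a worked illustration with made-up counts (not taken from the specification), suppose a word occurs 30 times among 1,000 words in article group B and 40 times among 10,000 words in article group C:

$$\frac{30 / 1000}{40 / 10000} = \frac{0.03}{0.004} = 7.5$$

A ratio well above 1, as here, would indicate that the word appears biased toward article group B.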

[0038] b) Explanation 2 of the method for extracting word group A that appears biased toward a certain article group B

(Explanation using significance tests)

· Explanation of the binomial test

Let N be the number of occurrences of A in C, and let N1 be the number of occurrences of A in B.

[0039] Let N2 = N - N1.

[0040] Assuming that, when A appears in C, the probability that it appears in B is 0.5, find the probability that, out of the N total occurrences, A appeared in C but not in B N2 times or fewer.

[0041] This probability is

$$P_1 = \sum_{x=0}^{N_2} C(N_1 + N_2,\, x) \cdot 0.5^{x} \cdot 0.5^{\,N_1 + N_2 - x}$$

where the sum runs from x = 0 to x = N2, and C(A, B) is the number of ways of choosing B objects from A distinct objects.

If this probability is sufficiently small, it can be judged that N1 and N2 do not arise with equal probability, that is, that N1 is significantly larger than N2.

[0042] In a 5% test, the criterion for judging N1 to be significantly larger is whether P1 is less than 5%; in a 10% test, it is whether P1 is less than 10%.

[0043] Words that appear biased toward article group B are those for which N1 is judged to be significantly larger than N2. In addition, the smaller P1 is, the more biased toward article group B the word's appearances are.
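The binomial test above can be sketched as follows. The function names and the `alpha` parameter are illustrative choices, not names from the specification; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_p1(n1, n2):
    """P1: probability of at most n2 'not in B' outcomes out of n1 + n2
    occurrences when each occurrence falls in B with probability 0.5."""
    n = n1 + n2
    # 0.5**x * 0.5**(n - x) is 0.5**n for every x, so it factors out of the sum.
    return sum(comb(n, x) for x in range(n2 + 1)) * 0.5 ** n


def is_biased(n1, n2, alpha=0.05):
    """True when N1 is judged significantly larger than N2 at level alpha."""
    return n1 > n2 and binomial_p1(n1, n2) < alpha
```

For example, a word that occurs 9 times in B and once elsewhere in C gives P1 = 11/1024, which is below 5%, so the word would be judged biased toward B.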

[0044] · Explanation of the chi-square test

Let N1 be the number of occurrences of A in B and F1 the total number of word occurrences in B.

Let N2 be the number of occurrences of A that are in C but not in B.

Let F2 be the total number of words that are in C but not in B.

[0045] With N = N1 + N2, the chi-square value

$$\chi^2 = \frac{N \,\bigl(F_1 (N_2 - F_2) - (N_1 - F_1)\, F_2\bigr)^2}{(F_1 + F_2)\,\bigl(N - (F_1 + F_2)\bigr)\, N_1\, N_2}$$

is obtained.

[0046] The larger the chi-square value, the more significant the difference between R1 and R2. When the chi-square value is greater than 3.84, there is a significant difference at the 5% level, and when it is greater than 6.63, there is a significant difference at the 1% level.

[0047] When N1 > N2, the larger the chi-square value is, the more biased toward article group B the word's appearances are.

[0048] · Explanation of the ratio test, or more precisely, the test of the difference between ratios

With

$$p = \frac{F_1 + F_2}{N_1 + N_2}, \qquad p_1 = R_1, \qquad p_2 = R_2,$$

compute

$$Z = \frac{|p_1 - p_2|}{\sqrt{p\,(1 - p)\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$

(where sqrt denotes the square root). The larger Z is, the more significant the difference between R1 and R2. When Z is greater than 1.96, there is a significant difference at the 5% level, and when Z is greater than 2.58, there is a significant difference at the 1% level.

[0049] When N1 > N2, the larger Z is, the more biased toward article group B the word's appearances are judged to be.
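The chi-square test and the ratio-difference test can be sketched as follows. This is a sketch under stated assumptions, not a transcription of the printed formulas: it uses the textbook Pearson statistic for a 2×2 table and the textbook pooled two-proportion Z, treating N1 and N2 as occurrence counts and F1 and F2 as word totals per the definitions in [0044], and it assumes R1 = N1/F1 and R2 = N2/F2, which the text does not define. The printed formulas place N and F in the opposite roles, so the mapping used here should be read as an assumption.

```python
from math import sqrt


def chi_square_2x2(n1, f1, n2, f2):
    """Textbook Pearson chi-square for the table
    [[A in B, other words in B], [A in C-not-B, other words in C-not-B]]."""
    a, b = n1, f1 - n1
    c, d = n2, f2 - n2
    total = f1 + f2
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def z_ratio_difference(n1, f1, n2, f2):
    """Textbook pooled Z for the difference between rates R1 and R2."""
    r1, r2 = n1 / f1, n2 / f2          # assumed meaning of R1 and R2
    p = (n1 + n2) / (f1 + f2)          # pooled appearance rate
    return abs(r1 - r2) / sqrt(p * (1 - p) * (1 / f1 + 1 / f2))
```

For a 2×2 table these two textbook statistics agree in the sense that Z squared equals the chi-square value, which matches the paired thresholds in the text (1.96 squared is about 3.84, and 2.58 squared is about 6.63).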

[0050] These three test methods may be combined with the simple method of computing (appearance rate of A in B) / (appearance rate of A in C).

[0051] For example, among the words that show a significant difference at a risk rate of 5% or stricter, those with a larger value of (appearance rate of A in B) / (appearance rate of A in C) are regarded as words that appear more biased toward article group B.

[0052] c) Explanation of how to extract articles containing more of word group A (Solution 1)

The following formulas are basic knowledge in information retrieval. Articles with a large score(D) are retrieved.

[0053] (1) Explanation of the basic method (TF·IDF method)

$$\mathrm{score}(D) = \sum_{w \in W} \mathrm{tf}(w, D) \cdot \log\frac{N}{\mathrm{df}(w)}$$

The sum is taken over w ∈ W.

W is the set of keywords entered by the user.

tf(w, D) is the number of occurrences of w in document D.

df(w) is the number of documents, among all documents, in which w appears.

N is the total number of documents.

Documents with a high score(D) are output as search results.
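A minimal sketch of this TF·IDF score follows. Documents are assumed to be lists of tokens, and the function name `tfidf_score` is illustrative rather than taken from the specification.

```python
from collections import Counter
from math import log


def tfidf_score(query_words, doc_tokens, all_docs):
    """score(D) = sum over w in W of tf(w, D) * log(N / df(w))."""
    n = len(all_docs)                      # N: total number of documents
    tf = Counter(doc_tokens)               # tf(w, D)
    score = 0.0
    for w in set(query_words):
        df = sum(1 for d in all_docs if w in d)   # df(w)
        if df:                             # skip words that occur nowhere
            score += tf[w] * log(n / df)
    return score
```

A word that appears in every document contributes log(1) = 0, so only words that discriminate between documents affect the ranking.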

[0054] (2) Explanation of the Okapi weighting method of Robertson et al.

(Reference)

Masaki Murata, Qing Ma, Kiyotaka Uchimoto, Hiromi Ozaku, Masao Uchiyama, Hitoshi Isahara, “Information Retrieval Using Location Information and Field Information”, Natural Language Processing (Journal of the Association for Natural Language Processing), April 2000, Vol. 7, No. 2, pp. 141-160

This weighting method is known to perform well. The product of the tf term and the idf term inside the Σ of the Okapi formula is the Okapi weight, and this value is used as the weight of a word.

[0055] Okapi formula

$$\mathrm{score}(D) = \sum_{w \in W} \frac{\mathrm{tf}(w, D)}{\mathrm{tf}(w, D) + \dfrac{\mathrm{length}}{\mathrm{delta}}} \cdot \log\frac{N}{\mathrm{df}(w)}$$

The sum is taken over w ∈ W.

length is the length of article D, and delta is the average length of the articles.

As the length of an article, the number of bytes of the article or the number of words included in the article is used.
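The Okapi formula above can be sketched in the same style. This sketch measures article length in words (one of the two options the text names) and uses the illustrative name `okapi_score`.

```python
from collections import Counter
from math import log


def okapi_score(query_words, doc_tokens, all_docs):
    """score(D) = sum of tf / (tf + length / delta) * log(N / df)."""
    n = len(all_docs)
    delta = sum(len(d) for d in all_docs) / n   # average article length
    length = len(doc_tokens)                    # length of article D in words
    tf = Counter(doc_tokens)
    score = 0.0
    for w in set(query_words):
        df = sum(1 for d in all_docs if w in d)
        if df and tf[w]:                        # absent words contribute 0
            score += tf[w] / (tf[w] + length / delta) * log(n / df)
    return score
```

Compared with plain TF·IDF, the tf term saturates as a word repeats and is discounted for articles longer than average.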

[0056] The following information retrieval methods can also be used.

[0057] (Okapi reference)

S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, Okapi at TREC-3, TREC-3, 1994

(SMART reference)

Amit Singhal, AT&T at TREC-6, TREC-6, 1997

As more advanced information retrieval methods, the Okapi and SMART formulas may be used.

[0058] These methods use not only tf·idf but also the length of the article, which enables more accurate information retrieval.

[0059] Rocchio's formula can also be used in this method of extracting articles containing more of word group A.

[0060] (Reference) J. J. Rocchio, “Relevance feedback in information retrieval”, in The SMART Retrieval System, edited by G. Salton, Prentice Hall, Inc., pp. 313-323, 1971

Instead of log(N / df(w)), use

$$\{E(t) + k_{af} \cdot (\mathrm{RatioC}(t) - \mathrm{RatioD}(t))\} \cdot \log\frac{N}{\mathrm{df}(w)}$$

[0061] where

E(t) = 1 (for a keyword from the original search)

E(t) = 0 (otherwise)

RatioC(t) is the appearance rate of t in article group B

RatioD(t) is the appearance rate of t in article group C

score(D) is computed with log(N / df(w)) replaced by the above expression, and articles with a larger value are extracted as articles containing more of word group A.
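The replacement weight can be sketched as a single function. The specification does not state a value for k_af, so the default below is a placeholder assumption, and the function and parameter names are illustrative.

```python
from math import log


def rocchio_weight(t, original_keywords, rate_in_b, rate_in_c,
                   n_docs, df, k_af=1.0):
    """{E(t) + k_af * (RatioC(t) - RatioD(t))} * log(N / df).

    rate_in_b is RatioC(t), the appearance rate of t in article group B.
    rate_in_c is RatioD(t), the appearance rate of t in article group C.
    k_af is a tuning constant; 1.0 is only a placeholder.
    """
    e = 1.0 if t in original_keywords else 0.0
    return (e + k_af * (rate_in_b - rate_in_c)) * log(n_docs / df)
```

An original keyword whose rate is the same in B and C keeps its plain idf weight, while a word that is not an original keyword receives weight only in proportion to how much more often it appears in B than in C.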

[0062] The set W of words w summed over when computing score(D) consists of both the original keywords and word group A. However, the original keywords and word group A must not overlap.

[0063] As another method, the set W of words w summed over when computing score(D) consists of word group A only. Here too, the original keywords and word group A must not overlap.

[0064] The above is a rather complicated method using Rocchio's formula. More simply, articles with a larger total number of occurrences of the words in word group A may be extracted as articles containing more of word group A. Alternatively, articles in which a larger number of distinct words of word group A appear may be extracted as articles containing more of word group A.

[0065] (3): Explanation of information retrieval using polysemous words 2 (Solution 2)

The user enters input in the form “keyword (field)”, specifying a field. In the earlier example, the user enters “WINS (computer)”. When this input is made, articles containing both “WINS” and “computer” are first extracted; call this article group B. Next, articles similar to article group B are extracted. From those similar articles, only the articles that contain “WINS” are extracted and output as search results. At this time, articles are output in descending order of similarity to article group B. This too is expected to retrieve articles in the computer-related field.

[0066] (Description by flowchart)

FIG. 3 is a flowchart (2) of information retrieval using polysemous words. In the following, information retrieval using polysemous words (Solution 2) is described according to processes S11 to S14 in FIG. 3.

S11: Using the input unit 1, the user inputs a keyword specifying a field, and proceeds to processing S12.

S12: The search extraction unit 2 extracts articles including both the keyword and the field input from the database 4, and proceeds to processing S13.

[0069] S13: The search extraction unit 2 extracts similar articles in the extracted article group B, and proceeds to processing S14.

[0070] S14: The search extraction unit 2 extracts only the articles including the input keyword from the extracted similar articles, and outputs them as search results. At this time, the articles are output to the output unit 5 in descending order of similarity to article group B.
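Processes S11 to S14 can be sketched as follows. This is a simplified illustration: it uses the cosine of raw term-count vectors as the similarity, takes the highest similarity to any article in B as the similarity to article group B, and treats articles at or above a hypothetical `threshold` as "similar". None of these specific choices is fixed by the specification.

```python
from collections import Counter
from math import sqrt


def cosine(x, y):
    """Cosine of two sparse term-count vectors."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def solution2(articles, keyword, field, threshold=0.3):
    keyword, field = keyword.lower(), field.lower()
    vecs = [Counter(a.lower().split()) for a in articles]

    # S12: article group B contains both the keyword and the field word
    group_b = [v for v in vecs if keyword in v and field in v]
    if not group_b:
        return []

    scored = []
    for text, v in zip(articles, vecs):
        sim = max(cosine(v, b) for b in group_b)   # S13: similarity to group B
        if sim >= threshold and keyword in v:      # S14: keep keyword articles
            scored.append((sim, text))

    # S14: output in descending order of similarity to article group B
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]
```

Because the articles in B are trivially similar to themselves, they appear at the top of the output, followed by other keyword articles that share vocabulary with them.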

[0071] a) Explanation of a method for extracting similar articles of article group B (Solution 2)

Define the similarity between articles. tf·idf, Okapi, or SMART is used for this similarity. In tf·idf, Okapi, SMART, and so on, a query is compared with an article D; here the two articles x and y being compared may take the places of the query and article D. The words w may be the words contained in both x and y.

[0072] Alternatively, create vectors with each word as a dimension and the score of each word as the element: make a vector (vector_x) for article x from the words contained in article x, and a vector (vector_y) for article y from the words contained in article y, and use the cosine of these vectors, cos(vector_x, vector_y), as the similarity of the articles. tf·idf, Okapi, or SMART is used to calculate the score of each word. The expression after the Σ in each of those formulas is the expression for calculating the score, and its value is the score of each word.

[0073] For tf·idf, the expression is

$$\mathrm{tf}(w, D) \cdot \log\frac{N}{\mathrm{df}(w)}$$

and for Okapi, it is

$$\frac{\mathrm{tf}(w, D)}{\mathrm{tf}(w, D) + \dfrac{\mathrm{length}}{\mathrm{delta}}} \cdot \log\frac{N}{\mathrm{df}(w)}$$

[0074] The value of the cosine of such vectors, cos(vector_x, vector_y), may also be computed to judge that an article with a larger value contains more of word group A. In this case, vector_x is built from the words contained in word group A, and vector_y from the words contained in the article.

[0075] Methods for obtaining the similarity between article group B and article x include the following.

[0076] • A method in which the similarity between article x and the article in article group B most similar to article x is used as the similarity

• A method in which the average of the similarities between article x and all the articles in article group B is used as the similarity

Other methods may also be used. In this way, the similarity between article group B and article x is obtained, and articles with a high similarity are extracted as similar articles.
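The two ways of turning article-to-article similarity into a similarity against article group B can be sketched as below. The pairwise similarity is passed in as a function so that any measure can be plugged in; the names `group_similarity`, `sim`, and `mode` are illustrative.

```python
def group_similarity(x, group_b, sim, mode="max"):
    """Similarity between article x and article group B.

    sim(x, b) is any pairwise similarity function.
    mode="max" uses the article in B most similar to x;
    mode="average" averages over all articles in B.
    """
    sims = [sim(x, b) for b in group_b]
    if not sims:
        return 0.0
    if mode == "max":
        return max(sims)
    if mode == "average":
        return sum(sims) / len(sims)
    raise ValueError(f"unknown mode: {mode}")
```

The maximum favors articles that closely match any single member of B, while the average favors articles that resemble B as a whole.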

[0077] As another method, words that appear biased toward article group B may be extracted by the method described earlier, score(D) based on Rocchio's formula may be computed using those words, and articles with a large score(D) may be extracted as similar articles.

[0078] (4): Explanation of information retrieval using polysemous words 3 (Solution 3)

The user inputs only a “keyword”. In the earlier example, “WINS” is entered. When this input is made, first, articles including “WINS” are extracted. Next, those articles are clustered, and expressions that appear biased toward each cluster are extracted. For example, suppose the articles are divided into two clusters and the expressions that appear biased toward the clusters are “computer” and “horse racing”, respectively. In that case, the user is asked whether the query relates to “computer” or to “horse racing”, and the user selects one of them. After the selection, either the selected expression is treated as the input “field” and processed in the same manner as in Solutions 1 and 2 above, or the selected cluster is output as the search result.

[0079] (Description of the information retrieval apparatus using polysemous words with an inquiry unit)

FIG. 4 is an explanatory diagram of the information retrieval apparatus using polysemous words with an inquiry unit. In FIG. 4, the information retrieval apparatus (system) using polysemous words with an inquiry unit is provided with an input unit (input means) 1, a search extraction unit (search extraction means) 2, an inquiry unit (inquiry means) 3, a database (storage means) 4, and an output unit (output means) 5.

The input unit 1 is an input means for inputting information such as keywords. The search extraction unit 2 is a search extraction means that performs word extraction, search processing, and the like. The inquiry unit 3 is an inquiry means that presents the user with the expressions (technical fields, etc.) that appear biased toward each cluster and has the user select one. The database 4 is a storage means for storing information. The output unit 5 is an output means that outputs information by display and printing.

[0081] (Description by flowchart)

FIG. 5 is a flowchart (3) of information retrieval using polysemous words. In the following, information retrieval using polysemous words with an inquiry unit (Solution 3) is described according to processes S21 to S26 in FIG. 5.

S21: The user inputs only the keyword through the input unit 1, and the process proceeds to processing S22.

S22: The search extraction unit 2 extracts an article including the keyword input from the database 4, and proceeds to processing S23.

S23: The search extraction unit 2 clusters the extracted article group, and proceeds to processing S24.

S24: The search extraction unit 2 extracts expressions that appear unevenly in each cluster, and proceeds to processing S25.

[0086] S25: The inquiry unit 3 inquires the user so as to select an expression that appears biased in each cluster, and proceeds to processing S26.

S26: The search extraction unit 2 outputs the articles of the selected cluster to the output unit 5.

[0088] a) Explanation of clustering (Solution 3)

There are various methods for clustering. The general ones are described below.

[0089] (Description of hierarchical clustering (bottom-up clustering))

The closest members are brought together to form a cluster. Likewise, the clusters (or a cluster and a member) that are closest to each other are brought together.

Since there are various definitions of the distance between clusters, they are described below.

[0090] · A method in which the distance between cluster A and cluster B is the smallest of the distances between members of cluster A and members of cluster B

· A method in which the distance between cluster A and cluster B is the largest of the distances between members of cluster A and members of cluster B

· A method in which the distance between cluster A and cluster B is the average of the distances between all members of cluster A and all members of cluster B

· A method in which the average of the positions of all members of cluster A is taken as the position of cluster A, the average of the positions of all members of cluster B is taken as the position of cluster B, and the distance between those positions is the distance between the clusters

· There is also a method called Ward's method, which is described below.

[0091]

$$W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \bigl(x(i, j) - \mathrm{ave\_x}(i)\bigr)^2$$

[0092] The first Σ is the sum from i = 1 to i = g.

The second Σ is the sum from j = 1 to j = n_i.

x(i, j) is the position of the j-th member of the i-th cluster.

ave_x(i) is the average of the positions of all members of the i-th cluster.

Merging clusters increases the value of W. In Ward's method, clusters are merged in such a way that the value of W increases as little as possible.

[0093] The position of a member is obtained by extracting the words from the article, taking each word type as a dimension of a vector, and setting the vector element for each word to the word's frequency, the word's tf·idf value (that is, tf(w, D) * log(N / df(w))), or the word's Okapi value (that is, tf(w, D) / (tf(w, D) + length / delta) * log(N / df(w))). The resulting vector is used as the member's position.
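One merge step of Ward's method can be sketched as follows. The sketch assumes member positions are equal-length numeric vectors and reads the square in W as the squared Euclidean distance to the cluster mean; the function names are illustrative, and the brute-force search over pairs is chosen for clarity, not efficiency.

```python
def ward_w(clusters):
    """W: sum of squared distances from each member to its cluster mean."""
    w = 0.0
    for members in clusters:
        dim = len(members[0])
        mean = [sum(m[k] for m in members) / len(members) for k in range(dim)]
        w += sum((m[k] - mean[k]) ** 2 for m in members for k in range(dim))
    return w


def best_ward_merge(clusters):
    """Return (increase in W, i, j) for the pair whose merge raises W least."""
    base = ward_w(clusters)
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged = rest + [clusters[i] + clusters[j]]
            increase = ward_w(merged) - base
            if best is None or increase < best[0]:
                best = (increase, i, j)
    return best
```

Repeating `best_ward_merge` and applying the chosen merge until the desired number of clusters remains gives bottom-up clustering with Ward's criterion.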

[0094] (Description of top-down clustering (non-hierarchical clustering))

Hereinafter, a method of top-down clustering (non-hierarchical clustering) will be described.

[0095] (Explanation of maximum distance algorithm)

Take one member. Next, take the member farthest from that member. These members become the centers of their respective clusters. For each member, take the smallest of its distances to the cluster centers as that member's distance, and make the member with the largest such distance the center of a new cluster. Repeat this. Stop repeating when a predetermined number of clusters has been reached. The repetition may also be stopped when the distance between clusters falls below a predetermined value. Another method is to evaluate the quality of the clustering with the AIC information criterion and use that value to stop the repetition. Each member belongs to the cluster whose center is closest to it.
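The maximum distance algorithm can be sketched as follows, using the "predetermined number of clusters" stopping rule. Points are assumed to be numeric tuples, the first point is taken as the initial member, and the function names are illustrative; `math.dist` requires Python 3.8 or later.

```python
from math import dist


def max_distance_centers(points, k):
    """Pick k cluster centers by repeatedly taking the farthest member."""
    centers = [points[0]]                       # take one member
    while len(centers) < k:
        # each member's distance = smallest distance to any current center
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)                # it becomes a new center
    return centers


def assign_to_nearest(points, centers):
    """Index of the closest center for each member."""
    return [
        min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        for p in points
    ]
```

With a single center, the member with the largest distance is simply the member farthest from the first one, so the loop also covers the second step of the description.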

[0096] (Explanation of K-means method)

Consider clustering into a predetermined number k of clusters. Choose k members at random and use them as the cluster centers. Each member belongs to the cluster whose center is closest to it. The average of the members in each cluster then becomes the center of that cluster. Again, each member belongs to the cluster whose center is closest, and the average of the members in each cluster becomes the new center of that cluster. Repeat these steps. Stop repeating when the cluster centers no longer move, or after a predetermined number of repetitions. The clusters are obtained from the final cluster centers: each member belongs to the cluster whose center is closest to it.

In this way, clustering is performed. There are many other clustering methods that can be used.

[0098] b) Explanation of extraction of expressions that appear biased in each cluster (Solution 3)

It can be extracted in the same way as “Explanation 1 (Solution 1) for Extracting Word Group A that Appears Partly in a certain Article Group B”.

[0099] More simply, for each cluster, the words that appear only in that cluster may be arranged in order of frequency and extracted as the expressions that appear biased in that cluster.
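The simpler variant of paragraph [0099] can be sketched as below. The function name `exclusive_words`, the `top_n` cutoff, the use of raw occurrence counts as "frequency", and the sample clusters are assumptions for illustration.

```python
from collections import Counter

def exclusive_words(clusters, top_n=5):
    """For each cluster, list the words found in no other cluster, most frequent first.

    clusters: list of clusters, each a list of tokenized articles.
    """
    counts = [Counter(w for article in cluster for w in article)
              for cluster in clusters]
    result = []
    for i, count in enumerate(counts):
        # Words seen in any other cluster
        others = set()
        for j, other in enumerate(counts):
            if j != i:
                others.update(other)
        only_here = [(w, f) for w, f in count.items() if w not in others]
        # Order by descending frequency, then alphabetically for stable output
        only_here.sort(key=lambda wf: (-wf[1], wf[0]))
        result.append([w for w, _ in only_here[:top_n]])
    return result

# Hypothetical sample data: two clusters of articles containing "wins"
clusters = [
    [["wins", "server", "network"], ["wins", "server", "protocol"]],
    [["wins", "horse", "race"], ["wins", "horse", "ticket"]],
]
biased = exclusive_words(clusters)
```

Here "wins" is dropped because it occurs in both clusters, and the remaining words characterize the computer cluster and the horse racing cluster respectively.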

[0100] (5): Explanation of using multiple keywords

In Solutions 1 and 2 above, the input given first was a single keyword with a field, such as “WINS (computer)”, but it may consist of multiple keywords, such as “A B (B′) C (C′)”. This means an AND search of word A, word B (where B is meant in the sense of field B′), and word C (where C is meant in the sense of field C′).

[0101] a) Explanation by Solution 1

If this is done with Solution 1, the article group X containing A, B, and C is extracted. Next, the article group X′ containing B′ and C′ is extracted from the article group X. Within the article group X, the word group Y that appears biased toward the article group X′ is extracted. Then, from the article group X, the articles containing many words of the word group Y are extracted and output.
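The flow of Solution 1 with multiple keywords can be sketched as follows. The bias score used here (the share of X′ articles containing a word minus the share of X articles containing it) is only a simple stand-in for the bias determination described earlier in the specification, and the function name, the `top_words` cutoff, and the sample articles are hypothetical.

```python
def search_solution1(articles, keywords, fields, top_words=3):
    """articles: list of word sets. Returns indices of X ranked by number of Y words."""
    keywords, fields = set(keywords), set(fields)
    # Article group X: articles containing all keywords (AND search)
    x = [i for i, words in enumerate(articles) if keywords <= words]
    # Article group X': articles of X that also contain all field words
    x_prime = [i for i in x if fields <= articles[i]]
    if not x_prime:
        return []

    def bias(w):
        # Stand-in bias score: share in X' minus share in X
        in_x_prime = sum(w in articles[i] for i in x_prime) / len(x_prime)
        in_x = sum(w in articles[i] for i in x) / len(x)
        return in_x_prime - in_x

    vocabulary = set().union(*(articles[i] for i in x))
    ranked = sorted(vocabulary, key=lambda w: (-bias(w), w))
    # Word group Y: the most biased words with a positive score
    y = [w for w in ranked[:top_words] if bias(w) > 0]

    def score(i):
        return sum(w in articles[i] for w in y)

    hits = [i for i in x if score(i) > 0]
    return sorted(hits, key=lambda i: -score(i))

# Hypothetical sample data
articles = [
    {"wins", "guide", "computer", "server", "network"},
    {"wins", "guide", "computer", "network", "protocol"},
    {"wins", "guide", "server", "network"},
    {"wins", "guide", "horse", "race"},
    {"wins", "horse", "race"},
]
result = search_solution1(articles, keywords=["guide", "wins"], fields=["computer"])
```

In this sample, article 2 does not contain the field word "computer" but is still returned because it contains a word of Y, while the horse racing article 3 is left out.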

[0102] b) Explanation by Solution 2

If this is done with Solution 2, the article group X containing A, B, B′, C, and C′ is extracted. Next, articles similar to the article group X are extracted. From these similar articles, those that contain A, B, and C are extracted and output.
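A minimal sketch of Solution 2 with multiple keywords is shown below. The similarity measure (average Jaccard similarity between word sets), the `min_similarity` cutoff, and all names and sample data are assumptions for illustration; any article similarity measure could take their place.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def search_solution2(articles, keywords, fields, min_similarity=0.5):
    """articles: list of word sets. Returns indices ordered by similarity to X."""
    keywords, fields = set(keywords), set(fields)
    # Article group X: articles containing every keyword and every field word
    x = [i for i, words in enumerate(articles) if (keywords | fields) <= words]
    if not x:
        return []

    def similarity(i):
        # Average similarity of article i to the articles of group X
        return sum(jaccard(articles[i], articles[j]) for j in x) / len(x)

    # Keep similar articles, then only those that contain all keywords
    hits = [i for i, words in enumerate(articles)
            if similarity(i) >= min_similarity and keywords <= words]
    return sorted(hits, key=lambda i: -similarity(i))

# Hypothetical sample data
articles = [
    {"wins", "guide", "computer", "server", "network"},
    {"wins", "guide", "computer", "network", "protocol"},
    {"wins", "guide", "server", "network"},
    {"wins", "guide", "horse", "race"},
    {"wins", "horse", "race"},
]
result = search_solution2(articles, keywords=["guide", "wins"], fields=["computer"])
```

Article 2 lacks the field word but is similar enough to the article group X to be output. Article 3 contains the keywords but falls below the similarity cutoff.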

[0103] c) Explanation by Solution 3

Solution 3 is also possible. First, A, B, and C are input. Next, the articles containing A, B, and C are extracted. These articles are clustered, and the words Z that appear biased in each cluster are output. The user selects a word, and either the selected expression is processed as the “field” input in the same way as in Solutions 1 and 2 above, or the selected cluster is output as the search result.

[0104] Further, in Solution 3, it is better to show the word group Z that appears biased in each cluster in association with the inputs A, B, and C.

For example, suppose that the word group Z is Z1, Z2, Z3, .... Each of Z1, Z2, Z3, ... may be shown next to whichever of A, B, and C it often co-occurs with.

[0106] If Z1 often co-occurs with A, Z2 with C, and Z3 with B, the display is, for example,

Cluster 1: A Z1, B Z3, C Z2

Cluster 2: ...

and the user is asked to choose among Z1, Z2, Z3, ..., or to choose a cluster. This display may take other forms as long as the relation between the input keywords and Z1, Z2, ... is shown.

[0107] Whether Z1 often co-occurs with A is determined by one of the following methods.

[0108] · The more articles there are in which both Z1 and A appear, the more often they are assumed to co-occur.

[0109] · When the bias determination method described above is applied and A is determined to appear biased toward the articles containing Z1, they are assumed to co-occur often.

[0110] · Let a be the number of articles in which both Z1 and A appear, b the number of articles in which only Z1 appears, c the number of articles in which only A appears, d the number of articles in which neither appears, and n = a + b + c + d the total number of articles. One of the following expressions is used:

a

2a / (2a + b + c)

n (ad - bc)^2 / ((a + b) (c + d) (a + c) (b + d))

n (|ad - bc| - n/2)^2 / ((a + b) (c + d) (a + c) (b + d))

log (an / ((a + b) (a + c)))

(ad - bc) / ((a + c) (b + d))^0.5

a log (an / ((a + b) (a + c))) + b log (bn / ((a + b) (b + d))) + c log (cn / ((a + c) (c + d))) + d log (dn / ((b + d) (c + d)))

a / (bc + ad)

a / (ad - bc)

a / b / c

The larger the value of the chosen expression, the more often Z1 and A are assumed to co-occur.
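Some of these co-occurrence measures can be computed as in the sketch below. It assumes that d counts the articles in which neither Z1 nor A appears and that n = a + b + c + d, which is the reading under which the expressions containing n are the usual chi-square and log-ratio statistics of a 2x2 table. It also assumes all counts are positive, so that no denominator or logarithm argument is zero. The function name and sample counts are hypothetical.

```python
import math

def cooccurrence_measures(a, b, c, d):
    """Co-occurrence measures for a 2x2 table of article counts.

    a: both Z1 and A appear, b: only Z1, c: only A, d: neither (assumption).
    """
    n = a + b + c + d
    marginals = (a + b) * (c + d) * (a + c) * (b + d)
    return {
        "count": a,
        "dice": 2 * a / (2 * a + b + c),
        "chi_square": n * (a * d - b * c) ** 2 / marginals,
        "chi_square_corrected": n * (abs(a * d - b * c) - n / 2) ** 2 / marginals,
        "log_ratio": math.log(a * n / ((a + b) * (a + c))),
    }

# Hypothetical counts: Z1 and A appear together in 30 of 100 articles
scores = cooccurrence_measures(a=30, b=10, c=20, d=40)
# A table with no association at all, for comparison
independent = cooccurrence_measures(a=10, b=10, c=10, d=10)
```

For the table with no association, the chi-square and log-ratio values are both zero, whereas the first table yields clearly positive values, so Z1 and A would be treated as co-occurring often.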

[0111] Z1 often co-occurs with A.

[0112] In the embodiments described above, processing described as “taking out items in descending order of value” may instead be performed as “taking out items whose value is equal to or greater than a threshold”. Likewise, processing described as “taking out a predetermined number of items in descending order of value” may instead compute a value by multiplying the maximum of the values by a predetermined ratio and “take out items whose value is equal to or greater than the computed value”. Furthermore, these thresholds and predetermined values may be determined in advance, or may be changed and set by the user as appropriate.
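The three selection rules in this paragraph can be sketched as follows. The function names and the sample scores are hypothetical and serve only to contrast the rules.

```python
def take_top_k(scored, k):
    """Take the k items with the largest values."""
    ranked = sorted(scored, key=lambda item: -item[1])
    return [name for name, _ in ranked[:k]]

def take_at_least(scored, threshold):
    """Take every item whose value is equal to or greater than the threshold."""
    return [name for name, value in scored if value >= threshold]

def take_within_ratio(scored, ratio):
    """Take every item whose value is at least ratio times the maximum value."""
    cutoff = max(value for _, value in scored) * ratio
    return [name for name, value in scored if value >= cutoff]

# Hypothetical (word, score) pairs
scored = [("server", 0.9), ("network", 0.6), ("protocol", 0.5), ("horse", 0.1)]
top_two = take_top_k(scored, k=2)
above_threshold = take_at_least(scored, threshold=0.5)
near_maximum = take_within_ratio(scored, ratio=0.6)
```

The ratio rule adapts to the scale of the scores: with a ratio of 0.6 and a maximum of 0.9, only items scoring at least 0.54 are kept.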

[0113] (9): Explanation of program installation

The input section (input means) 1, the search extraction section (search extraction means) 2, the inquiry section (inquiry means) 3, the database (storage means) 4, the output section (output means) 5, and so on are implemented as programs, which are executed by the main control unit (CPU) and stored in the main memory. These programs are processed by a general-purpose computer (information processing apparatus). This computer is composed of hardware such as a main control unit, a main memory, a file device, a display device, and an input device serving as input means, such as a keyboard.

[0114] The program of the present invention is installed in this computer. For installation, the program is stored in advance on a portable recording medium such as a hard disk or a magneto-optical disk, and is installed in the file device of the computer either via a drive device provided in the computer for accessing the recording medium or via a network such as a LAN. The program steps required for processing are then read from the file device into the main memory and executed by the main control unit.

Claims

The scope of the claims

[1] An information retrieval apparatus using an ambiguous word, characterized by comprising:

input means for entering a keyword and a field;

a database that stores articles in each field; and

search extraction means for extracting from the database the articles that include the input keyword and field, extracting the word group A that appears biased in the extracted article group, and outputting the articles that include the input keyword in descending order of the number of words of the word group A they contain.

[2] An information retrieval apparatus using an ambiguous word, characterized by comprising:

input means for entering a keyword and a field;

a database that stores articles in each field; and

search extraction means for extracting from the database the articles that include both the input keyword and the field, extracting articles similar to the extracted article group B, and extracting and outputting, from the extracted similar articles, only the articles that include the input keyword.

[3] The information retrieval apparatus using an ambiguous word according to claim 2, wherein, when extracting and outputting only the articles that include the input keyword from the extracted similar articles, the search extraction means outputs the articles in descending order of similarity to the article group B.

[4] An information retrieval apparatus using an ambiguous word, characterized by comprising:

input means for entering a keyword;

a database that stores articles in each field;

search extraction means for extracting from the database the articles that include the input keyword, clustering the extracted article group, and extracting the expressions that appear biased in each cluster; and

inquiry means for having an expression selected from the expressions that appear biased in each cluster, wherein the search extraction means outputs the articles of the cluster corresponding to the expression selected by the inquiry means.

[5] The information retrieval apparatus using an ambiguous word according to claim 1, wherein a keyword is input to the input means, and the search extraction means extracts from the database the articles that include the input keyword, clusters the extracted article group, and extracts the expressions that appear biased in each cluster,

the apparatus further comprising inquiry means for having an expression selected from the expressions that appear biased in each cluster, wherein the expression selected by the inquiry means is used as the field to be input to the input means.

[6] A program that causes a computer to function as:

input means for entering a keyword and a field;

a database that stores articles in each field; and

search extraction means for extracting from the database the articles that include the input keyword and field, extracting the word group A that appears biased in the extracted article group, and outputting the articles that include the input keyword in descending order of the number of words of the word group A they contain.

[7] A program that causes a computer to function as:

input means for entering a keyword and a field;

a database that stores articles in each field; and

search extraction means for extracting from the database the articles that include both the input keyword and the field, extracting articles similar to the extracted article group B, and extracting and outputting, from the extracted similar articles, only the articles that include the input keyword.

[8] A program that causes a computer to function as:

input means for entering a keyword;

a database that stores articles in each field;

inquiry means for having an expression selected from the expressions that appear biased in each cluster; and

search extraction means for outputting the articles of the cluster corresponding to the expression selected by the inquiry means.