WO2010061537A1 - 検索装置、検索方法、及びプログラムが格納された記録媒体 - Google Patents
検索装置、検索方法、及びプログラムが格納された記録媒体 Download PDFInfo
- Publication number
- WO2010061537A1 WO2010061537A1 PCT/JP2009/005907 JP2009005907W WO2010061537A1 WO 2010061537 A1 WO2010061537 A1 WO 2010061537A1 JP 2009005907 W JP2009005907 W JP 2009005907W WO 2010061537 A1 WO2010061537 A1 WO 2010061537A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- document
- area
- frequency
- upper limit
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
Definitions
- the present invention relates to a search device, a search method, and a recording medium in which a program is stored, which receives a query representing a subset of a document set stored in a document database and outputs keywords that frequently appear in the subset. .
- a feature word search device has been developed to extract necessary information from a large amount of documents.
- the simplest way to search for feature words is to read each document in turn based on the list of input document numbers, count the number of words contained in the document, and characterize high-frequency words.
- this document reading process is random access, and it is necessary to repeatedly read document data, so that there is a problem that the search speed is slow.
- an approach of sampling a document to be read and reading only a part of the document can be considered, but this method has a problem that the accuracy is greatly lowered.
- Patent Document 1 compresses a list of words appearing in a document using a document number as a key, and stores the compressed list in a memory as document word association data.
- a search system for performing a search is disclosed. Since the search system disclosed in Patent Document 1 can refer to a string of words included in an input document list at high speed by using data on a memory, it can return related words at high speed.
- Patent Document 2 includes a search system including, as components, a frequency order index obtained by rearranging transposed indexes of words included in a document set in order of frequency, and means for receiving an inquiry about the frequency order index. Disclosure.
- the search system disclosed in Patent Document 2 receives an inquiry, it first reads the frequency order index in order from the top (in order of high frequency words). Next, the search system compares the list of document numbers for each word with the input document list, and determines the frequency of each word in the document set specified by the input document list.
- the frequency f (k) of the read k-th word is larger than the frequency in the word document set (all document sets to be searched) in the frequency order index to be read next. It ends when it becomes.
- the reading process is performed in the same order every time based on the frequency order index, sequential access of the reading process can be realized. Therefore, according to the search system disclosed in Patent Document 2, it is considered that the search speed can be improved.
- Patent Document 1 has a problem that it can handle only documents that can be stored in the memory because document word association data needs to be stored in the memory. In order to increase the amount of documents, it is necessary to greatly expand the memory capacity.
- An object of the present invention is to provide a search medium, a search method, and a recording medium storing a program, which can solve the above problems and can speed up the search for the document set even if the search target is a large document set. It is to provide.
- a search device for searching a characteristic word for a subset from a document set, A summary matrix storage unit, an area upper limit calculation unit, a word frequency calculation unit, and a document frequency reference unit,
- the summary matrix storage unit includes: When the information representing the subset included in the document set is given from the plurality of regions obtained by dividing the matrix expressing the co-occurrence relationship between the word set and the document set, Information that enables calculation or prediction of the frequency of words in each of a plurality of areas is stored as summary information, The region upper limit calculation unit When the information representing the subset is input, the relationship between the information representing the subset and the plurality of regions is examined, and from the obtained result, the summary information for each of the plurality of regions is referred to, and For each of the plurality of regions, calculate the upper limit of the frequency for the subset of the words included in each region, The word frequency calculation unit The upper limit of the frequency for each of the plurality of regions is summed for each region where
- a search method in the present invention is a search method for searching a characteristic word for a subset from a document set
- (A) A case where information is generated that is created from a plurality of areas obtained by dividing a matrix expressing a co-occurrence relationship between a word set and a document set, and represents a subset included in the document set. Storing, as summary information, information that enables calculation or prediction of the frequency of words in each of the plurality of regions, (B) When information representing the subset is input, the relationship between the information representing the subset and the plurality of areas is examined, and the summary information for each of the plurality of areas is referred to based on the obtained result.
- a recording medium storing a program is a recording medium storing a program for causing a computer to search a word characteristic of the subset from a document set,
- (A) A case where information is generated that is created from a plurality of areas obtained by dividing a matrix expressing a co-occurrence relationship between a word set and a document set, and represents a subset included in the document set.
- Steps to set (D) A search target area is obtained based on the upper limit of the frequency of words for each area where the words are common, and a set number of words is determined based on the obtained search target area. Specifying in descending order, and outputting the specified words as words characteristic of the subset.
- the search device As described above, according to the search device, the search method, and the recording medium storing the program in the present invention, it is possible to narrow down the data that must be read at the time of search even if the search target is a large document set.
- the feature word can be calculated at high speed.
- FIG. 1 It is a figure which shows the example of the "document” in this invention. It is a figure which shows an example of the "word document matrix” in this invention. It is a figure which shows the example which divided
- FIG. 14 is a flowchart specifically showing Step P12 shown in FIG. 13. It is a flowchart which shows step P1201 shown in FIG. 14 in detail. It is a figure which shows an example of a word document list.
- step P1202 shown in FIG. 14 in detail. It is a figure which shows an example of the correspondence table which a vector summary preparation part produces. It is a flowchart which shows the process in the search process performed with the search method of Embodiment 1 of this invention. It is a flowchart which shows an example of the specific example of step P21 shown in FIG. It is a flowchart which shows the other example of the specific example of step P21 shown in FIG. It is a flowchart which shows the specific example of step P2111 shown in FIG. It is a flowchart which shows the specific example of step P23 shown in FIG. It is a flowchart which shows specifically step P2303 shown in FIG.
- FIG. 2 It is a block diagram which shows the structure of the search device in Embodiment 2 of this invention. It is a flowchart which shows the process in the search process performed with the search method of Embodiment 2 of this invention. It is a figure which shows the example of the word upper limit list
- FIG. 1 is a diagram showing an example of a “document” in the present invention. In the example of FIG. 1, document numbers and texts for seven documents are shown.
- a “word” is defined as a character string cut out from the text in a document according to some standard such as morphological analysis or N-gram (character string is divided into N characters).
- words include not only words appearing in the text (for example, “apples” and “gasoline”), but also words and phrases related to meanings recognized in documents by morphological analysis and semantic analysis. (Eg, “fruit” or “fuel”) may be included.
- FIG. 2 is a diagram showing an example of a “word document matrix” in the present invention.
- “word document matrix” for the seven documents shown in FIG. 1 is shown in a black line frame.
- the word document matrix in FIG. 2 represents words appearing in the document set in each row and each document in each column. Each line indicates whether or not the corresponding word appears in each document (1 when it appears, 0 when it does not appear). For example, the word “suspect” on the first line appears in Document 1, Document 3, Document 5, and Document 7. The word “gasoline” in the second line appears in Document 2, Document 3, Document 4, and Document 6.
- the “frequency” in the document set of words is defined as the number of documents that include the word in the document set. For example, the “frequency” in the document set of the seven documents shown in FIG. Further, in the following explanation, a high-frequency word or a high-frequency word may be used for a word with a high “frequency”.
- the search device searches a set of documents (document set) Dall and further represents at least a subset of the document set Dall (document set D), for example, a list of document numbers (hereinafter referred to as input). Document list) is input. Then, the search device according to the present invention outputs a list of the top k words having a high frequency in the document set D from the word set W appearing in the document set D.
- a search device is also called an associative search engine.
- such a search device is useful for searching high-frequency words representing a concept closely related to a document set expressed by a query in a text mining system or a search engine.
- FIG. 3 is a diagram showing an example in which the word document matrix shown in FIG. 2 is divided.
- FIG. 3 shows an example in which the word document matrix for the document set shown in FIG. 1 is divided into four parts by dotted lines.
- An “area” is a word document matrix obtained by dividing a document set into a plurality of subsets and classifying the word set into a plurality of subsets in the word document matrix, and divided in the vertical and horizontal directions. Mean part.
- the number of word areas in the word document matrix WD is m
- the number of document areas is n.
- a set of areas may be represented as C, and each area may be represented as C [i] [j] ⁇ 0 ⁇ i ⁇ m, 0 ⁇ j ⁇ n ⁇ .
- the word document matrix is divided into four regions C [0] [0], C [0] [1], C [1] [0], and C [1] [1]. Yes.
- the word region means a part of the word document matrix generated by classifying only the word set into a plurality of subsets.
- a set of word areas for a certain word document matrix may be expressed as WC, and each word area may be expressed as WC [i] ⁇ 0 ⁇ i ⁇ m ⁇ .
- a word area composed of C [0] [0] and C [0] [1] is a word area composed of WC [0], C [1] [0] and C [1] [1].
- Document area means a part of a word document matrix generated by classifying only a document set into a plurality of subsets.
- a set of document areas may be expressed as DC, and each document area may be expressed as DC [j] ⁇ 0 ⁇ j ⁇ n ⁇ .
- a document area composed of C [0] [0] and C [1] [0] is composed of DC [0], C [0] [1], and C [1] [1].
- the document area can be defined as DC [1].
- region frequency for the region C [i] [j] of the word W in a document set is defined as the number of documents in which the word W appears in the region C [i] [j]. For example, in the example of FIG. 3, the region frequency in the region C [1] [0] for the word “gasoline” is 1, and the region frequency in the region C [1] [1] is 3.
- the “region frequency” of the word W in the word region WC [i] is defined as the number of documents in which the word W appears in the word region WC [i].
- the “region frequency” of the word W in the document region DC [j] is similarly defined as the number of documents in which the word W appears in the document region DC [j].
- the region frequency for the region C [i] [j] of the word W in the entire document set Dall in the related word search apparatus is “static region frequency”, and the region C [i] of the word W in the input document set D ]
- the region frequency for [j] is called dynamic region frequency.
- FIG. 4 is a block diagram showing a configuration of the search device according to Embodiment 1 of the present invention.
- the search device includes a summary matrix storage unit 1, an area upper limit calculation unit 2, a word frequency calculation unit 3, and a document frequency reference unit 4, thereby A word characteristic of the subset is retrieved from the document set.
- the summary matrix storage unit 1 stores summary information.
- the summary information is created from a plurality of areas obtained by dividing a matrix that expresses a co-occurrence relationship between the word set Wall and the document set Dall.
- the summary information is information that enables calculation or prediction of the frequency of words in each of a plurality of areas when information representing a subset D included in the document set Dall is given.
- the region upper limit calculation unit 2 checks the relationship between the information representing the subset D and a plurality of regions. Then, the area upper limit calculation unit 2 refers to the summary information for each of the plurality of areas from the obtained result, and calculates the upper limit of the frequency for the subset D of the words included in each of the plurality of areas. .
- the word frequency calculation unit 3 adds the upper limit of the frequency for each of the plurality of regions for each region where the words are common, and uses the obtained sum as the upper limit of the word frequency for each region where the words are common.
- the document frequency reference unit 4 obtains an area to be searched based on the upper limit of the word frequency for each area in which words are common. Then, the document frequency reference unit 4 identifies the set number of words in descending order based on the obtained search target area, and outputs the identified words as characteristic words in the subset D.
- the word document matrix shown in FIG. 2 can be cited as a matrix expressing the co-occurrence relationship.
- Each region includes the region C [i] [j] shown in FIG.
- the above-mentioned “word area” is an example of the area where words are common.
- the “frequency” of a word document set is the number of documents that include the word in the document set.
- the above-described input document list can be cited.
- the search device uses the input document list and the summary information created for each region in the word document matrix at the time of the search, and uses the upper limit of the frequency of words included in each region. Determine. Furthermore, the search device determines the area to be searched, that is, the area of the word to be read, by collecting the upper limit of the frequency for each word area. For this reason, according to the search device in the first embodiment, it is possible to avoid reading processing for useless word regions, and speeding up of the search is achieved.
- the search apparatus further includes a cluster creation unit 5, a region summary creation unit 6, and a word document matrix storage unit 7.
- the word document matrix storage unit 7 is a database including a list of words extracted from the document set Dall and an arbitrary data structure that holds the word document matrix or information that is semantically equivalent thereto.
- FIG. 5A is a diagram showing an example of a data structure stored in the word document matrix storage unit shown in FIG.
- FIG. 5B shows a data structure called an inverted index.
- the transposed index includes a word table and a word index.
- the word table is a table in which an identifier (word number assigned to each word) for identifying the word and the word is specified together with the corresponding word.
- the word index is an index in which the document number of a document including the word specified by the word number is specified using the word number as a key.
- the cluster creation unit 5 receives the word document matrix (the word table and the word index shown in FIGS. 5A and 5B in the first embodiment) stored in the word document matrix storage unit 7 as an input, and generates a two-dimensional document and word. Perform clustering processing. By this clustering process, a set of documents in the word document matrix is divided into a plurality of document areas, and a set of words in the word document matrix is divided into a plurality of word areas.
- the cluster creation unit 5 outputs an identifier (document region number) representing a document region including each document to the region upper limit calculation unit 2 and the region summary creation unit 6. Further, the cluster creation unit 5 outputs an identifier (word region number) representing a word region including each word to the document frequency reference unit 4 and the region summary creation unit 6.
- the two-dimensional clustering process by the cluster creation unit 5 is performed based on co-occurrence information of words and documents in the word document matrix, at least receiving a word document matrix as input.
- a clustering process for dividing a set of words into a plurality of word areas and a clustering process for dividing a set of documents into a plurality of document areas are performed.
- the “clustering process” is a process for dividing a set of objects into a plurality of subsets (referred to as “clusters”). In the clustering process, clusters are generated so that similar objects enter the same cluster, and different objects enter different clusters.
- FIG. 6A and 6B are diagrams showing examples of output by the cluster creation unit 5 shown in FIG. 1, FIG. 6A shows a word area list, and FIG. 6B shows a document area list.
- the word area list shown in FIG. 6A is a table representing the relationship between each word number and the word area number. This indicates in which word area the word specified by the word number falls in the word area number.
- the word number “2” in the second line means the word “gasoline” (see FIG. 5A), and the word “gasoline” means that the word region is “1”, that is, DC [1].
- the document area list shown in FIG. 6B is a table representing the relationship between each document number and the document area number. This indicates in which document area the document specified by the document number enters the document area number.
- the cluster creation unit 5 can specifically include any one of the following devices that realizes a two-dimensional clustering process.
- Clustering devices that implement two-dimensional clustering processing include co-clustering devices based on information theory (see Related Technology 1) and clustering devices using Non-Negative Matrix Factorization (hereinafter referred to as “NMF”) (See Related Technology 2). And a clustering apparatus using PLSA (see Related Art 3).
- NMF Non-Negative Matrix Factorization
- the above-mentioned “co-clustering device based on information theory” further inputs the number of word regions to be created and the number of document regions to be created in addition to the word document matrix.
- the clustering process by this apparatus is to minimize the difference between the mutual information amount between the word and the document before the clustering process and the mutual information amount between the word and the document after the clustering process. And classify words and documents. Due to this feature, when the word document matrix is divided into the word region and the document region, a high-density region having a high correlation and containing a lot of 1 and a low-density region having a low correlation and containing only 0 (zero) Is generated.
- FIGS. 7A and 7B are diagrams for explaining two-dimensional clustering processing by the co-clustering apparatus based on information theory.
- FIG. 7A shows the state of the word document matrix before clustering
- FIG. 7B shows the state of the word document matrix after clustering. Indicates the state.
- the number of word areas to be created and the number of document areas to be created are each set to 4.
- the degree of shading in each region indicates a ratio including “1”.
- the words and the documents are evenly distributed, and the portions that are “1” are evenly distributed in the matrix.
- the word document matrix after clustering shown in FIG. 7B is obtained by rearranging words and documents in the word document matrix before clustering for each word area number assigned by the clustering apparatus and for each document area number. Yes.
- the word document matrix after clustering has a high-density area and a low-density area. Therefore, when a certain document set is determined, which word area has a set of words highly correlated with the document set. It is clear whether it is included in
- the clustering process is performed by inputting the word document matrix and the number K of clusters to be created.
- These two clustering devices regard both words and documents as a set of concepts and classify each word and each document into K concepts. Therefore, if a cluster of concepts assigned to a word is regarded as a word region, and a cluster of concepts assigned to a document is regarded as a document region, the processing performed by these two clustering devices is also regarded as two-dimensional clustering. be able to.
- each clustering device described above is intended to divide words and documents into cluster sets of the same concept, and generates a set of regions so as to diagonalize the word document matrix. Is preferred.
- FIG. 8A and FIG. 8B are used to more intuitively explain the processing result.
- FIGS. 8A and 8B are diagrams for explaining clustering processing for diagonalizing a word matrix.
- FIG. 8A shows a state of a word document matrix before clustering
- FIG. The state of the word document matrix after clustering is shown.
- the degree of shading in each region indicates a ratio including “1”.
- the word and the document share the classification destination cluster, so the word document matrix is divided so that it is diagonalized (dark areas are aligned on the diagonal).
- processing is performed on the word document matrix so that words and documents having similar concepts are grouped in the same region, and arbitrary processing for dividing the word and the document is 2 It can be understood as a dimension clustering process.
- the region summary creation unit 6 creates summary information from a plurality of regions generated by the clustering processing of the cluster creation unit 5 and stores the summary information in the summary matrix storage unit 1.
- the summary information is information for calculating the upper limit of the dynamic region frequency of words in each region when an input document list is given, and is called a summary matrix.
- the region summary creation unit 6 refers to the word index in the word document matrix storage unit 7 from the word region list and document region list output from the cluster creation unit 5, and summarizes the matrix. Create The region summary creation unit 6 can create, for example, the summary matrix shown in FIG. 9 or the summary matrix shown in FIG.
- FIG. 9 is a diagram showing an example of the summary matrix in the first embodiment.
- the maximum value of the static area frequency of the words in each area is used as information for calculating the upper limit of the dynamic area frequency for each area of the word document matrix divided by 4 ⁇ 4. Has been identified.
- the first line of this summary matrix is “48, 1, 0, 7”. This is because words included in the word area WC [0] are present in a maximum of 48 documents in the document area DC [0], are present in a maximum of 1 document in the document area DC [1], and are maximum 0 in the DC [2]. This means that there is no document, that is, one document, and there are seven documents at the maximum in the document area DC [3].
- FIG. 10 is a diagram showing another example of the summary matrix in the first embodiment.
- the example of FIG. 10 for each area of the word document matrix divided by 4 ⁇ 4, whether or not a word appears in the document in each area as summary information for calculating the upper limit of the dynamic area frequency.
- a bit string indicating whether or not is specified.
- bit string ⁇ 1, 1, 1, 1,... ⁇ Is specified for the region C [0] [0].
- This bit string means that a word in the word area WC [0] may appear in at least the first, second, third, and fourth documents in the document area DC [0].
- bit string of ⁇ 0, 1, 0, 0. This bit string means that a word in the word area WC [1] may appear in at least the second document in the document area DC [0].
- the region summary creation unit 6 that creates the summary matrix shown in FIG. 9 is referred to as a “maximum value summary creation unit”.
- the maximum value summary creation unit 6 obtains at least the frequency (static region frequency) of the word included in each region for each of the plurality of regions generated by the division by the clustering process. Further, the maximum value summary creation unit 6 specifies the maximum value of the obtained static area frequency. Specifically, the maximum value summary creation unit 6 uses the word area list and the document area list input by the cluster creation unit 5. Further, the maximum value summary creation unit 6 reads the word document matrix portion corresponding to each region from the word document matrix storage unit 7 and calculates the maximum value of the static region frequency of the word (see FIG. 9).
- the region summary creation unit 6 that creates the summary matrix shown in FIG. 10 is referred to as a “vector summary creation unit”.
- the vector summary creation unit 6 obtains a bit string that represents at least whether a word in the region is included in the document in each region.
- the vector summary creation unit 6 uses the word area list and document area list input by the cluster creation unit 5.
- the vector summary creation unit 6 reads the word document matrix portion corresponding to each region from the word document matrix storage unit 7 and checks whether each document in the region includes any word in the region. Then, the vector summary creation unit 6 creates a bit string (see FIG. 10) having the same length as the number of documents in the area based on the examination result.
- the summary matrix created by the maximum value summary creation unit and the vector summary creation unit in this way is stored in the summary matrix storage unit 1. Thereafter, it is used for search processing by the area upper limit calculation unit 2, the word frequency calculation unit 3, and the document frequency reference unit 4.
- the area upper limit calculation unit 2 is called when at least the input document list is input to the search device from an input device operated by the user or an external module. Then, the region upper limit calculation unit 2 refers to the input document list, the document region list passed from the cluster creation unit 5, and the information in the summary matrix, and dynamically determines the words included in each region of the word document matrix. Calculate the upper limit of the region frequency. Further, the area upper limit calculation unit 2 generates an area upper limit matrix from the calculated upper limit, and outputs this and the word area list to the word frequency calculation unit 3.
- FIG. 11 is a diagram illustrating an example of the area upper limit matrix according to the first embodiment.
- the value in the region C [0] [0] is “19”. This means that the words included in the word area WC [0] appear only 19 times in the document set of the document area [0] in the input document list.
- the area upper limit calculation unit 2 first compares the input document list and the document area list (FIG. 6B), and checks which document area each document in the input document list enters. Then, the area upper limit calculation unit 2 counts the number of documents in the input document list included in the area for each document area (number of document inputs). Next, the region upper limit calculation unit 2 refers to the summary matrix shown in FIG. 9 for each region, and obtains the maximum value of the static region frequency of words in each region. Then, the area upper limit calculation unit 2 compares the maximum value of the static area frequency with the number of document inputs for each area, and calculates the smaller one as the upper limit.
- the area upper limit calculation unit 2 compares the input document list and the document area list, checks which document area each document in the input document list enters, and then inputs the input document for each document area. It is also possible to create a bit string (input bit string) indicating whether or not a document in the list is included.
- the area upper limit calculation unit 2 refers to the summary matrix shown in FIG. 10 for each area, performs an AND operation on the generated bit string and the bit string of the summary matrix, and “1” of the bit string obtained as a result. "Is calculated as the upper limit.
- the word frequency calculation unit 3 adds the region upper limit matrix output from the region upper limit calculation unit 2 for each word region, and outputs the sum value for each word region as a word upper limit list. For example, if the area upper limit matrix shown in FIG. 11 is input and is added in the horizontal direction (for each word area), the word upper limit list shown in FIG. 12 is obtained.
- FIG. 12 is a diagram showing an example of the word upper limit list according to the first embodiment.
- the word upper limit list is obtained by calculating the upper limit value of the dynamic area frequency for each word area from the document area list and the input document list.
- the word upper limit list shown in FIG. 12 indicates the upper limit value of the dynamic area frequency for each word area calculated for a certain input document list. Specifically, in FIG. 12, for example, the upper limit value is “27” in the word region WC [0]. This means that words in the word area WC [0] appear only 27 times in the input document.
- the document frequency reference unit 4 receives the word upper limit list output from the word frequency calculation unit 3 and the word area list output from the cluster creation unit 5 based on the word upper limit list. Narrow down the word area. Further, the document frequency reference unit 4 refers to the words in each word region and the document list based on the word region list, identifies the top k words having the highest frequency (document frequency), and outputs this. .
- the document frequency reference unit 4 is a means for referring to a word in a certain word region and its document list. In the first embodiment, other means can be used as the document frequency reference unit 4.
- FIGS. 4 to 12 a search method according to the first embodiment of the present invention will be described with reference to FIGS.
- the search method according to the first embodiment is implemented by operating the search device according to the first embodiment shown in FIG. Therefore, in the following description, the operation of the search device according to the first embodiment will be described with reference to FIGS. 4 to 12 as appropriate.
- the clustering process is performed by the cluster creation unit 5 and the region summary creation unit 6.
- the search process P20 is performed by the area upper limit calculation unit 2, the word frequency calculation unit 3, and the document frequency reference unit 4.
- the clustering process P10 and the search process P20 may be performed continuously or separately. Below, these are demonstrated in order.
- FIG. 13 is a flowchart showing processing in the clustering process performed by the search method according to the first embodiment of the present invention.
- the clustering process P10 is started by an administrator (user) of the search device or an external module.
- the cluster creation unit 5 reads the word document matrix stored in the word document matrix storage unit 7 and performs a two-dimensional clustering process (step S1). P11). As a result, the word area list (see FIG. 6A) and the document area list (see FIG. 6B) are output to the area summary creation unit 6. At least the document area list is output to the area upper limit calculation unit 2. In step P11, the cluster creation unit 5 may output the word area list to the document frequency reference unit 4.
- the region summary creation unit 6 uses the word region list and document region list passed from the cluster creation unit 5. Furthermore, the region summary creation unit 6 refers to the word document matrix storage unit 7, creates a summary matrix (see FIG. 9 or FIG. 10) for each region, and stores this in the summary matrix storage unit 1 (step P12). . After execution of step P12, the clustering process P10 ends.
- step P12 will be described more specifically with reference to FIG.
- FIG. 14 is a flowchart specifically showing Step P12 shown in FIG.
- step P12 a region C [i] [j] (0 ⁇ i ⁇ m, 0 ⁇ j ⁇ n) created based on the input word region list and document region list.
- steps P1201 to P1203 are executed.
- the region summary creation unit 6 extracts only the part of the word document matrix corresponding to the region C [i] [j] from the word document matrix storage unit 7 (step P1201).
- the area summary creation unit 6 creates a summary matrix (summary information) for the part extracted in step P1201 (step P1202).
- the region summary creation unit 6 stores the summary matrix created in P1202 in a region corresponding to the region C [i] [j] in the summary matrix storage unit 1 (P1203).
- step P1201 shown in FIG. 14 will be described more specifically with reference to FIG.
- FIG. 15 is a flowchart showing in detail Step P1201 shown in FIG.
- the step P1201 inputs the word area number i and the document area number j and refers to the word document matrix storage unit 7 so that the part of the word document matrix corresponding to the area C [i] [j] is represented in the word document list. Is output as
- the area summary creation unit 6 first creates a word document list for output in an empty state (P12010). In other words, the area summary creation unit 6 initializes an output word list.
- the region summary creation unit 6 extracts a list of word numbers whose word region number is equal to i from the word region list (step P12011). Subsequently, the region summary creation unit 6 performs the following processes from Step P12012 to P12015 for each word number s in the extracted word number list.
- the region summary creation unit 6 refers to the word index in the word document matrix storage unit 7 and reads the document list (step P12012).
- the area summary creation unit 6 adds the word number s and the set of document numbers extracted in P12013 to the word document list (step P12014).
- the area summary creation unit 6 determines whether or not the processing has been completed for all the words extracted in Step P12011 (Step P12015). As a result of the determination, if not completed, the region summary creation unit 6 executes Steps P12012 to P12014 again. On the other hand, if the processing has ended, the region summary creation unit 6 ends the processing.
- the “word document list” is a list of word numbers of words in the region C [i] [j] and document numbers of documents in the region C [i] [j] in which the word appears (in the region Document list).
- FIG. 16 is a diagram illustrating an example of a word document list. In the example of FIG. 16, it is assumed that the word areas WC [0] include the words with the word numbers 1, 3, and 5 and the document area DC [0] includes the documents with the document numbers 1, 2, 3, and 4. The word document list created for the area C [0] [0] is shown.
- step P1202 shown in FIG. 14 will be specifically described.
- the maximum value summary creation unit checks the in-region document list for each word number based on the word document list output in P1201, and the number of document numbers included in the in-region document list ( That is, the static area frequency) is calculated and the maximum value is output.
- FIG. 17 is a flowchart showing in detail Step P1202 shown in FIG.
- the vector summary creation unit performs steps from P12021 to P12024 for each region C [i] [j], and outputs the created bit string B.
- bit string B created for each region C [i] [j] has the following meaning.
- Each element in the bit string B corresponds to each document in the area C [i] [j], and each element value has one or more words in the area C [i] [j]. This means whether there is a possibility (“1”) or not (“0”).
- the correspondence table is a table indicating what number each document in the document area DC [j] is in the document area DC [j].
- FIG. 18 is a diagram illustrating an example of a correspondence table created by the vector summary creation unit. As shown in FIG. 18, for example, the document with the document number “1” is shown to be the first element. It can also be seen that the document with the document number “10” is the fifth element.
- the vector summary creation unit creates a vector V having the same length as that of the correspondence table and all elements being “0 (zero)” (P12022). Further, the vector summary creation unit counts how many times the document number q appears in the word document list when the position of each document number q in the correspondence table is “r”, The counted value is stored in the “r” th of (P12023). Thereby, the vector V becomes a vector representing how many times “1” appears for each document in the region C [i] [j].
- the vector summary creation unit converts all values of “1” or more in the vector V to “1”, and creates a bit string B in which the other values are “0 (zero)” (step P12024).
- step P12024 a process of “converting all values of 1 or more to“ 1 ”and setting the others to“ 0 ”” is performed on the vector V. Instead, the following step P12024 ′ is performed. May be done.
- step P12024 ′ all the values of the vector V that are greater than or equal to the threshold value ⁇ are converted to “1”, and other values are set to “0” to create a bit string.
- the threshold value ⁇ is set in advance by the administrator of the search device.
- the value of each element of the bit string B created in step P12024 ′ is likely to have a word in the same area C [i] [j] (“1”) or not (“0”). )).
- FIG. 19 is a flowchart showing processing in a search process performed by the search method according to Embodiment 1 of the present invention.
- the search process P20 is started when the user or an external program (module) inputs the input document list to the area upper limit calculation unit 2.
- the area upper limit calculation unit 2 calculates the upper limit of the dynamic area frequency for each area and outputs an area upper limit matrix (step P21).
- the word frequency calculation unit 3 sums the area upper limit matrix in the word area direction and outputs a word upper limit list (step P22).
- the document frequency reference unit 4 refers to the word document matrix storage unit 7 using the word upper limit list as an input. Further, the document frequency reference unit 4 refers to the document list while narrowing down the word area, and outputs the top k words with high frequency for the input document list (step P23). Specifically, in step P23, the document frequency calculation unit 4 specifies a characteristic word in the input document while referring to the frequency (document frequency) of each word. After execution of Step P23, the process in the search process P20 ends.
- Step P21 is a process in which the area upper limit calculation unit 2 receives the input document list and outputs an area upper limit matrix CMax [i] [j] ⁇ 0 ⁇ i ⁇ m, 0 ⁇ j ⁇ n ⁇ .
- a process (P210) executed by the area upper limit calculation unit 2 referring to the summary matrix shown in FIG. 9 will be described with reference to FIG.
- FIG. 20 is a flowchart illustrating an example of a specific example of Step P21 illustrated in FIG.
- the area upper limit calculation unit 2 examines the document area list output by the cluster creation unit 5 based on the document number in the input document list, and determines how many in each document area DC [j]. Is counted (step P2101). This output (the counted value) is defined as DCF [j] ⁇ 0 ⁇ j ⁇ n ⁇ . DCF [j] means the number of documents included in the document area DC [j] among the documents in the input document list.
- the region upper limit calculation unit 2 refers to the summary matrix storage unit 1 for each region of the word document matrix and executes Steps P2102 to P2105. Specifically, the region upper limit calculation unit 2 refers to the summary matrix storage unit 1 for each region C [i] [j] ⁇ 0 ⁇ i ⁇ m, 0 ⁇ j ⁇ n ⁇ , The maximum value of the target area frequency is extracted (P2102). Note that the maximum value of the static area frequency of words for the area C [i] [j] is sMax.
- the region upper limit calculation unit 2 compares the magnitude relationship between sMax and DCF [j] (P2103). If sMax> DCF [j] is established as a result of the comparison, the region upper limit calculation unit 2 substitutes DCF [j] for CMax [i] [j] (P2104). This is because if there are only DCF [j] documents in the area C [i] [j] (documents in the document area DC [j]) in the input document list, DCF [j] times in the area. This is because there cannot be many words that appear.
- sMax is substituted into CMMax [i] [j] (P2105). This is because if the static area frequency of the words in the area C [i] [j] is at most sMax times, there cannot be a word that appears more than sMax times in the area.
- the area upper limit calculation unit 2 determines whether or not the processing of Steps P2102 to P2105 has been completed for all areas (Step P2106).
- the area upper limit calculation unit 2 executes Steps P2102 to P2105 again when it is not completed as a result of the determination, and ends the process when it is completed.
- FIG. 21 is a flowchart showing another example of the specific example of step P21 shown in FIG.
- the area upper limit calculation unit 2 examines the document area list output by the cluster creation unit 5 based on the document number in the input document list, and sets n for each document area DC [j]. Pieces of bit strings are created (P2111). These n bit strings are defined as dynamic bit string DCB [j] ⁇ 0 ⁇ j ⁇ n ⁇ .
- the dynamic bit string DCB [j] has the length of the number of documents included in the document area DC [j], and each element of the dynamic bit string DCB [j] may include each document in the document area DC [j] in the input document list. It represents a binary value of whether there is ("1") or not ("0").
- the area upper limit calculation unit 2 executes processes P2112 to P2113 for each area C [i] [j]. Specifically, the region upper limit calculation unit 2 refers to the summary matrix storage unit 1 for each region C [i] [j] ⁇ 0 ⁇ i ⁇ m, 0 ⁇ j ⁇ n ⁇ , and summarize matrix storage unit 1 The bit string B stored therein is extracted (step P2112).
- the area upper limit calculation unit 2 performs an AND operation on the two bit strings of the bit string B and the dynamic bit string DCB [j], and calculates the number of bits for which the operation result is “1” as CMax [i] [j. ] (Step P2113).
- the region C [i] [j] a set of documents that may contain words in the word region WC [i], which is meant by the bit string B, and an input document that is meant by DCB [j]
- a logical product with a set of documents included in the list is taken. This means that the upper limit of the dynamic region frequency of the words in the target region is estimated.
- the area upper limit calculation unit 2 determines whether or not the processing of Steps P2112 to P2113 has been completed for all areas (Step P2114).
- the area upper limit calculation unit 2 executes Steps P2112 to P2113 again when the determination result shows that the process has not ended, and ends the process when the process has ended.
- Step P2111 is a step in which the area upper limit calculation unit 2 outputs the dynamic bit string DCB [j] ⁇ 0 ⁇ j ⁇ n ⁇ with the input document list and the document area list as inputs, as described above.
- the area upper limit calculation unit 2 executes the processes from P21111 to P21114 for each document area DC [j] ⁇ 0 ⁇ j ⁇ n ⁇ to create a bit string DCB [j] and output it. To do.
- the area upper limit calculation unit 2 refers to the correspondence table based on each document number obtained in step P21113, checks the position r, and sets the r-th bit in the bit string DCB [j] to “1”. Change (step P21114). A process is complete
- Step P22 is a process in which the word frequency calculation unit 3 outputs the word upper limit list with the area upper limit matrix as an input, as described above.
- the word frequency calculation unit 2 The calculation process expressed by Equation (1) is performed.
- the word frequency calculation unit 2 sums up the frequency of each word area for each word area by the following formula (1), and calculates the upper limit of the document frequency for the input document list. As a result, the word Output as upper limit list.
- step P23 shown in FIG. 19 will be described in detail.
- the document frequency reference unit 4 uses the word upper limit list as an input, and calculates the top k words with high frequency for the input document list while narrowing down the word area. It is.
- FIG. 23 is a flowchart showing a specific example of step P23 shown in FIG.
- the document frequency reference unit 4 first sets the variable q to 1 and initializes the output word list W with empty (step P2301).
- the word list W is a list in which the top k words having the highest document frequency for the document set specified by the input document list and the frequency (document frequency) are stored.
- the document frequency reference unit 4 refers to the word upper limit list and extracts the word area number X of the word area whose upper limit value is the qth largest (step P2302). Further, the document frequency reference unit 4 refers to the frequency (document frequency) of each word in the word region WC [X] based on the word region list acquired from the cluster creation unit 5 (step P2303).
- the document frequency reference unit 4 acquires a list of word numbers in the word area WC [X] based on the word area list, and obtains a document list corresponding to each unit number. Extract from the word index in the word document matrix. Then, the document frequency reference unit 4 compares the extracted document list with the input document list to check the frequency (document frequency) with respect to the input document list, and refers to the checked frequency to increase the frequency. k words are extracted and the word list W is updated (P2303).
- the document frequency reference unit 4 compares the frequency of the k-th word in the word list W with the upper limit value q + 1 of the upper limits of the word upper limit list, and checks the magnitude relationship between the two (P2304). As a result of the comparison, when the frequency of the k-th word in the word list W is larger than the q + 1th largest upper limit value, the document frequency reference unit 4 transfers the word in the word list W to an external display device or external program. Output (P2305), and the process ends. This is because in this case, the other word area does not include a word having a frequency higher than that of the current k-th word.
- Step P2302 is executed.
- step P2303 will be described in detail with reference to FIG.
- FIG. 24 is a flowchart specifically showing Step P2303 shown in FIG.
- the word area number X is input.
- the document frequency reference unit 4 checks the word region list acquired from the cluster creation unit 5 based on the input word region number X, and lists the word numbers of words that enter the word region WC [X]. Is created (step P23031).
- the document frequency reference unit 4 refers to the word table and the word index in the word document matrix storage unit 7 for each word number obtained in step P23031, and sets a pair of the word itself and the document list. Extract (step P23032).
- the document frequency reference unit 4 compares the document list with the input document list using the pair of the word and the document list obtained in step P23032, and counts the number of document numbers (document frequency) included in both. Then, a pair of word and document frequency is created and added to the word list W (step P23033).
- the document frequency reference unit 4 sorts the word list W in descending order of the document frequency, and deletes a pair of words having a frequency of k + 1 or less and the document frequency from the word list W (step P23034).
- step P23034 the processing in the document frequency reference unit 4 ends.
- step P2303 the word area list acquired from the cluster creation unit 5 is used, and the pair of the word in the area and the document list is referred to from the word area number. .
- information in the word document matrix storage unit 7 is referred to.
- this Embodiment 1 is not limited to this,
- region list may not be used but another arbitrary data may be referred.
- the data in the word document matrix storage unit 1 is divided in advance for each word region created by the cluster creation unit 5. Then, using the word area number as a key, an integrated data storage unit is prepared that can read the words and document list in each word area together. Then, in the process of step P23, the integrated data storage unit may be referred to and a word in a certain word area and a document list may be read together.
- the search target area that is, the word area to be read is determined, and therefore, useless words. Reading processing for the area is avoided, and speeding up of the search is achieved.
- the two-dimensional clustering process is performed on the word document matrix in advance, it is possible to efficiently narrow down the search.
- the area upper limit matrix can be created at high speed. Therefore, a word region to be read at a high speed is determined, and a reading process for a useless word region is avoided, so that the search can be further speeded up.
- the region upper limit matrix is created more strictly. For this reason, it becomes possible to narrow down more word areas, and it is possible to narrow down more unnecessary word areas, so that it is possible to further speed up the search.
- the program according to the first embodiment of the present invention may be a program that causes a computer to execute the steps shown in FIGS. 13 to 17 and FIGS. 19 to 24.
- a CPU central processing unit
- the search device according to the first embodiment is realized, and the search method according to the first embodiment can be executed.
- a CPU central processing unit of the computer functions as the cluster creation unit 5, the region summary creation unit 6, the region upper limit calculation unit 2, the word frequency calculation unit 3, and the document frequency reference unit 4, and performs processing.
- the summary matrix storage unit 1 and the word document matrix storage unit 7 store these data files in a storage device such as a hard disk provided in the computer, or This can be realized by mounting a recording medium storing a data file on a reading device connected to a computer.
- FIG. 25 is a block diagram showing a configuration of the search device according to Embodiment 2 of the present invention.
- the search device includes a plurality of cluster processing units 10, a cluster processing expansion unit 8, and a cluster processing selection unit 9.
- the search device shown in FIG. This is different from the search device in the first embodiment.
- the cluster creation unit 5, the region summary creation unit 6, the summary matrix storage unit 7, the region upper limit calculation unit 2, the word frequency calculation unit 3, and the document frequency reference unit 4 include a plurality of cluster processes.
- Each unit 10 is provided for each unit.
- the cluster processing expansion unit 8 When the cluster processing expansion unit 8 receives information representing a subset, that is, an input document list from a user or an external program, the cluster processing expansion unit 8 inputs the input information to the area upper limit calculation unit 2 of each cluster processing unit 10. To do.
- the cluster processing selection unit 9 receives the upper limit of the word frequency for each word region set by the word frequency calculation unit 3 of each of the plurality of cluster processing units 10. In addition, the cluster processing selection unit 9 selects at least one of the plurality of cluster processing units 10 for use in subsequent processing based on the distribution of the upper limit of the frequency of each received word. Then, the cluster processing selection unit 9 causes only the document frequency reference unit 4 of the selected cluster processing unit 10 to pass the upper limit set by the selected word frequency calculation unit 3 of the cluster processing unit 10 and perform processing.
- each of the plurality of cluster processing units 10 execute different clustering processes. As described in Embodiment 1, there are a plurality of types of algorithms in the two-dimensional clustering process. Furthermore, in the two-dimensional clustering process, it is necessary to set the number of word regions and the number of document regions for processing.
- each cluster processing unit 10 can execute a plurality of types of two-dimensional clustering processes in which these parameters are changed.
- the number of cluster processing units 10 is referred to as the number of cluster types.
- Each cluster processing unit 10 is called by an administrator or an external program. Each cluster processing unit 10 receives the word document matrix as an input, and when the input document list is given for each region in the word document matrix in the summary matrix storage unit 1, the dynamic region frequency of the words in the region is determined. Summary information (summary matrix) that can calculate the upper limit or its predicted value is output.
- Each cluster processing unit 10 is called by the cluster processing expansion unit 8 and when an input document list is input from the cluster processing expansion unit 8, an upper limit of the word frequency for each word region is set, and a cluster processing selection unit The word upper limit list is output to 9. Further, as described above, one of the cluster processing units 10 is called by the cluster processing selection unit 9 with the word upper limit list as an input, and the internal document frequency reference unit 4 has a high frequency in the input document list. Read k words and output to user or external program.
- FIGS. 25 a search method according to the second embodiment of the present invention will be described with reference to FIGS.
- the search method according to the second embodiment is implemented by operating the search device according to the second embodiment shown in FIG. Therefore, in the following description, the operation of the search device according to the second embodiment will be described with reference to FIG. 25 as appropriate.
- the clustering process P10 ′ is realized in each cluster processing unit 10 by the cluster creation unit 5 and the area summary creation unit 6 performing the clustering process P10.
- FIG. 26 is a flowchart showing processing in a search process performed by the search method according to Embodiment 2 of the present invention.
- the search process P20 ′ is started when a user or an external program inputs an input document list to the cluster processing development unit 8.
- step P21 ′ when an input document list is input from the outside, the cluster processing expansion unit 8 passes the input document list to each cluster processing unit 10 (step P21 ′). In step P21 ′, so-called query expansion is performed.
- each cluster processing unit 10 creates an upper limit list using the internal region upper limit calculation unit 2 and the word region calculation unit 3, and outputs it (step P22 '). Specifically, in step P22 ′, in each cluster processing unit 10, the region upper limit calculation unit 2 executes step P21 (see FIG. 19), and the word region calculation unit 3 performs step P22 (see FIG. 19). Execute. The processing of steps P21 and P22 may be performed simultaneously in each cluster processing unit 10.
- step P23 ′ the cluster processing selection unit 9 examines the word upper limit list output by each cluster processing unit 10, selects one or more cluster processing units, and selects the selected cluster processing unit 10
- the document frequency reference unit 4 is called (step P23 ′).
- the called document frequency reference unit 4 reads this while narrowing down the document list for the words in each word area based on the result of step P22 ′. Then, the document frequency reference unit 4 identifies k words having a high document frequency in the input document list and outputs them (step P24 ′). In step P24 ′, the document frequency calculation unit 4 of the selected cluster processing unit 10 specifies a characteristic word in the input document while referring to the frequency (document frequency) of each word. After the execution of step P24 ′, the process in the search process P20 ′ ends.
- Step P23 ′ a cluster processing unit 10 having a word upper limit list with the highest reading efficiency may be selected by inputting a plurality of word upper limit lists.
- the determination as to whether reading efficiency is good can be made, for example, by calculating the skewness of the distribution of the word upper limit list.
- the cluster process selection unit 9 calculates the skewness for each word upper limit list.
- the skewness Sk is expressed by the following equation (2).
- ⁇ and ⁇ in the following formula (2) can be calculated by the following formula (3) and formula (4), respectively.
- ” in the following formulas (2) to (4) means the number of words in each word area.
- is a value obtained by dividing the number of words in the total word set W by m.
- the cluster processing selection unit 9 selects the word upper limit list having the largest skewness Sk and calls the document frequency reference unit 4 that outputs the word upper limit list. As a result, the cluster processing unit 10 having the word upper limit list with the highest reading efficiency is selected, and the word is specified.
- the cluster processing selection unit 9 selects a single cluster processing unit 10 having a large skewness, but the second embodiment is not limited to this mode.
- the cluster processing selection unit 9 selects a plurality of cluster processing units 10, causes the document frequency reference unit 4 to perform processing within each cluster processing unit 10, and finishes the processing earliest. Only the word frequency reference unit 4 can output k words.
- a plurality of cluster processing units 10 may be realized by separate computers, and processing may be performed in parallel.
- the second embodiment since a plurality of types of two-dimensional clustering algorithms are used simultaneously, it is effective when an input document list that is difficult to narrow down is given. That is, even in such a case, the best word upper limit list is output by a plurality of types of cluster processing units 10, and narrowing down is performed using this. According to the second embodiment, even in such a case, speeding up of the search is achieved.
- the program according to the second embodiment of the present invention may be a program that causes a computer to execute Step P21 'to Step P24' shown in FIG.
- the search device according to the second embodiment is realized, and the search method according to the second embodiment can be executed.
- the CPU (central processing unit) of the computer is a cluster creation unit 5, a region summary creation unit 6, a region upper limit calculation unit 2, a word frequency calculation unit 3, and a document frequency reference unit 4 in each cluster processing unit 10. Functions and processes. Further, the CPU of the computer also functions as a cluster processing expansion unit 8 and a cluster processing selection unit 9 to perform processing.
- the program in the second embodiment may be a program that causes a plurality of computers to execute Step P21 ′ to Step P24 ′ shown in FIG.
- the CPUs of the computers function as separate cluster processing units 10 and the processes are executed in parallel.
- the summary matrix storage unit 1 and the word document matrix storage unit 7 store data files constituting them in a storage device such as a hard disk provided in the computer, or This can be realized by mounting the recording medium storing the data file in a reading device connected to the computer.
- the operation of the search device in the first embodiment will be described using a specific example.
- the area summary creation unit 6 functions as a maximum summary creation unit that creates the summary matrix shown in FIG.
- the clustering process P10 and the search process P20 in the first embodiment will be described.
- the cluster creation unit 5 reads the word document matrix shown in FIG. 2 and performs a two-dimensional clustering process (step P11). As a result, the word area list shown in FIG. 6A and the document area list shown in FIG. 6B are output. Thereby, logically, the area shown in FIG. 3 is created. However, in the example of FIG. 3, only a division into a part of the word document matrix is shown, but in actuality, it is assumed that the word document is divided into 4 areas ⁇ 4 document areas.
- the maximum summary creation unit executes a summary matrix creation process (step P12) based on the word area list and the document area list. Specifically, the maximum summary creation unit executes steps P1201 to P1203 and stores the information shown in FIG. 9 in the summary matrix storage unit 1.
- words and documents having the same tendency are collected by clustering processing, and an efficient summary matrix is created.
- the area upper limit calculation unit 2 first refers to the document area list shown in FIG. 6B for the documents in the input document list. Thereby, the area upper limit calculation unit 2 outputs the number of documents included in the input document as DCF for each of the four document areas (step P2101).
- the DCFs in this case are assumed to be ⁇ 19, 35, 3, 2 ⁇ in order.
- the area upper limit calculation unit 2 performs the processing of steps P2102 to P2106 and outputs an area upper limit matrix.
- the region upper limit matrix at this time is shown in FIG.
- the word frequency calculation unit 3 creates the word upper limit list shown in FIG. 12 (step P22).
- step P23 the document frequency reference unit 4 performs the process of step P23 (see FIG. 19) based on the word upper limit list of FIG.
- step P2302 the process of step P2302 (see FIG. 23) is performed on the word region WC [1] having the largest upper limit in the word upper limit list.
- step P2304 the third highest document frequency 25 in the word list W is compared with the upper limit value 27 of the word area WC [0] having the second largest upper limit in the word upper limit list of FIG. Is done. In this case, since the third highest document frequency 25 in the word list W is smaller, “q” is updated to 2 (step P2307), and the process of step P2302 is performed again.
- step P2304 is performed again, and the third highest document frequency 25 in the word list W is compared with the upper limit value 3 of the word region WC [3] having the third largest upper limit, and step P2305 is compared. Is executed. This is because words in other word areas that have not been examined at this time have a document frequency of only 6, and it can be determined that further reference processing is wasted. Therefore, the document frequency reference unit 4 outputs three words “crime”, “self”, and “death” in step P2305, and ends the process.
- the search device of the first embodiment it is possible to narrow down the word area to be read at the time of search, and the search speed can be increased.
- the region summary creation unit 6 functions as a vector summary creation unit that creates the summary matrix shown in FIG.
- the premise for explaining the operation example in the second embodiment is the same as that in the first embodiment, and it is assumed that the information shown in FIGS. 5A and 5B is held in the word document matrix storage unit 7.
- parameter k 3.
- step P11 in the clustering process P10 is the same as that in the first embodiment, it is omitted, and here, step P12 is focused.
- step P12 after the processing in step P1201, unlike the first embodiment using the maximum summary creation unit, the process P1202 (see FIG. 17) using the vector summary creation unit is executed. The As a result, the information shown in FIG. 10 is stored in the summary matrix storage unit 1.
- step P21 the area upper limit calculation unit 2 executes a process (P211 (see FIG. 21)) different from that in the first embodiment.
- a dynamic bit string DCB [j] ⁇ 0 ⁇ j ⁇ 4 ⁇ having a length of 4 is created from the input document list in step P2111.
- DCB [3] ⁇ 1, 1, 0, 1, 0.
- the area upper limit calculation unit 2 performs the processes of steps P2112 to P2114, but here, only the process for the area C [0] [3] is taken up.
- the area upper limit calculation unit 2 reads the bit string B from the summary matrix storage unit 1 in P2112.
- the bit string B at this time is ⁇ 0, 0, 1, 0, 1... ⁇ Shown in the upper right cell of FIG.
- this bit string B and the dynamic bit string DCB [3] are ANDed, ⁇ 0, 0, 0, 0, 0... ⁇ Is obtained, and CMax [0] [3] can be estimated smaller.
- the region upper limit matrix obtained by this processing is shown in FIG.
- Step P2303 the upper limit can be set to 20 for the word area WC [0], and the reference process for the word area WC [0] is performed during the document frequency reference process of step P23. (Step P2303) can be omitted.
- the search device uses the maximum value of the static area frequency in the area as the summary matrix. This is based on the information that the words included in the region appear at most X times, and the region upper limit calculation unit 2 narrows down to increase the speed.
- the first embodiment is effective when X is sufficiently small, but X may become large depending on the nature of the document set.
- the search device when the upper limit value of the dynamic region frequency is calculated for each region, the document set that actually includes words in the region is compared with the input document. As a result, when X is large and the distribution of the document set and the input document is different, the upper limit value in the region can be estimated small.
- step P23 ′ the operation of the search device in the second embodiment will be described.
- the operation will be described focusing on the processing of step P23 ′ (see FIG. 26).
- FIGS. 27A and 27B are diagrams showing examples of word upper limit lists obtained in the second embodiment, and FIGS. 27A and 27B show word upper limit lists obtained by different two-dimensional clustering processes.
- FIG. 27A shows a result obtained by a 4 ⁇ 4 two-dimensional clustering process.
- FIG. 27B shows a result obtained by the 5 ⁇ 5 two-dimensional clustering process.
- step P24 ′ should be performed on the result of FIG. 27A.
- the cluster processing selection unit 9 calculates the skewness for each cluster processing unit 10. The skewness with respect to the result of FIG. Similarly, the skewness with respect to the result of FIG. Thereby, the cluster processing selection unit 9 can select the cluster processing unit 10 that has output FIG. 27A and execute the document frequency reference processing (step P24 ′).
- a plurality of different types of cluster processing units 10 are provided. Therefore, it is possible to select the cluster processing unit 10 that narrows down the most word regions by the input document list from the word upper limit list output from each, and the selected cluster processing unit 10 can further speed up the search.
- the present invention is not limited to the above-described embodiments, and various modifications can be made without departing from the gist of the present invention already described.
- the present invention has been described as a hardware configuration, but the present invention is not limited to this.
- the present invention can also realize arbitrary processing by causing a CPU (Central Processing Unit) to execute a computer program.
- the computer program can be provided by being recorded on a recording medium, or can be provided by being transmitted via the Internet or another transmission medium.
- the recording medium includes, for example, a flexible disk, hard disk, magnetic disk, magneto-optical disk, CD-ROM (Compact Disc Read Only Memory), DVD (Digital Versatile Disc), BD (Blu-ray (registered trademark) Disc) ROM (Read Only Memory) cartridge, RAM (Random Access Memory) memory cartridge with battery backup, flash memory cartridge, nonvolatile RAM cartridge, and the like.
- the communication medium includes a wired communication medium such as a telephone line, a wireless communication medium such as a microwave line, and the like.
- the present invention can be applied to a search engine called an associative search engine, and is useful for searching a high-frequency word representing a concept closely related to a document set expressed by a query in a text mining system or a search engine. is there.
- the present invention has industrial applicability.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
要約マトリクス記憶部と、領域上限算出部と、単語頻度算出部と、文書頻度参照部とを備え、
前記要約マトリクス記憶部は、
単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶し、
前記領域上限算出部は、
前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出し、
前記単語頻度算出部は、
前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定し、
前記文書頻度参照部は、
前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力する、
ことを特徴とする。
(a)単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶するステップと、
(b)前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出するステップと、
(c)前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定するステップと、
(d)前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力するステップとを、有することを特徴とする。
前記コンピュータに、
(a)単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶するステップと、
(b)前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出するステップと、
(c)前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定するステップと、
(d)前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力するステップとを、実行させることを特徴とする。
以下、本発明の実施の形態1における検索装置、検索方法及びプログラムについて、図4~図24を参照しながら説明する。最初に、本実施の形態1における検索装置の構成について図4~図12を用いて説明する。図4は、本発明の実施の形態1における検索装置の構成を示すブロック図である。
[関連技術2]W.Xu, X.Liu and Y.Gong著「Document clustering based on non-negative matrix factorization」、In Proceedings of ACM SIGIR International Conference on pages 267-273、 2003.
[関連技術3]T.Hofmann著「Probabilistic Latent Semantic Analysis」、In Proceedings of Conference on Uncertainty in Artificial Intelligence on pages 289-296、 1999.
次に本発明の実施の形態2における検索装置、検索方法、およびプログラムについて、図25及び図26を参照しながら説明する。最初に、本実施の形態2における検索装置の構成について図25を用いて説明する。図25は、本発明の実施の形態2における検索装置の構成を示すブロック図である。
2 領域上限算出部
3 単語頻度算出部
4 文書頻度参照部
5 クラスタ作成部
6 領域要約作成部
7 単語文書マトリクス記憶部
8 クラスタ処理展開部
9 クラスタ処理選択部
10 クラスタ処理部
Claims (15)
- 単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶する要約マトリクス記憶手段と、
前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出する領域上限算出手段と、
前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定する単語頻度算出手段と、
前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力する文書頻度参照手段と、
を備える検索装置。 - 前記単語集合と前記文書集合との間の共起関係を特定するマトリクスを入力として、単語と文書に対するクラスタリング処理を行い、これによって、前記マトリクスを構成する単語集合と前記部分集合とを、それぞれ複数の部分集合に分割して、前記複数の領域を生成し、更に、その結果を前記領域要約作成手段に出力するクラスタ作成手段と、
前記クラスタリング処理による分割によって生成された前記複数の領域から、前記要約マトリクス記憶手段によって記憶される前記要約情報を作成する領域要約作成手段と、
を更に備える請求項1記載の検索装置。 - 前記領域要約作成手段が、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内に含まれる単語の当該領域における頻度を求め、更に、求めた前記頻度の最大値を特定し、特定した前記最大値を前記要約情報とし、
前記領域上限算出手段が、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、当該領域に含まれている、前記部分集合を構成する文書の数を求め、求めた前記文書の数と、当該領域についての前記頻度の最大値とを比較し、比較結果に基づいて、前記頻度の上限を算出する、請求項2に記載の検索装置。 - 前記領域要約作成手段が、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内の文書において当該領域内の単語が含まれるか否かを表すビット列を求め、求めた前記ビット列を前記要約情報とし、
前記領域上限算出手段が、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、前記部分集合を構成する文書が、当該領域に含まれているか否かを表すビット列を求め、当該領域上限算出手段によって求められたビット列と、前記領域要約作成手段によって求められたビット列とのAND演算を実行し、演算結果に基づいて、前記頻度の上限を算出する、請求項2に記載の検索装置。 - 複数のクラスタ処理手段と、クラスタ処理展開手段と、クラスタ処理選択手段とを更に備え、
前記クラスタ作成手段、前記領域要約作成手段、前記要約マトリクス記憶手段、前記領域上限算出手段、前記単語頻度算出手段、及び前記文書頻度参照手段は、前記複数のクラスタ処理手段それぞれに毎に備えられ、
前記複数のクラスタ処理手段それぞれにおいて、前記クラスタ作成手段それぞれは、互いに異なるクラスタリング処理を行い、
前記クラスタ処理展開手段は、前記部分集合を表す情報が入力されると、入力された情報を、前記複数のクラスタ処理手段それぞれの前記領域上限算出手段に入力し、
前記クラスタ処理選択手段は、前記複数のクラスタ処理手段それぞれの前記単語頻度算出手段が設定した、前記単語が共通する領域毎の単語の頻度の上限を受け取り、受け取った前記各単語の頻度の上限の分布に基づいて、前記複数のクラスタ処理手段のうちの少なくとも一つを選択し、選択した前記クラスタ処理手段の前記文書頻度参照手段のみに、処理を行わせる、請求項2に記載の検索装置。 - (a)単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶し、
(b)前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出し、
(c)前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定し、
(d)前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力する、ことを特徴とする検索方法。 - (e)前記単語集合と前記文書集合との間の共起関係を特定するマトリクスを入力として、単語と文書に対するクラスタリング処理を行い、これによって、前記マトリクスを構成する単語集合と前記部分集合とを、それぞれ複数の部分集合に分割して、前記複数の領域を生成し、
(f)前記(e)のクラスタリング処理による分割によって生成された前記複数の領域から、前記(a)のステップで記憶される前記要約情報を作成する、請求項6に記載の検索方法。 - 前記(f)の作成において、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内に含まれる単語の当該領域における頻度を求め、更に、求めた前記頻度の最大値を特定し、特定した前記最大値を前記要約情報とし、
前記(c)の設定において、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、当該領域に含まれている、前記部分集合を構成する文書の数を求め、求めた前記文書の数と、当該領域についての前記頻度の最大値とを比較し、比較結果に基づいて、前記頻度の上限を算出する、請求項7に記載の検索方法。 - 前記(f)の作成において、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内の文書において当該領域内の単語が含まれるか否かを表すビット列を求め、求めた前記ビット列を前記要約情報とし、
前記(c)の設定において、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、前記部分集合を構成する文書が、当該領域に含まれているか否かを表すビット列を求め、求められたビット列と、前記(f)の作成によって求められたビット列とのAND演算を実行し、演算結果に基づいて、前記頻度の上限を算出する、請求項7に記載の検索方法。 - 前記(e)の生成が、前記クラスタリング処理の種類を変えて複数回実行され、
前記(f)の作成と、前記(a)~前記(c)とが、前記(e)の生成が実行される度に実行される場合において、
各(e)の生成に対応する前記(c)で設定された、前記単語が共通する領域毎の単語の頻度の上限から、それぞれの上限の分布に基づいて、少なくとも一つの、前記単語が共通する領域毎の単語の頻度の上限を選択し、選択した上限を用いて、前記(d)の出力を実行させる、請求項7に記載の検索方法。 - (a)単語集合と文書集合との間の共起関係を表現するマトリクスを分割して得られた複数の領域から作成され、且つ、文書集合に含まれる部分集合を表す情報が与えられた場合に、前記複数の領域それぞれにおける単語の頻度の算出又は予測を可能にする情報を、要約情報として記憶する処理と、
(b)前記部分集合を表す情報が入力されると、前記部分集合を表す情報と前記複数の領域との関係を調べ、得られた結果から、前記複数の領域それぞれに対する前記要約情報を参照して、前記複数の領域それぞれについて、それぞれに含まれる単語の、前記部分集合に対する頻度の上限を算出する処理と、
(c)前記複数の領域それぞれについての前記頻度の上限を、前記単語が共通する領域毎に和算し、得られた和算値を、前記単語が共通する領域毎の単語の頻度の上限に設定する処理と、
(d)前記単語が共通する領域毎の単語の頻度の上限に基づいて、検索対象となる領域を求め、更に、求めた前記検索対象となる領域に基づいて、設定された数の単語を頻度が高い順に特定し、特定した単語を前記部分集合に特徴的な単語として出力する処理とを、コンピュータに実行させるためのプログラムが格納された記録媒体。 - (e)前記単語集合と前記文書集合との間の共起関係を特定するマトリクスを入力として、単語と文書に対するクラスタリング処理を行い、これによって、前記マトリクスを構成する単語集合と前記部分集合とを、それぞれ複数の部分集合に分割して、前記複数の領域を生成する処理と、
(f)前記(e)の処理におけるクラスタリング処理による分割によって生成された前記複数の領域から、前記(a)の処理で記憶される前記要約情報を作成する、処理とを、更に前記コンピュータに実行させる、請求項11に記載のプログラムが格納された記録媒体。 - 前記(f)の処理において、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内に含まれる単語の当該領域における頻度を求め、更に、求めた前記頻度の最大値を特定し、特定した前記最大値を前記要約情報とし、
前記(c)の処理において、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、当該領域に含まれている、前記部分集合を構成する文書の数を求め、求めた前記文書の数と、当該領域についての前記頻度の最大値とを比較し、比較結果に基づいて、前記頻度の上限を算出する、請求項12に記載のプログラムが格納された記録媒体。 - 前記(f)の処理において、前記クラスタリング処理による分割によって生成された前記複数の領域それぞれについて、少なくとも、各領域内の文書において当該領域内の単語が含まれるか否かを表すビット列を求め、求めた前記ビット列を前記要約情報とし、
前記(c)の処理において、前記部分集合を表す情報と前記複数の領域との関係として、前記複数の領域それぞれ毎に、前記部分集合を構成する文書が、当該領域に含まれているか否かを表すビット列を求め、求められたビット列と、前記(f)の処理によって求められたビット列とのAND演算を実行し、演算結果に基づいて、前記頻度の上限を算出する、請求項12に記載のプログラムが格納された記録媒体。 - 前記(e)の処理が、前記クラスタリング処理の種類を変えて複数回実行され、
前記(f)の処理と、前記(a)~前記(c)の処理とが、前記(e)の処理が実行される度に実行される場合において、
各(e)の処理に対応する前記(c)の処理で設定された、前記単語が共通する領域毎の単語の頻度の上限から、それぞれの上限の分布に基づいて、少なくとも一つの、前記単語が共通する領域毎の単語の頻度の上限を選択し、選択した上限を用いて、前記(d)の処理を実行させる、処理を更に前記コンピュータに実行させる、請求項12に記載のプログラムが格納された記録媒体。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010540323A JP5594145B2 (ja) | 2008-11-26 | 2009-11-06 | 検索装置、検索方法、及びプログラム |
US13/129,342 US8892574B2 (en) | 2008-11-26 | 2009-11-06 | Search apparatus, search method, and non-transitory computer readable medium storing program that input a query representing a subset of a document set stored to a document database and output a keyword that often appears in the subset |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-300793 | 2008-11-26 | ||
JP2008300793 | 2008-11-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010061537A1 true WO2010061537A1 (ja) | 2010-06-03 |
Family
ID=42225427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/005907 WO2010061537A1 (ja) | 2008-11-26 | 2009-11-06 | 検索装置、検索方法、及びプログラムが格納された記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US8892574B2 (ja) |
JP (1) | JP5594145B2 (ja) |
WO (1) | WO2010061537A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012088930A (ja) * | 2010-10-19 | 2012-05-10 | Chuden Cti Co Ltd | 入力情報分析装置 |
CN103970910A (zh) * | 2014-05-27 | 2014-08-06 | 南京大学 | 一种基于篇章文档的自适应输入法 |
US20140365510A1 (en) * | 2013-06-11 | 2014-12-11 | Konica Minolta, Inc. | Device and method for determining interest, and computer-readable storage medium for computer program |
US11392568B2 (en) * | 2015-06-23 | 2022-07-19 | Microsoft Technology Licensing, Llc | Reducing matching documents for a search query |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023989B (zh) * | 2009-09-23 | 2012-10-10 | 阿里巴巴集团控股有限公司 | 一种信息检索方法及其系统 |
US9323753B2 (en) * | 2011-02-23 | 2016-04-26 | Samsung Electronics Co., Ltd. | Method and device for representing digital documents for search applications |
US11461533B2 (en) * | 2014-10-15 | 2022-10-04 | International Business Machines Corporation | Generating a document preview |
US10599700B2 (en) * | 2015-08-24 | 2020-03-24 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for narrative detection and frame detection using generalized concepts and relations |
US11741145B1 (en) * | 2018-09-30 | 2023-08-29 | Veritas Technologies Llc | Method and system for classification of unstructured data items |
JP2020154395A (ja) * | 2019-03-18 | 2020-09-24 | 富士ゼロックス株式会社 | 情報処理装置及びプログラム |
US12099538B2 (en) * | 2021-10-29 | 2024-09-24 | Galisteo Consulting Group, Inc. | Identifying fringe beliefs from text |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1074210A (ja) * | 1996-07-05 | 1998-03-17 | Hitachi Ltd | 文献検索支援方法及び装置およびこれを用いた文献検索サービス |
JP2003248686A (ja) * | 2002-02-22 | 2003-09-05 | Ricoh Co Ltd | 文書群ラベル生成装置、文書群ラベル生成方法及び記録媒体 |
JP2003345811A (ja) * | 2002-05-27 | 2003-12-05 | Hitachi Ltd | 文書情報表示システム、文書情報表示方法及び文書検索方法 |
WO2009001696A1 (ja) * | 2007-06-22 | 2008-12-31 | Patent Result Co., Ltd. | 情報処理装置、プログラム、情報処理方法 |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06274541A (ja) | 1993-03-17 | 1994-09-30 | Nippon Steel Corp | 文献検索システム |
US5675819A (en) * | 1994-06-16 | 1997-10-07 | Xerox Corporation | Document information retrieval using global word co-occurrence patterns |
US5926812A (en) * | 1996-06-20 | 1999-07-20 | Mantra Technologies, Inc. | Document extraction and comparison method with applications to automatic personalized database searching |
US6484168B1 (en) * | 1996-09-13 | 2002-11-19 | Battelle Memorial Institute | System for information discovery |
JP3607462B2 (ja) | 1997-07-02 | 2005-01-05 | 松下電器産業株式会社 | 関連キーワード自動抽出装置及びこれを用いた文書検索システム |
JP2001014341A (ja) | 1999-07-02 | 2001-01-19 | Ricoh Co Ltd | データベース作成装置および関連文書/関連語検索装置、データベース作成方法および関連文書/関連語検索方法ならびに記憶媒体 |
US6772141B1 (en) * | 1999-12-14 | 2004-08-03 | Novell, Inc. | Method and apparatus for organizing and using indexes utilizing a search decision table |
JP2002032394A (ja) | 2000-07-18 | 2002-01-31 | Ricoh Co Ltd | 関連語情報作成装置、関連語提示装置、文書検索装置、関連語情報作成方法、関連語提示方法、文書検索方法および記憶媒体 |
US7860706B2 (en) * | 2001-03-16 | 2010-12-28 | Eli Abir | Knowledge system method and appparatus |
US6847966B1 (en) * | 2002-04-24 | 2005-01-25 | Engenium Corporation | Method and system for optimally searching a document database using a representative semantic space |
US7003516B2 (en) * | 2002-07-03 | 2006-02-21 | Word Data Corp. | Text representation and method |
JP2004164036A (ja) * | 2002-11-08 | 2004-06-10 | Hewlett Packard Co <Hp> | 文書の共通性評価方法 |
US7630980B2 (en) * | 2005-01-21 | 2009-12-08 | Prashant Parikh | Automatic dynamic contextual data entry completion system |
JP4886266B2 (ja) | 2005-10-11 | 2012-02-29 | 株式会社東芝 | 文献調査方法、文献調査システムおよび文献調査プログラム |
JP4848317B2 (ja) * | 2007-06-19 | 2011-12-28 | インターナショナル・ビジネス・マシーンズ・コーポレーション | データベースのインデックス作成システム、方法及びプログラム |
JP5078674B2 (ja) * | 2008-02-29 | 2012-11-21 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 分析システム、情報処理装置、アクティビティ分析方法、およびプログラム |
-
2009
- 2009-11-06 US US13/129,342 patent/US8892574B2/en active Active
- 2009-11-06 WO PCT/JP2009/005907 patent/WO2010061537A1/ja active Application Filing
- 2009-11-06 JP JP2010540323A patent/JP5594145B2/ja active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1074210A (ja) * | 1996-07-05 | 1998-03-17 | Hitachi Ltd | 文献検索支援方法及び装置およびこれを用いた文献検索サービス |
JP2003248686A (ja) * | 2002-02-22 | 2003-09-05 | Ricoh Co Ltd | 文書群ラベル生成装置、文書群ラベル生成方法及び記録媒体 |
JP2003345811A (ja) * | 2002-05-27 | 2003-12-05 | Hitachi Ltd | 文書情報表示システム、文書情報表示方法及び文書検索方法 |
WO2009001696A1 (ja) * | 2007-06-22 | 2008-12-31 | Patent Result Co., Ltd. | 情報処理装置、プログラム、情報処理方法 |
Non-Patent Citations (1)
Title |
---|
YASUHIRO TAKAYAMA: "Tango no Renso Kankei ni Motozuku Joho Kensaku System", IPSJ SIG NOTES, vol. 99, no. 20, 1 March 1999 (1999-03-01), pages 1 - 8 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012088930A (ja) * | 2010-10-19 | 2012-05-10 | Chuden Cti Co Ltd | 入力情報分析装置 |
US20140365510A1 (en) * | 2013-06-11 | 2014-12-11 | Konica Minolta, Inc. | Device and method for determining interest, and computer-readable storage medium for computer program |
US9607076B2 (en) * | 2013-06-11 | 2017-03-28 | Konica Minolta, Inc. | Device and method for determining interest, and computer-readable storage medium for computer program |
CN103970910A (zh) * | 2014-05-27 | 2014-08-06 | 南京大学 | 一种基于篇章文档的自适应输入法 |
CN103970910B (zh) * | 2014-05-27 | 2017-02-15 | 南京大学 | 一种基于篇章文档的自适应输入法 |
US11392568B2 (en) * | 2015-06-23 | 2022-07-19 | Microsoft Technology Licensing, Llc | Reducing matching documents for a search query |
US20230038616A1 (en) * | 2015-06-23 | 2023-02-09 | Microsoft Technology Licensing, Llc | Reducing matching documents for a search query |
US11748324B2 (en) * | 2015-06-23 | 2023-09-05 | Microsoft Technology Licensing, Llc | Reducing matching documents for a search query |
Also Published As
Publication number | Publication date |
---|---|
JP5594145B2 (ja) | 2014-09-24 |
JPWO2010061537A1 (ja) | 2012-04-19 |
US20110219000A1 (en) | 2011-09-08 |
US8892574B2 (en) | 2014-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5594145B2 (ja) | 検索装置、検索方法、及びプログラム | |
JP5531395B2 (ja) | 単語親和度による単語クラスタの識別 | |
Moldagulova et al. | Using KNN algorithm for classification of textual documents | |
US8171029B2 (en) | Automatic generation of ontologies using word affinities | |
JP6782858B2 (ja) | 文献分類装置 | |
US8543380B2 (en) | Determining a document specificity | |
JP7451747B2 (ja) | コンテンツを検索する方法、装置、機器及びコンピュータ読み取り可能な記憶媒体 | |
Karthikeyan et al. | Probability based document clustering and image clustering using content-based image retrieval | |
JPWO2011004529A1 (ja) | 分類階層再作成システム、分類階層再作成方法及び分類階層再作成プログラム | |
EP2045732A2 (en) | Determining the depths of words and documents | |
US9552415B2 (en) | Category classification processing device and method | |
CN113961528A (zh) | 基于知识图谱的文件语义关联存储系统及方法 | |
KR101472451B1 (ko) | 디지털 콘텐츠 관리 시스템 및 방법 | |
KR20180129001A (ko) | 다언어 특질 투영된 개체 공간 기반 개체 요약본 생성 방법 및 시스템 | |
CN111143400B (zh) | 一种全栈式检索方法、系统、引擎及电子设备 | |
JP4175001B2 (ja) | 文書データ検索装置 | |
Do Van et al. | Classify high dimensional datasets using discriminant positive negative association rules | |
Murarka et al. | Query-based single document summarization using hybrid semantic and graph-based approach | |
CN117688140B (zh) | 文档查询方法、装置、计算机设备和存储介质 | |
JP5416680B2 (ja) | 文書分割検索装置及び方法及びプログラム | |
CN111694948B (zh) | 文本的分类方法及系统、电子设备、存储介质 | |
US11636167B2 (en) | Determining similarity between documents | |
Toke et al. | Enhancing text mining using side information | |
Jisha et al. | A novel approach for document extraction based on SVD and FCA | |
D'Silva et al. | Improved Algorithms for Document Classification &Query-based Multi-Document Summarization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09828787 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13129342 Country of ref document: US |
|
ENP | Entry into the national phase |
Ref document number: 2010540323 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09828787 Country of ref document: EP Kind code of ref document: A1 |