WO2011070832A1 - Method for searching document data from a search keyword, and computer system and computer program therefor - Google Patents

Method for searching document data from a search keyword, and computer system and computer program therefor

Info

Publication number
WO2011070832A1
WO2011070832A1 (PCT/JP2010/065631)
Authority
WO
WIPO (PCT)
Prior art keywords
document data
keyword
search
score
data
Prior art date
Application number
PCT/JP2010/065631
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
猛 稲垣
Original Assignee
International Business Machines Corporation
Priority date
Filing date
Publication date
Application filed by International Business Machines Corporation
Priority to KR1020127016208A priority Critical patent/KR101419623B1/ko
Priority to JP2011545111A priority patent/JP5448105B2/ja
Priority to DE201011004087 priority patent/DE112010004087T5/de
Priority to CN201080054742.2A priority patent/CN102640152B/zh
Priority to GB1209093.2A priority patent/GB2488925A/en
Publication of WO2011070832A1 publication Critical patent/WO2011070832A1/ja

Classifications

    • G: PHYSICS; G06: COMPUTING, CALCULATING OR COUNTING; G06F: ELECTRIC DIGITAL DATA PROCESSING; G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/3347: Query execution using vector based model
    • G06F16/243: Natural language query formulation
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/90332: Natural language query formulation or dialogue systems

Definitions

  • The present invention relates to a method for searching document data from a search keyword, and to a computer system and a computer program therefor.
  • Owing to the widespread use of computer networks and the higher performance of computer systems, it has become easy to access enormous amounts of electronic document data (hereinafter, document data).
  • Searching document data makes it possible to find the necessary document data among these enormous collections of document data.
  • Document data is searched for various purposes. For example, in the case of a search engine in an Internet browser, Web pages on servers or intermediate servers (also called proxy servers) distributed throughout the world are the search targets. In a company, document data accumulated through corporate activities is the search target.
  • One method of searching document data analyzes the correlation between a search keyword and words or phrases in the document data, to find words or phrases that co-occur with the search keyword.
  • For example, by finding words or phrases that are strongly correlated with the word "IBM" (a registered trademark of IBM), document data highly related to "IBM" can be found appropriately among the search targets on the basis of those strongly correlated words or phrases.
  • Such a document data search method is described, for example, in Non-Patent Document 1.
  • The present invention provides a method for appropriately finding correlations in a wider context when searching document data from a search keyword.
  • The present invention provides a method for searching document data that is a description of an event in a natural language and that has a correlation with a search keyword or with a related keyword associated with the search keyword.
  • The method causes a computer to execute the following steps:
  • calculating, as a first vector, the score or probability that each piece of document data belongs to a cluster or class for clustering or classifying the document data;
  • calculating, as a second vector, in response to the input of the search keyword, the score or probability that the search keyword or a related keyword associated with the search keyword belongs to the cluster or class;
  • calculating the inner product of each first vector with the second vector, the calculated inner product value being the score of the document data for the search keyword; and
  • obtaining a correlation value from the document data containing the individual keywords of a classification keyword set and the document data whose score is equal to or higher than a predetermined threshold or falls within a predetermined top ratio.
  • The present invention also provides a computer system that searches document data that is a description of an event in a natural language and that has a correlation with a search keyword or with a related keyword associated with the search keyword.
  • The computer system includes: a first calculation unit that calculates, as a first vector, the score or probability that each piece of document data belongs to a cluster or class for clustering or classifying the document data;
  • a second calculation unit that calculates, as a second vector, in response to the input of the search keyword, the score or probability that the search keyword or a related keyword associated with the search keyword belongs to the cluster or class;
  • a third calculation unit that calculates the inner product of each first vector with the second vector, the calculated inner product value being the score of the document data for the search keyword; and
  • a correlation value calculation unit that obtains a correlation value from the document data containing the individual keywords of the classification keyword set and the document data whose score is equal to or higher than a predetermined threshold or falls within a predetermined top ratio.
  • In one embodiment, the correlation value calculation unit obtains the correlation value from a first data set of document data containing the individual keywords of the classification keyword set, a second data set of document data whose score is equal to or higher than a predetermined threshold or falls within a predetermined top ratio, and a common data set of document data present in both the first data set and the second data set.
  • In that case, the correlation value calculation unit obtains the correlation value, for example, according to the correlation function of Equation 20 below.
  • In another embodiment, the correlation value calculation unit additionally uses a third data set of document data containing the search keyword or the related keyword, and obtains the correlation value, for example, according to the correlation function of Equation 26 below.
  • The present invention also provides a computer program that causes a computer to execute each step of the methods described above.
  • The search method according to embodiments of the present invention can accurately find the necessary document data.
  • FIG. 1A shows the flow of creating an index database, including clustering or classification of document data.
  • An outline flow of document data retrieval is shown.
  • FIG. 1C shows an outline flow of natural language analysis.
  • An example of a search by collating the internal representation of document data with the internal representation of a query is shown.
  • The contents of steps 101 to 103 in FIG. 1A are described using a specific example of document data.
  • A conceptual diagram of the first aspect of clustering a plurality of document data is shown.
  • A conceptual diagram of the second aspect of clustering a plurality of document data is shown.
  • A conceptual diagram of the third aspect of clustering a plurality of document data is shown.
  • FIG. 4A shows the facet keyword A, and FIG. 4B is a conceptual diagram of calculating a correlation value between the facet keyword A and the search keyword using the facet count, and of detecting a keyword having a strong correlation with the search keyword.
  • A flow for creating an index used to detect keywords having a strong correlation with a search keyword, using the facet count of FIG. 4B, is shown.
  • A flow of correlation detection with a search keyword using the index created in FIG. 4D is shown.
  • A flow of the concept search is shown.
  • FIG. 5B is a conceptual diagram of the vectors of the document data and the query, and of their inner products, in the concept search of FIG. 5A.
  • A conceptual diagram of document vectors in a vector space is shown.
  • A conceptual diagram of the search method of the first aspect, which is an embodiment of the present invention, is shown.
  • A conceptual diagram of detecting a facet keyword having a strong correlation with a search keyword using a facet count is shown.
  • FIG. 6B shows a flow for creating an index used for the search method of the first aspect of FIG. 6A.
  • A flow of detecting correlation by the search method of the first aspect of FIG. 6A, using the index created in FIG. 6C, is shown.
  • The concept of the search method of the second aspect, which is an embodiment of the present invention, is shown.
  • FIG. 7B shows a flow of detecting correlation by the search method of the second aspect of FIG. 7A, using an index created in the same manner as in FIG. 6C.
  • FIG. 8A shows a system diagram of a computer including a document data processing unit and an indexing unit according to an embodiment of the present invention.
  • FIG. 8B shows a system diagram of a search server having a search unit according to an embodiment of the present invention.
  • FIG. 9 shows a block diagram of the computer hardware included in each of the systems of FIGS. 8A and 8B in an embodiment of the present invention.
  • Search results according to a keyword search method, the search method of the first aspect of the present invention, and the search method of the second aspect of the present invention are shown.
  • Document data is a description of an event in a natural language.
  • Document data is an electronic record of an event that occurred in the real world.
  • An event is also referred to as an occurrence.
  • The document data is digitized and machine-readable.
  • the document data may have a text segment.
  • the document data may also be a piece of data that can be specified using the subject as a key.
  • The document data is, for example, a Web page on the Internet, a company's product accident report, a telephone reception report, news, a technical document, or the like, but is not limited thereto.
  • One document data may not correspond to one physical data file. That is, the document data can physically be a part or all of one data file.
  • One physical data file may contain a collection of document data.
  • the document data can be stored as a data file in, for example, a server storage device or a network storage device connected via a network, such as a document database in a storage area network (SAN).
  • The storage format is not particularly limited, and the document data may be described in plain text, HTML, XML, or the like.
  • Document data is collected from various servers, for example periodically or collectively, by a crawler (805 in FIG. 8A).
  • the collection of document data refers to a data set including one or more document data.
  • the collection of document data can physically be a part or all of one or more data files.
  • A search keyword refers to at least one word (hereinafter also called a "search word"), at least one phrase (hereinafter also called a "search phrase"), or a combination thereof, used for document data search.
  • Search keywords may be entered by the user or automatically entered by a computer.
  • a word is the smallest linguistic unit having linguistic sound, meaning and grammatical function, and its part of speech is not particularly limited.
  • A phrase is generally two or more words arranged grammatically that serve as one unit in a sentence. In particular, in English, a phrase is a sequence of two or more words that does not contain a finite verb and its subject but functions as a single part of speech.
  • the input of the search keyword by the user can be performed by inputting one or more words, one or more phrases, or a combination thereof, for example, in a search keyword input field on a browser or an application.
  • the input of the search keyword by the computer is performed by detecting one or more words, one or more phrases, or a combination thereof based on, for example, web contents that the user is browsing.
  • the input search keyword is automatically converted into a query (for example, SQL), for example.
  • the converted query is transmitted to a search server or an intermediate server.
  • The "related keyword associated with the search keyword" is at least one word, at least one phrase, or a combination thereof that is strongly related to the search keyword, for example a keyword that co-occurs with the search keyword.
  • Co-occurrence means that at least two arbitrary keywords appear simultaneously in certain document data.
  • the co-occurrence correlation indicates the strength of relevance between keywords.
  • Related keywords are extracted in advance on the basis of arbitrary keywords in the document data, for example keywords with a high appearance frequency, and are registered in, for example, a dictionary.
  • When such an arbitrary keyword is later input as a search keyword by the user, a related keyword of that search keyword is selected from the dictionary.
  • The co-occurrence correlation of words is obtained, for example, according to Equation 1 below when there is a set of document data in which the word A appears and a set of document data in which the word B appears.
  • Document data in which the word A appears is synonymous with document data including the word A.
  • the document data in which the word B appears is synonymous with the document data including the word B.
  • The co-occurrence correlation of the word is obtained according to Equation 2 below by replacing the word A with a plurality of keywords {1, 2, ..., n}.
  • ρ1 is the co-occurrence correlation of keyword 1 and the word,
  • ρ2 is the co-occurrence correlation of keyword 2 and the word, and
  • ρn is the co-occurrence correlation of keyword n and the word.
  • The value is 1 when there is no correlation.
  • a set of document data for obtaining co-occurrence correlation is also called a corpus.
  • the set of corpus document data may be the same as or different from the set of document data to be searched.
  • It is desirable that the corpus be a set of document data in the same language as the set of document data to be searched and belonging to the same field as, or a field similar to, the document data to be searched. For example, when the document data to be searched concerns car accidents, the corpus may be report data summarizing accident reports.
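  • The co-occurrence correlation above can be sketched in Python. Since Equations 1 and 2 are not reproduced in this text, the lift-style form below (which takes the value 1 when the words are uncorrelated, matching the note above) is an assumption:

```python
def cooccurrence_lift(docs, word_a, word_b):
    """Lift-style co-occurrence correlation between word_a and word_b.

    docs is a corpus given as a list of sets of words (one set per document).
    Returns P(B | A) / P(B); a value of 1.0 means no correlation.
    """
    n = len(docs)
    with_a = [d for d in docs if word_a in d]           # documents containing A
    n_both = sum(1 for d in with_a if word_b in d)      # documents containing A and B
    n_b = sum(1 for d in docs if word_b in d)           # documents containing B
    return (n_both / len(with_a)) / (n_b / n)
```

  • For the plurality of keywords {1, 2, ..., n} of Equation 2, the same measure would be evaluated once per keyword, yielding ρ1, ..., ρn.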
  • clustering refers to grouping document data based on the similarity or distance between the document data.
  • a group formed by clustering a set of document data or a group for clustering a set of document data is also called a cluster.
  • Since document data is a set of keywords,
  • clustering a set of document data is also clustering of keywords.
  • The similarity between document data is a real number determined for any two pieces of document data. The larger the value, the more similar the two pieces of document data are assumed to be.
  • A cluster can be created by grouping document data with high similarity.
  • The distance between document data refers to the distance between two pieces of document data in a Euclidean space. To define a distance, a space must first be defined: each word is taken as one dimension, and each piece of document data is plotted as a point whose coordinates are the number of occurrences of each word in the document data, or its tf·idf.
  • The distance between two pieces of document data is then simply the distance between the two points in the Euclidean space.
  • a cluster can be created by grouping document data that are close to each other.
  • tf·idf is obtained by multiplying the number of occurrences tf of a word in the document data by the reciprocal of the number of document data containing the word, or by the logarithm of that reciprocal, so as to reduce the contribution of frequent words.
  • The similarity between document data is, for example, the reciprocal of the distance.
  • The similarity itself need not be defined via a distance in the Euclidean space, as long as the magnitude relationship can be maintained.
  • For example, the similarity can be defined as the central angle of two points on a sphere, so various definitions of the similarity are possible.
  • Step 1: The analysis processing unit (806 in FIG. 8A) performs morphological analysis on the document data and divides it into keywords.
  • Methods of morphological analysis include rule-based morphological analysis and analysis using a probabilistic language model.
  • Morphological analysis using a probabilistic language model is, for example, a method using a hidden Markov model.
  • A commercially available morphological analysis engine may be used.
  • To represent document data, a vector composed of the weights of the keywords constituting the document is often used. To express this vector, the keywords in the document data must be separated.
  • Step 2: The clustering unit (806 in FIG. 8A) represents the document data as a vector composed of the weight of each word.
  • tf·idf is obtained on the basis of two indices: tf (term frequency) and idf (inverse document frequency).
  • Each vector d_i is normalized so that its length is 1.
  • the vector is shown on the vector space model.
  • the vector space model is a search model that expresses document data using the vector.
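  • The weighting and normalization of Step 2 can be sketched as follows (a minimal illustration; the exact idf variant used by the patent may differ from the logarithmic one assumed here):

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one dict per document mapping
    word -> tf * idf, normalized so each document vector has length 1."""
    n = len(docs)
    df = {}                                    # document frequency of each word
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for d in docs:
        tf = {}
        for w in d:
            tf[w] = tf.get(w, 0) + 1           # raw term frequency
        v = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors
```

  • Note that a word appearing in every document receives idf = log(1) = 0 and so contributes nothing, which is the intended damping of frequent words.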
  • Step 3: The clustering unit defines the similarity or distance between document data.
  • To cluster document data, it is necessary to define a similarity or distance as an index of how similar pieces of document data are.
  • The method of definition differs depending on the document data to be clustered.
  • Methods of obtaining a similarity or distance include (1) hierarchical clustering, (2) non-hierarchical clustering (e.g., the k-means method), (3) dimension-reduction methods such as principal component analysis, (4) methods based on probability models, and (5) methods based on graph theory.
  • The method of obtaining the similarity or distance can be selected appropriately according to the document data to be clustered, and is not limited to methods (1) to (5).
  • Let the document vectors of document data D_i and D_j be d_i and d_j, respectively. The similarity s(d_i, d_j) between these document data can be represented by the cosine of the angle formed by d_i and d_j, as shown in Equation 8 below.
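  • The cosine similarity of Equation 8 can be written directly, with document vectors represented as word-to-weight dicts:

```python
import math

def cosine_similarity(di, dj):
    """s(d_i, d_j): cosine of the angle between two document vectors,
    each given as a dict mapping word -> weight."""
    dot = sum(w * dj.get(k, 0.0) for k, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0
```

  • Parallel vectors give similarity 1, and vectors with no words in common give 0.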
  • Step 4: The clustering unit performs clustering using the similarity.
  • Step 1: The clustering unit sets each piece of document data as one cluster.
  • Step 2: The clustering unit finds, among the set of clusters, the pair of clusters with the maximum similarity.
  • Step 3: If the similarity between that pair of clusters is equal to or less than a threshold, the clustering unit ends the clustering. Otherwise, the clustering unit merges the pair of clusters into one cluster, returns to Step 2, and repeats Steps 2 and 3.
  • The similarity between sets of clusters can be obtained, for example, by the longest-distance method shown in Equation 9 below.
  • Among the similarities between the document vector x of any document data belonging to cluster G_i and the document vector of any document data belonging to cluster G_j,
  • the minimum similarity is taken as the similarity between the clusters.
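  • Steps 1 to 3 above, with the longest-distance (complete-linkage) rule of Equation 9, can be sketched as follows; the similarity function and the threshold are left as parameters:

```python
def complete_linkage_cluster(items, sim, threshold):
    """Agglomerative clustering. Cluster similarity is the minimum pairwise
    similarity of their members (longest-distance method, Equation 9);
    merging stops when the most similar pair falls below the threshold."""
    clusters = [[x] for x in items]            # Step 1: one cluster per document
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):         # Step 2: find most similar pair
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best < threshold:                   # Step 3: stop condition
            break
        i, j = pair
        clusters[i] += clusters.pop(j)         # merge the pair into one cluster
    return clusters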
  • Alternatively, the k-means method can be used.
  • Clusters can be formed by the following algorithm. Assume that the number of cluster divisions is k, the number of document data is m, and the number of document data included in a cluster W is N(W). Step 1: The clustering unit arbitrarily determines k initial clusters. Step 2: The clustering unit calculates the error increase e(i, W) when document data D_i is moved to cluster W, according to Equation 10 below, and moves D_i to the cluster W for which this value is minimum.
  • D(i, W) is the distance between the document data D_i and the cluster W, and is defined by Equations 11 and 12 below.
  • Step 3: The clustering unit ends if no document data moves from one cluster to another. Otherwise, the clustering unit returns to Step 2.
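  • A minimal k-means sketch of Steps 1 to 3 on two-dimensional points follows. The patent moves documents by the error increase of Equation 10; the common simplification below reassigns each point to its nearest centroid, and seeds the k initial clusters with the first k points rather than choosing them arbitrarily:

```python
def kmeans(points, k, iters=20):
    """Toy k-means on 2-D tuples. Returns the k groups of points."""
    centroids = list(points[:k])               # Step 1: k initial clusters
    groups = [[p] for p in centroids]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                       # Step 2: move each point to the
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[d.index(min(d))].append(p)  # cluster with minimum cost
        new = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:                   # Step 3: stop when nothing moves
            break
        centroids = new
    return groups
```

  • Two well-separated pairs of points end up in two clusters of two after a few iterations.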
  • LSA (latent semantic analysis), also called LSI (latent semantic indexing), can also be used.
  • In LSA, a document-word matrix that represents the appearance frequency of keywords in each document data is used.
  • The document-word matrix is a sparse matrix whose columns correspond to words or phrases and whose rows correspond to document data.
  • For the weighting of each component of the matrix, for example, the tf·idf described above is used.
  • LSA converts this sparse matrix into relationships between words or phrases and a number of concepts, and between document data and those concepts.
  • The keywords are thereby indirectly associated with the document data through the concepts.
  • An example of an implementation is the classification module of IBM InfoSphere (trademarks of IBM Corporation), which is based on, for example, a boosting algorithm.
  • Clustering is also possible without expressing document data as vectors.
  • For example, the similarity between document data can be obtained by measuring the number of co-citations or the bibliographic coupling. As long as the similarity can be defined, clustering can be performed by an appropriate clustering method.
  • Classification refers to grouping document data automatically by a computer or manually by a person.
  • A group formed by classifying a set of document data, or used for classifying a set of document data, is also called a class.
  • The classification is performed according to, for example, a model expressed by equations, a model expressed by rules, a model expressed by probabilities, or a model that performs matching.
  • the model expressed by the equation is, for example, discriminant analysis.
  • the model expressed by rules is, for example, a rule base or a decision tree.
  • the model expressed by the probability is, for example, a Bayesian network.
  • the model for performing matching is, for example, a self-organizing map.
  • Document data clustering belongs to unsupervised learning, in which no class (also called a label) is given to each piece of target document data, and classes are formed from the keywords of the prepared document data. That is, clustering does not define classes in advance; rather, a data space is defined and grouping is performed according to the similarity or distance existing between document data.
  • "Classification" of document data belongs to supervised learning, in which a class is assigned to each piece of target document data. In classification, attention is paid to one keyword (objective variable, teacher signal) in the document data.
  • The score or probability that each piece of document data belongs to a cluster or class refers to the degree or likelihood with which that document data belongs to the cluster or class. A probability is expressed as 0 to 100%.
  • The degree is expressed as a score, for example.
  • the “first vector” represents a score or probability that the document data belongs to the cluster or the class as a vector.
  • the first vector may be indicated by a value obtained by converting the score or probability into a real number of 0 to 1, for example.
  • the “second vector” represents a score or probability that the search keyword or the related keyword belongs to the cluster or the class as a vector.
  • the second vector may be indicated by a value obtained by converting the score or probability into a real number of 0 to 1, for example.
  • the “second vector” is a parameter that is evaluated regardless of the document data prepared in the document database.
  • The "inner product" is an operation that determines a number (scalar) from each first vector (one per piece of document data) and the second vector (for the search keyword).
  • The inner product value of a first vector and the second vector is obtained according to Equation 13 below. An inner product value is obtained for each first vector.
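  • The scoring of Equation 13 amounts to one inner product per document, with the first and second vectors indexed by the same clusters or classes:

```python
def keyword_scores(first_vectors, second_vector):
    """Score of each piece of document data for the search keyword:
    the inner product of its first vector (cluster/class membership
    scores) with the query's second vector."""
    return [sum(a * b for a, b in zip(fv, second_vector))
            for fv in first_vectors]
```

  • A document whose cluster memberships align with those of the keyword receives a high score; documents with the highest scores form the high-score set used for the correlation value.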
  • The "correlation value" is used to detect facet keywords having a strong correlation with the search keyword, or to detect facet keywords having a strong correlation with the search keyword together with the document data obtained as a result of the concept search. The higher the correlation value, the stronger the correlation with the search keyword.
  • the search method of the first aspect is performed by obtaining the correlation value of the first aspect.
  • The correlation value of the first aspect is obtained from a first data set of document data containing the individual keywords of the classification keyword set, a second data set of document data whose score is equal to or higher than a predetermined threshold or falls within a predetermined top ratio, and a common data set of document data present in both the first data set and the second data set. In this case, the correlation value is calculated using a correlation function.
  • The correlation value is obtained, for example, according to the correlation function of Equation 20 below.
  • the search method of the second aspect is performed by obtaining the correlation value of the second aspect.
  • The correlation value of the second aspect is obtained from the first data set of document data containing the individual keywords of the classification keyword set, the second data set of document data whose score is equal to or higher than a predetermined threshold or falls within a predetermined top ratio, a third data set of document data containing the search keyword or the related keyword, and the common data set of document data present in both the first data set and the second data set.
  • The correlation value is calculated using a correlation function.
  • The correlation value is obtained, for example, according to the correlation function of Equation 26 below.
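  • Since Equations 20 and 26 are not reproduced in this text, the sketch below assumes a lift-style correlation between the facet-keyword document set (first data set) and the high-score document set (second data set): the observed overlap divided by the overlap expected if the two sets were independent, with 1.0 meaning no correlation. The actual correlation functions of the patent may differ.

```python
def correlation_value(first_set, second_set, n_total):
    """Hedged sketch of a correlation value between two document-ID sets.

    first_set:  IDs of documents containing a given facet keyword
    second_set: IDs of documents at or above the score threshold
    n_total:    total number of documents in the collection
    """
    common = first_set & second_set            # the common data set
    expected = len(first_set) * len(second_set) / n_total
    return len(common) / expected if expected else 0.0
```

  • Ranking facet keywords by this value surfaces those whose documents co-occur with the high-score set more often than chance would predict.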
  • The "classification keyword set" is also called a set of facet keywords.
  • A facet is a viewpoint of information.
  • a plurality of attribute values (facet values) are used as metadata.
  • the classification keyword set is a plurality of facet keywords (also simply referred to as facet keywords).
  • facet keywords can be selected from words or phrases in a dictionary, for example, by a user.
  • the facet keyword is selected by, for example, causing the user to select a facet keyword displayed on the tree on the application. Facet keywords need not be selected directly by the user, but may be automatically selected by a computer.
  • “document data including individual keywords of the classification keyword set” is document data (first document data) including facet keywords in a set of document data.
  • the “first data set of document data including individual keywords of the classification keyword set” is an aggregate composed of the first document data.
  • "Document data whose score is equal to or higher than a predetermined threshold or whose score is included in the upper predetermined ratio" is document data (second document data), in a set of document data, whose score, that is, the inner product of the first vector and the second vector, satisfies that condition.
  • the “second data set of document data whose score is equal to or higher than a predetermined threshold or whose score is included in the upper predetermined ratio” is an aggregate composed of the second document data.
  • “document data including a search keyword or related keyword” is document data including a search keyword or document data (third document data) including a search keyword in a set of document data.
  • the “third data set of document data including a search keyword or a related keyword” is an aggregate including the third document data.
  • A common data set of document data existing in both the first data set and the second data set is a collection of document data that exists in both the first data set and the second data set.
  • The techniques used in the embodiments of the present invention will be described below with reference to FIGS. 1A to 5D, and then the embodiments of the present invention will be described with reference to FIGS. 6A to 10.
  • It should be understood that the embodiments are for the purpose of illustrating preferred aspects of the invention and are not intended to limit the scope of the invention to what is shown here. Throughout the following drawings, the same reference numerals refer to the same objects unless otherwise specified.
  • FIG. 1A shows the flow of creating an index database, including document data clustering or classification.
  • A. Creation of the Index Database. In creating the index database (114), words and phrases are extracted from the document data (111) (step 101), the document data is clustered or classified (step 102), and an index for specifying, from a search keyword, the document data containing that keyword is created (step 103). As the index, words or phrases are indexed to the document data. The score or probability that the document data belongs to a cluster or class is attached to the document data as metadata.
  • the analysis processing unit (806 in FIG. 8A) that processes the natural language analysis in the computer performs natural language analysis of the prepared document data (111).
  • the document data (111) is stored in, for example, a storage device that stores a document database or another recording medium.
  • Natural language analysis is composed of, for example, the following four steps as shown in FIG. 1C: morphological analysis (121), syntactic analysis (122), semantic analysis (123), and context analysis (124). Natural language analysis may be performed using, for example, commercially available natural language analysis application software.
  • the natural language analysis engine is implemented as a part of IBM OmniFind Enterprise Edition (IBM and OmniFind are registered trademarks of IBM Corporation).
  • the analysis processing unit extracts a word from the document data (111).
  • a word dictionary (112) and a word extraction rule (113) are used.
  • the word dictionary (112) is a dictionary used for extracting words from document data by natural language analysis.
  • the word dictionary for example, a dictionary in the field of the contents of the document data or a field similar to the contents can be used.
  • the extraction rule (113) is a rule or set of rules used for extracting words from document data by natural language analysis. In natural language analysis, part-of-speech information is further added to the extracted words using the word dictionary (112).
  • the analysis processing unit may further extract a phrase based on the word to which the part of speech information is added and the extraction rule (113).
  • the clustering unit in the computer clusters or classifies the document data (111) stored in the document database.
  • in document data clustering, document data are grouped based on the similarity or distance between the document data.
  • in document data classification, document data are grouped automatically by a computer or manually by a person.
  • by clustering or classification, a score for each cluster is obtained for each document data.
  • the method for obtaining the score differs depending on the clustering or classification algorithm; in other words, the algorithm defines the score. For example, in the method of mapping to the word space, each cluster is defined as a point representing the cluster in the word space, and each document data is also defined as a point. In this mapping method, the reciprocal of the distance between the points can be used as the score. Another method is to regard each point as a vector and define the score by an inner product.
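The two score definitions above can be sketched as follows; the function names and the coordinate values are hypothetical illustrations, not part of the specification.

```python
import math

def score_by_distance(doc_point, cluster_point):
    """Score as the reciprocal of the distance between a document point and
    a cluster representative point in the word space (mapping method)."""
    dist = math.dist(doc_point, cluster_point)
    return 1.0 / dist if dist > 0 else float("inf")

def score_by_inner_product(doc_vec, cluster_vec):
    """Score as the inner product when the points are regarded as vectors."""
    return sum(d * c for d, c in zip(doc_vec, cluster_vec))

doc = [0.5, 0.2, 0.0]        # word-frequency point of one document data
cluster = [1.0, 0.0, 0.0]    # representative point of one cluster
print(score_by_distance(doc, cluster))
print(score_by_inner_product(doc, cluster))  # 0.5
```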
  • step 103 the indexing unit (807 in FIG. 8A) in the computer replaces the document data to be searched with a format that can be processed internally (internal representation) (see 132 in FIG. 1D). This internal representation is also called an index.
  • the indexing unit stores words or phrases (hereinafter collectively referred to as “keywords”) extracted from the document data (111) as an index in the index database (114).
  • the indexing unit creates a document list having a score for each cluster for each document data.
  • the document list may be stored in the index database (114), or may be stored in a recording medium as another database.
  • storing the scores of all clusters in each document data is redundant and increases the data amount. Therefore, only scores larger than a predetermined threshold, or the scores of the top predetermined ratio of clusters, may be stored in the document list, and the scores of the other clusters may be regarded as zero.
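The pruning described above can be sketched as follows; the cluster names and score values are hypothetical.

```python
def prune_scores(cluster_scores, threshold=None, top_ratio=None):
    """Keep only scores above a threshold, or the scores of the top fraction
    of clusters; the omitted clusters are implicitly regarded as zero."""
    items = sorted(cluster_scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_ratio is not None:
        keep = max(1, int(len(items) * top_ratio))
        items = items[:keep]
    if threshold is not None:
        items = [(c, s) for c, s in items if s > threshold]
    return dict(items)

scores = {"cluster1": 0.72, "cluster2": 0.05, "cluster3": 0.21, "cluster4": 0.02}
print(prune_scores(scores, threshold=0.1))  # {'cluster1': 0.72, 'cluster3': 0.21}
```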
  • the index database (114) is created. By creating the index database (114), it is possible to search document data using the index database (114) based on a search keyword input by a user or created by a computer. An outline of document data search is shown in FIG. 1B.
  • the index can be created automatically by a computer, or manually, according to the efficiency or purpose of the search, as described in step 103 above. It is important that the index is a good representation of the contents of the document data, for use in matching against queries entered by a user or created by a computer.
  • automatic index creation by a computer is a method in which words are automatically extracted from the document data to be searched, the index is automatically associated with the document data based on the part of speech or statistical information of the words, and the search keywords are registered in an index database.
  • in Japanese, since words are not delimited by spaces as in English, it is necessary to automatically segment the words in document data.
  • the morphological analysis can be used for the division.
  • function words such as particles and auxiliary verbs may be deleted from the index, and only meaningful content words such as independent words may be registered in the index database (114) as indexes.
  • An n-gram index can also be used for automatic index creation.
  • an n-gram index may be used in which consecutive n characters are indexed while being shifted one character at a time from the beginning of a sentence instead of divided words.
  • an n-gram index, however, may also create meaningless indexes.
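An n-gram (here, bigram) index of the kind described above can be sketched as follows; the example string is illustrative. The bigram 京都 ("Kyoto") cut out of 東京都 ("Tokyo Metropolis") is a well-known example of an index that corresponds to no actual word.

```python
def ngrams(text, n=2):
    """Cut out consecutive n characters, shifting one character at a time
    from the beginning of the sentence, instead of using divided words."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# The bigram "京都" below does not correspond to any word in the original
# string, illustrating how an n-gram index can produce a meaningless index.
print(ngrams("東京都庁"))  # ['東京', '京都', '都庁']
```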
  • the importance can also be used for automatic index creation.
  • the importance level indicates how closely the index extracted from the document data is related to the contents of the document data including the index. By assigning the degree of importance to the search keyword, more appropriate document data can be searched.
  • the importance of an index can usually vary depending on the document data containing the search keyword. For example, tf·idf is used as the importance calculation method.
  • tf is the appearance frequency of the index in the document data; the more often a certain keyword appears in the document data, the more important it is determined to be. That is, an index with a higher appearance frequency is more important in that document data.
  • df is the number of document data in which an index appears in a set of document data
  • idf is the reciprocal thereof.
  • tf·idf represents the property that, in a set of document data, a specific keyword is important when it appears frequently in a specific document data, but its importance is lowered when it also appears in many of the document data across the set. This property can be used to weight the index. Using tf·idf weighting, document data in which a given search keyword has high importance can be preferentially extracted as search results.
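The tf·idf weighting described above can be sketched as follows; the word lists are hypothetical examples.

```python
import math

def tf_idf(term, doc, docs):
    """tf: appearance frequency of the term in one document's word list;
    idf: log of the inverse fraction of documents containing the term."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

docs = [
    ["handle", "oncoming", "vehicle", "handle"],
    ["engine", "oil", "gasket"],
    ["rain", "brake", "handle"],
]
# "handle" appears in many documents, so its per-occurrence weight is
# lowered; "oncoming" appears in only one, so its weight is higher.
print(tf_idf("handle", docs[0], docs))
print(tf_idf("oncoming", docs[0], docs))
```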
  • the manual index creation is, for example, a method in which a database administrator looks at the contents of document data, extracts words or phrases that are considered important for search, and uses the words or phrases as an index.
  • FIG. 1B shows a general flow of document data search.
  • B. Searching Document Data. Document data search is performed using the index database (114) created in FIG. 1A (104). The document data search will be described below in accordance with steps 104 to 105 in FIG. 1B.
  • the search server receives the query (115) input to the user terminal via, for example, a network.
  • the query is expressed as a search expression, for example.
  • the search formula has a search keyword.
  • the indexing unit in the search server replaces the query with a format that can be processed inside the system (internal representation) (see 131 in FIG. 1D).
  • the search unit in the search server accesses the index database (114) and collates the internal representation (131) of the query with the internal representation of the document data (132 in FIG. 1D) (see 104 in FIG. 1D). A search result satisfying the above query can be obtained.
  • the search server transmits the search result to the user terminal.
  • the user terminal displays the search result on the display device.
  • the search result for example, document data is displayed as a list, and preferably, the list of document data is displayed in descending order of correlation value with the query. Note that the search server and the user terminal may be the same.
  • the indexing unit of the user terminal replaces the query with a format that can be processed inside the system (internal representation) (see 131 in FIG. 1D).
  • the search unit in the user terminal accesses the index database (114) and obtains a search result that satisfies the above query.
  • the user terminal displays the search result on the display device.
  • FIG. 1C shows a general flow of natural language analysis. Natural language analysis is performed by an analysis processing unit.
  • the analysis processing unit includes a morphological analysis processing unit, a syntax analysis processing unit, a semantic analysis processing unit, and a context analysis processing unit.
  • the morphological analysis processing unit morphologically analyzes the clauses of the document data (111). In morphological analysis, conjugation information is used as usage information; inflected words are returned to their base form, and parts of speech are assigned to all the words extracted from the document data. In morphological analysis, for example, only results in which the word arrangement in the clause is morphologically correct can be used.
  • the syntax analysis processing unit performs syntax analysis using the result of the morphological analysis.
  • the grammar of each language according to the document data is used as usage information, and the syntactic structure is extracted.
  • the syntactic structure is a regular array structure of sentences.
  • a grammatical structure is used to analyze a modification relationship between words or phrases, thereby obtaining a syntactic structure of a sentence.
  • the semantic analysis processing unit extracts the meaning of the word, phrase, or sentence using a dictionary having semantic information possessed by the word or phrase.
  • ambiguity generated by morphological analysis and syntactic analysis is resolved.
  • the context analysis processing unit performs context analysis using the result of semantic analysis.
  • FIG. 1D shows an example of a search by matching the internal representation of document data with the internal representation of a query.
  • the search unit collates the internal representation (131) of the query created from the query (115) with the internal representation (132) of each document data, and searches for document data that matches the query.
  • the search unit displays the search result of the matched document data on the display device in a list format, for example.
  • FIG. 2 explains the contents of steps 101 to 103 in FIG. 1A using a specific example of document data.
  • a specific example of document data (211) will be described using original document data 1 (221).
  • the contents of the original document data 1 (221) are as follows: “The steering wheel was pushed hard to avoid the oncoming vehicle. Date 2007/07/07 7:00 AM”.
  • Steps 201 to 203 correspond to steps 101 to 103 in FIG. 1A, respectively.
  • the analysis processing unit (806 in FIG. 8A) performs natural language analysis of the original document data 1 (221). The results are as follows: "Oncoming car (noun) Avoid (verb) Handle (noun) Strong (adjective) Can (verb)" (222).
  • the clustering unit (806 in FIG. 8A) clusters or classifies the document data (211).
  • each of the clusters 1 to 3 shown in FIG. 2 is obtained by referring to all words included in the document data and grouping document data that share many common words.
  • the scores of clusters 1 to 3 are shown (223).
  • each score (223) of clusters 1 to 3 indicates the degree to which the original document data 1 belongs to that cluster. For example, when the set of document data relates to traffic accident reports, cluster 1 is “accidents due to driving mistakes”, cluster 2 is “accidents due to engine failure”, and cluster 3 is “accidents due to rain”.
  • each cluster is not characterized by a single specific word or phrase alone. For example, when the set of document data relates to traffic accident reports, words such as “handle” and “brake” appear as prominent words in cluster 1, “accidents due to driving mistakes”, while words such as “oil” and “gasket” appear as prominent words in cluster 2, “accidents due to engine failure”.
  • the indexing unit (807 in FIG. 8A) stores the word or phrase (224) that is the index of the document data (201) in the index database (214).
  • the index database (214) may also store a document list (225).
  • the word / phrase index (224) has words or phrases extracted from the document data (111) as an index.
  • the word / phrase index (224) may also have a date, such as the document data creation date, as an index.
  • the date is not essential as an index of the document data, but anything other than words or phrases that can be used as document data metadata such as the creation date of the document data can be indexed.
  • the creation date of the document data is useful, for example, when it is desired to analyze the document data created within a specific period with a search target narrowed down.
  • the document list (225) records a score in each cluster for each document data.
  • the score of each cluster for each of the original document data 1 to n is stored.
  • FIG. 3A shows a conceptual diagram of a first mode of clustering of a plurality of document data.
  • the first mode is a method of clustering by regarding the appearance frequency of each word as an independent dimension and plotting document data on a vector space model.
  • the vector space model is also called a word space.
  • the appearance frequency of word 1 is indicated on the X axis
  • the appearance frequency of word 2 is indicated on the Y axis
  • the appearance frequency of word 3 is indicated on the axis indicated by the arrow. Therefore, the vector space model shown in FIG. 3A is three-dimensional. Note that if there are N words (N is an integer), there are N dimensions. If there are 100 million words, N is 100 million, so the vector space model is 100 million dimensions.
  • Step 1: Given the document data, the language of the document data (for example, Japanese, English, or Chinese) is specified from the attributes specified in the document data or the character code used in the document data.
  • Step 2: Using a dictionary for the specified language, morphological analysis is performed to cut out all words or phrases in the document data. Words that are not in the dictionary are extracted as unknown words. Thereby, for each document data, a list of the words or phrases it contains and the appearance frequency of each word or phrase is created.
  • Step 3: A union of the word lists of all document data is obtained using the above lists. Each word in this union becomes a dimension of the vector space model shown in FIG. 3A.
  • the vector space model has 100 million dimensions.
  • Each document data is plotted in the model as a point of the vector space model.
  • Step 4 Each document data is clustered based on the distance between the plotted points. Thereby, groups of document data within a certain range are clustered.
  • the clustering algorithm for example, LSI / LSA, LDA (Latent Dirichlet Allocation), or k-means can be used.
  • the document data is grouped into groups 1 to 4 (301 to 304) by clustering.
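Steps 1 to 4 above can be sketched with a minimal k-means as follows; the word lists, the initialization with two distinct documents, and the use of raw appearance frequencies are simplifying assumptions.

```python
from collections import Counter
import math

def doc_vectors(docs):
    """Map each document (a list of extracted words) to a point in the word
    space: one dimension per word in the union of all word lists."""
    vocab = sorted({w for d in docs for w in d})
    return [[Counter(d)[w] for w in vocab] for d in docs], vocab

def kmeans(points, centroids, iters=20):
    """Minimal k-means: assign each point to the nearest centroid by
    distance, recompute centroids as coordinate averages, and repeat."""
    k = len(centroids)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[i].append(p)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

docs = [["cell", "DNA"], ["DNA", "acid"], ["video", "MPEG"], ["MPEG", "video"]]
points, vocab = doc_vectors(docs)
# initialize with two distinct documents (a simplification of random seeding)
groups = kmeans(points, [points[0], points[-1]])
print([len(g) for g in groups])  # [2, 2]
```

The bio-related documents and the video-related documents land in separate groups because they share almost no common words, which is the intuition behind the sparsity argument in the text.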
  • FIG. 3B shows a conceptual diagram of a second mode of clustering of a plurality of document data.
  • the second mode is a method of clustering document data based on the appearance frequency of common words.
  • the appearance frequency of word 1 is the X axis
  • the appearance frequency of word 2 is the Y axis
  • the appearance frequency of word 3 is indicated by the arrow, and so on, up to the appearance frequency of word N on the N-axis (not shown). For example, when “cell”, “DNA”, and “acid” are cut out as words, “cell”, “DNA”, and “acid” are words I, J, and K, respectively (0 < I, J, K ≤ N).
  • each document data is plotted in the vector space model.
  • the document data is not necessarily divided by field.
  • the reason why clustering is possible using the second aspect is as follows. As described above, if a total of 100 million words or phrases can be extracted, for example, the vector space model has 100 million dimensions. However, when document data is plotted as dots in a 100 million-dimensional vector space, it becomes quite sparse. Since document data dealing with the same topic is likely to contain a common word, each document data dealing with the same topic is likely to be unevenly distributed in a certain space. For example, in bio-related document data, there are few words or phrases that are mentioned in relation to video technology.
  • the document data are grouped into group 1 (cell, DNA, acid, …), a set of bio-related document data (311); group 2 (video, video recording, MPEG, …), a set of document data related to moving image technology (312); group 3 (electron, transistor, charge, …), a set of electronics-related document data (313); and group 4 (liquid, fluid, valve, …), a set of document data related to control technology (314).
  • FIG. 3C shows a conceptual diagram of a third mode of clustering of a plurality of document data.
  • the third aspect is a method of clustering document data based on the appearance frequency of each word at the center of gravity of each cluster.
  • the appearance frequency of word 1 is the X axis
  • the appearance frequency of word 2 is the Y axis
  • the appearance frequency of word 3 is in the direction of the arrow, and so on, up to the appearance frequency of word N on the N-axis (not shown).
  • if a group (cluster) of points in the vector space model is regarded as a set of mass points with weight, a center of gravity exists. This centroid is the centroid of the cluster.
  • each point may have a uniform weight, or each point may be weighted using tf ⁇ idf.
  • the definition of the center of gravity is the average of the coordinates of each mass point.
  • document data is plotted in the vector space model.
  • how clustering is performed differs depending on what algorithm is used. As the algorithm, conventional methods known to those skilled in the art can be used.
  • in FIG. 3C, the document data is divided into group 1 (cell, DNA, acid, …) (321), group 2 (moving picture, video recording, MPEG, …) (322), group 3 (electron, transistor, charge, …) (323), and group 4 (liquid, fluid, valve, …) (324).
  • FIG. 4A shows a conceptual diagram for extracting the common set of document data from the set of document data containing the search keyword B and the set of document data containing the keyword A.
  • FIG. 4A shows all document data D (401), a set of document data (402) including the search keyword B, and a set of document data (403) including the keyword A.
  • a part of the document data set (402) including the search keyword B and a part of the document data set (403) including the keyword A are common (406).
  • the common part (406) is a common part of a set of document data including the search keyword B and a set of document data including the keyword A.
  • the part (404) of the circle (402) is a set of document data including the search keyword B and is a part not including the common part (406).
  • a portion (405) of the circle (403) is a set of document data including the keyword A and is a portion not including the common portion (406).
  • the correlation function F is used to know whether the number of document data in the common part (406) is larger (above 1) or smaller (below 1) than expected.
  • expressed using the symbols used in FIG. 4A, the correlation function F is obtained according to the following Formula 14 or Formula 15.
  • the fact that the correlation value obtained according to the correlation function F is larger than the expected value indicates that there is a correlation (causal relationship) between the search keyword B and the keyword A, and they are related to each other.
  • FIG. 4B shows the relationship between the document data dataset including the facet keyword and the document data dataset including the search keyword when the keyword A in FIG. 4A is the facet keyword A.
  • Facet count is a standard method for keyword search. Counting is counting the number of document data. The facet count will be described using a familiar example. For example, when a product name is input and searched on an Internet shopping site, the number of corresponding products is displayed by price range or manufacturer. As a result, it is possible to obtain knowledge such as how much the product is sold at a certain price or which manufacturer sells a large number of the product.
  • the number of appearances of each element (keyword) of a set of keywords A (hereinafter also referred to as “facet keyword A”) designated as facet keywords is counted with respect to a set of document data including a given search keyword B.
  • facet keyword A a facet keyword with respect to a set of document data including a given search keyword B
  • FIG. 4B shows, when the facet keyword A is changed variously, how often each element (keyword) of the changed facet keyword A is included in the set of document data including the search keyword B.
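The facet count described above can be sketched as follows; the document contents and keyword names are hypothetical.

```python
def facet_count(docs, search_keyword, facet_keywords):
    """For the set of document data containing the search keyword, count
    how many of them also contain each facet keyword."""
    hits = [d for d in docs if search_keyword in d]
    return {f: sum(1 for d in hits if f in d) for f in facet_keywords}

docs = [
    {"hologram", "laser"},
    {"hologram", "MPEG"},
    {"MPEG", "video"},
    {"hologram", "Fourier transform"},
]
print(facet_count(docs, "hologram", ["MPEG", "Fourier transform", "organic EL"]))
# {'MPEG': 1, 'Fourier transform': 1, 'organic EL': 0}
```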
  • the document data set (402) including the search keyword B in FIG. 4B is the same as the document data set (402) including the search keyword B in FIG. 4A.
  • a set of document data including facet keyword A is indicated by circles (403A, 403B, and 403C).
  • a set of document data (403A, 403B, and 403C) including facet keyword A in FIG. 4B corresponds to a set of document data (403) including keyword A in FIG. 4A.
  • FIG. 4B only three circles are shown for the sake of space. If there are N facet keywords (N is an integer), there are N circles (403A to 403N).
  • the facet keywords A are, for example, a1, a2, and a3 (…, an).
  • the circle (403A) is a set of document data including a1,
  • the circle (403B) is a set of document data including a2,
  • the circle (403C) is a set of document data including a3.
  • FIG. 4B shows that by changing the facet keyword A, the centers (407A, 407B, and 407C) of the circles (403A, 403B, and 403C) move.
  • a common part (406A, 406B and 406C) of the document data set (402) including the search keyword B and each document data set (403A, 403B and 403C) including the facet keyword A is obtained. You can see how it moves (on the right side of FIG. 4B).
  • the movement of the common part means that the document data in the common part and the number (appearance frequency) of the document data included in the common part change.
  • the correlation value between the facet keyword and the search keyword it is possible to extract a set of facet keywords having a strong correlation with the search keyword B.
  • FIG. 4C is a conceptual diagram for calculating a correlation value between the facet keyword A and the search keyword using the facet count of FIG. 4B and detecting a keyword having a strong correlation with the search keyword.
  • the facet keywords A1 to A4 are MPEG (411), Fourier transform (412), organic EL (413), and hologram (414), respectively.
  • the facet keyword is, for example, a special noun included in document data.
  • the horizontally long rectangles (411, 412, and 413) shown in FIG. 4C correspond to, for example, 403A, 403B, and 403C in FIG. 4B, respectively. In FIG. 4B, only a circle corresponding to 414 in FIG. 4C is not described.
  • a dotted arrow (415) indicates that the facet keyword A is changed with respect to the search keyword B.
  • the correlation value with the search keyword B in the query is obtained, for example, according to the correlation function corr_regular(s, t) of Expression 16 below, together with Expressions 17 to 19.
  • whether or not document data is included in the document data set (616) obtained as a result of the concept search is represented by a binary value of 0 or 1 (included / not included).
  • corr_regular(s, t) is the value obtained by dividing P_regular(s∧t) by P_regular(s) × P_regular(t).
  • P_regular(s) is obtained according to Equation 17.
  • P_regular(t) is obtained according to Equation 18.
  • P_regular(s∧t) is obtained according to Equation 19.
  • s is a facet keyword.
  • d is document data.
  • δ_s,d is 1 when the facet keyword s is included in the document data d, and 0 otherwise.
  • N is the total number of document data. Therefore, P_regular(s) is the value obtained by dividing the sum of δ_s,d over all document data d by the total number N of document data.
  • t is a search keyword.
  • δ_t,d is 1 when the search keyword t is included in the document data d, and 0 otherwise. Therefore, P_regular(t) is the value obtained by dividing the sum of δ_t,d over all document data d by the total number N of document data.
  • P_regular(s∧t) is the value obtained by dividing the number of document data that include both the search keyword t and the facet keyword s by the total number N of document data.
  • corr_regular(s, t) is statistically 1 when inclusion of the search keyword t and inclusion of the facet keyword s in document data are independent, and takes a value larger than 1 when there is a co-occurrence relationship between them.
  • when the facet keyword A and the search keyword B are applied to the above Expressions 16 to 19 to calculate corr_regular(s, t) while the facet keyword A is changed variously, a facet keyword with a large correlation value is a keyword having a strong correlation with the search keyword B.
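Expressions 16 to 19 can be sketched as follows for the binary (0/1) inclusion case; the documents and keywords are hypothetical.

```python
def corr_regular(docs, s, t):
    """corr_regular(s, t) = P(s∧t) / (P(s)·P(t)), where each probability is
    the fraction of documents whose inclusion indicator (1 if the keyword
    is in the document, else 0) is set."""
    n = len(docs)
    p_s = sum(1 for d in docs if s in d) / n
    p_t = sum(1 for d in docs if t in d) / n
    p_st = sum(1 for d in docs if s in d and t in d) / n
    if p_s == 0 or p_t == 0:
        return 0.0
    return p_st / (p_s * p_t)

docs = [
    {"hologram", "laser"},
    {"hologram", "Fourier transform"},
    {"MPEG", "video"},
    {"engine", "oil"},
]
# "Fourier transform" co-occurs with "hologram" more often than chance
# predicts, so the correlation value exceeds 1.
print(corr_regular(docs, "Fourier transform", "hologram"))  # 2.0
```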
  • FIG. 4D shows a flow for creating an index that is used to detect keywords that are highly correlated with the search keyword, using the facet count of FIG. 4B.
  • index creation is started.
  • an analysis processing unit (806 in FIG. 8A) or an indexing unit (807 in FIG. 8A) is used, but is not limited thereto.
  • the analysis processing unit reads the document data (431) from the storage device into the memory.
  • the analysis processing unit detects the language of the document data from the attributes specified in the document data or the character code used in the document data, using the dictionary (432), which includes the facet definition.
  • the analysis processing unit performs morphological analysis using the dictionary (432) for the specified language to detect all words or phrases in the document data. For example, when the string “Japan Patent Office” appears, the dictionary (432) is used to decompose it into “Japan” and “Patent Office”; without a dictionary, however, it is unclear where the word boundaries should be placed.
  • the facet definition exists so that only specified words of particular interest are defined as facet keywords, and other words can be ignored.
  • the indexing unit indexes the document data using the detected word or phrase as an index.
  • the index is stored in, for example, an index database (433).
  • the indexing unit may also store the weight of the detected word or phrase as metadata with the index.
  • steps 422 to 425 are repeated for all the document data (431), and the index creation is completed.
  • FIG. 4E shows a flow of correlation detection with a search keyword using the index created in FIG. 4D.
  • the search unit starts correlation detection.
  • the search unit receives a search keyword t input from a user or created by a computer and stores it in a memory or a storage device.
  • the search keyword is one or more words, one or more phrases, or a combination thereof (may be a long document), and is incorporated in, for example, SQL.
  • the search unit acquires a list A of all document data using the index in the index database (433). In the index, all words or phrases are arranged in a lexicographic manner, and a list of document data including the words or phrases can be obtained.
  • the list A includes, for example, an identifier (ID) for identifying document data and information on a location where the corresponding word or phrase appears in the document data.
  • ID an identifier
  • the length of the list A is known.
  • the length of the list A represents the number of all document data (431).
  • the search unit acquires a list B of document data including the search keyword t using the index in the index database (433).
  • the search unit obtains a list B[s] of document data including the facet keyword s for each facet keyword, using the index in the index database (433). All keywords defined as facets are used as facet keywords.
  • in step 446, the search unit calculates a correlation value for each facet keyword from the length of the list A, the length of the list B, and the length of the list B[s].
  • the correlation value is obtained, for example, according to the above equations 16 to 19. The higher the correlation value, the more strongly the keyword is correlated with the search keyword.
  • step 447 the search unit ends the correlation detection.
  • FIGS. 5A to 5D illustrate concept search.
  • in concept search, document vectors for the document data stored in a document database are prepared in advance. When a concept search is actually performed, the inner product between the document vector obtained by analyzing the search keyword input as a query and each of the document vectors prepared in advance is obtained. As a result, document data having a high correlation with the search keyword input as the query is extracted from the document database. The extracted document data is displayed as a list in descending order of correlation.
  • concept search has the following advantages: (1) text can be searched without creating a search expression; (2) there is no need to prepare a dictionary manually beforehand.
  • a data format called a vector is defined and used to process information about the meaning of a word or phrase.
  • This vector is also referred to as a document vector.
  • the document vector is indicated by an arrow in FIG. 5D, for example.
  • the document vector is defined, for example, in an N-dimensional vector space (531 in FIG. 5D), and the direction represents the meaning of a word or phrase.
  • N is an integer, and may be 100 million, for example.
  • a set of keywords is converted into a vector, so the search does not depend on the number of keywords.
  • FIG. 5A shows a concept search flow.
  • in concept search, document data related to a given word or phrase is found by the following procedure.
  • Steps 501 to 503 indicate creation of an index for concept retrieval.
  • Step 501 corresponds to step 101 in FIG. 1A.
  • the analysis processing unit (806 in FIG. 8A) performs natural language analysis of the document data (511), and extracts words or phrases from the document data.
  • Step 502 corresponds to step 102 in FIG. 1A.
  • the clustering unit clusters or classifies the document data (511). By the clustering or classification, for example, a score table of words (vertical axis) ⁇ clusters (horizontal axis) can be generated.
  • Step 503 corresponds to step 103 in FIG. 1A.
  • step 503 when clustering is used, the indexing unit (807 in FIG. 8A) obtains a score or probability that each of the document data (511) belongs to the cluster with reference to the score table.
  • the indexing unit obtains a score or probability that each of the document data (511) belongs to the class with reference to the score table.
  • the score or probability is hereinafter referred to as a first vector.
  • the score or probability may be appropriately converted to a value suitable for the first vector.
  • the first vector value can be converted to a real number between 0 and 1.
  • Steps 504 to 505 show a search by concept search.
  • the search server receives the query (515) input to the user terminal via, for example, the network.
  • the search server may obtain a related keyword associated with the search keyword in response to receiving the query.
  • With reference to the score table, the search server also obtains the score or probability with which the search keyword, or each related keyword associated with it, belongs to each cluster or class.
  • the score or probability is hereinafter referred to as a second vector.
  • the score or probability may be appropriately converted to a value suitable for the second vector.
  • the second vector value can be converted to a real number between 0 and 1.
  • In step 505, the inner product of the first vector for each document data item and the second vector for the search keyword is calculated.
  • the value of each inner product is set as a score for the query of each document data.
  • the document data is displayed as a search result (516) in the order of the score.
  • a threshold value may be provided for the score, and processing for excluding those below the threshold value from the search result (516) may be performed.
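Steps 504 and 505 above can be sketched as follows. This is a hedged illustration, not the specification's code: `concept_search`, the document identifiers, and the vector values are all invented, and the threshold handling follows the optional exclusion just described.

```python
def concept_search(doc_vectors, query_vector, threshold=0.0):
    """Score each document by the inner product of its first vector with the
    query's second vector, drop scores at or below the threshold, and return
    (doc_id, score) pairs in descending score order."""
    hits = []
    for doc_id, vec in doc_vectors.items():
        score = sum(a * b for a, b in zip(vec, query_vector))
        if score > threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

docs = {"d1": [0.2, 0.9], "d2": [0.8, 0.3], "d3": [0.1, 0.1]}
query = [0.9, 0.2]
print(concept_search(docs, query, threshold=0.2))  # d3 falls below the threshold
```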
  • FIG. 5B shows a specific example of document data search in the concept search of FIG. 5A.
  • document data is searched according to the following procedure.
  • Step 504 corresponds to step 104 in FIG. 1B.
  • the search server receives the query (115) input to the user terminal via, for example, the network.
  • the search unit may obtain a related keyword associated with the search keyword in response to receiving the query (115).
  • the search server calculates a vector (second vector) for the search keyword or a related keyword associated with the search keyword.
  • Step 505 corresponds to step 105 in FIG. 1B.
  • In step 505, the search unit obtains the inner product of the first vector of each document data item (523, 524) and the second vector from the search keyword or its related keywords (525, 526).
  • The calculated inner product value is regarded as the score of each document data item with respect to the search keyword.
  • In the example, the score for cluster 2 is higher than the score for cluster 1; the clusters are not sorted by score, and the scores indicate with what accuracy the document data belongs to each cluster.
  • Based on the scores, the search server (or the user terminal, when the search server also serves as the user terminal) displays the document data on the display device as search results in score order.
  • a threshold value may be provided for the score, and those below the value may be excluded from the display.
  • As the search result, for example, the document data are displayed as a list, preferably in descending order of score.
  • FIG. 5C shows a conceptual diagram of the vectors of the document data and the query, and of their inner products, in the concept search of FIG. 5A.
  • Each of the plurality of document data (531 and 532) includes a word or a phrase.
  • the vector (534) (first vector) from the document data 1 (531) has a score for each cluster. For example, in the case of k-means, the number of clusters is k (k is an integer). Each score of the vector (534) is, for example, 0.99 for cluster 1 and 0.82 for cluster 2.
  • the vector (535) (first vector) from the document data 2 (532) has a score for each cluster. The score of vector (535) is, for example, 0.72 for cluster 1 and 0.89 for cluster 2.
  • The score may be derived, for example, from the angle between the document vector and each cluster when the document data is arranged in the N-dimensional document space: the smaller the angle, the higher the score, and it can be defined, for example, as cos θ.
  • the vector (536) (second vector) from the query (533) has a score for each cluster. Each score of the vector (536) is, for example, 0.89 for cluster 1 and 0.76 for cluster 2.
  • a scalar 1 (537) is an inner product of the first vector (534) of the document data 1 (531) and the second vector (536) of the query (533).
  • Scalar 2 (538) is the inner product of the first vector (535) of document data 2 (532) and the second vector (536) of query (533).
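The figures quoted above can be checked by direct arithmetic (a sketch; the two-cluster vectors are exactly the example values from FIG. 5C):

```python
v_doc1 = (0.99, 0.82)   # first vector of document data 1 (531)
v_doc2 = (0.72, 0.89)   # first vector of document data 2 (532)
v_query = (0.89, 0.76)  # second vector of the query (533)

scalar1 = sum(a * b for a, b in zip(v_doc1, v_query))  # inner product (537)
scalar2 = sum(a * b for a, b in zip(v_doc2, v_query))  # inner product (538)
print(round(scalar1, 4), round(scalar2, 4))  # document data 1 ranks first
```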
  • FIG. 5D shows a conceptual diagram of a document vector in the vector space.
  • FIG. 5D shows a vector for each word belonging to one of four categories: biology, information processing engineering, electronic engineering, and mechanical engineering.
  • the degree of correlation between two document data is defined as an inner product of document vectors for each of the two document data. In the vector space, the closer the two document vectors are, that is, the higher the inner product value, the higher the correlation between the two documents.
  • FIG. 6A shows the concept of the search method according to the first aspect which is an embodiment of the present invention.
  • In the search method according to the first aspect, a correlation value is calculated from the set of document data obtained as a result of the concept search and the set of document data including facet keyword A, and facet keywords having a strong correlation with the search keyword are thereby detected.
  • FIG. 6A shows all document data D (601), a set of document data (602) obtained as a result of the concept search, and a set of document data including facet keywords A (603A to 603C).
  • a set of document data (602) obtained as a result of the concept search is a set of document data obtained as a result of the concept search using the search keyword B, and is document data having a high correlation with the search keyword B.
  • “(Score > S for B)” in FIG. 6A denotes the set of document data whose score for the search keyword B is larger than a predetermined value S.
  • the concept search using the search keyword B is performed by the method shown in FIGS. 5A to 5D.
  • the set of document data (602) is strongly related to the search keyword B.
  • the document data set (602) does not necessarily include the search keyword B, and conversely, the document data including the search keyword B is not necessarily included in the document data set (602).
  • Each set of document data including facet keyword A is indicated by a circle (603A, 603B, and 603C). For reasons of space, only three circles are drawn; if there are N facet keywords (N is an integer), there are N circles (603A to 603N).
  • The facet keywords A are, for example, a1, a2, a3, …, an.
  • The circle (603A) is the set of document data including a1, the circle (603B) is the set of document data including a2, and the circle (603C) is the set of document data including a3.
  • FIG. 6A shows that the center (607A, 607B, and 607C) of each circle (603A, 603B, and 603C) moves by changing the facet keyword A.
  • a common part (606A, 606B and 606C) of the set of document data (602) obtained as a result of the concept search and each set of document data (603A, 603B and 603C) including facet keyword A is obtained.
  • the movement of the common part means that the document data in the common part and the number of document data (appearance frequency) included in the common part change.
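The movement of the common part can be illustrated with plain set arithmetic (a sketch with invented document identifiers, not data from the specification):

```python
# Set 602: document data whose concept-search score for keyword B exceeds S.
concept_hits = {"d1", "d2", "d3", "d5", "d8"}

# Sets 603A-603C: document data containing each facet keyword.
facet_sets = {
    "a1": {"d1", "d2", "d9"},
    "a2": {"d3", "d4"},
    "a3": {"d6", "d7"},
}

# Appearance frequency of each facet keyword within the concept-search result:
# the size of each common part 606A-606C.
counts = {kw: len(docs & concept_hits) for kw, docs in facet_sets.items()}
print(counts)
```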
  • FIG. 6B is a conceptual diagram for detecting a facet keyword having a strong correlation with the search keyword by using the facet count.
  • the facet keyword A includes, for example, facet keywords 1 to n (n is an integer).
  • each of the facet keywords 1 to 4 is assumed to be MPEG (611), Fourier transform (612), organic EL (613), and hologram (614).
  • the facet keyword A is a special noun included in the document data regarding the moving image processing.
  • the horizontally long rectangles (611, 612, and 613) shown in FIG. 6B correspond to document data sets (603A, 603B, and 603C) including, for example, facet keyword A in FIG. 6A.
  • the document data set (603D) including the facet keyword 4 corresponding to the horizontally long rectangle (614) shown in FIG. 6B is not shown.
  • The dotted arrow (615) indicates that facet keyword A is changed through facet keywords 1 to 4 for the set of document data obtained as a result of the concept search; that is, it corresponds to the circle centers (607A to 607C) in FIG. 6A moving sequentially.
  • a rectangle (616) on each facet keyword A shown in FIG. 6B indicates a set of document data narrowed down by the concept search for the search keyword B.
  • a rectangle (616) corresponds to a set (602) of document data obtained as a result of the concept search of FIG. 6A.
  • The document data matching the concept obtained as a result of the concept search are narrowed down, for example, to those having a large inner product value in the concept search.
  • The correlation value with the search keyword B in the query for each facet keyword A, taking the degree of concept match into account, is obtained, for example, according to corr_concept(s, t) of Equation 20 below and Equations 21 to 25.
  • corr_concept(s, t) is the value obtained by dividing P_concept(s ∩ t) by P_concept(s) × P_concept(t).
  • P concept (s) is obtained according to Equation 22.
  • P concept (t) is obtained according to Equation 24.
  • s is a facet keyword.
  • k is the total number of clusters. k is an integer.
  • s1, s2, …, sk are the scores with which the facet keyword s belongs to each cluster.
  • S = (s1, s2, …, sk) is the vector definition of the facet keyword s.
  • P concept (s) is a continuous evaluation of the probability that a word that conceptually matches the facet keyword s appears in the document data.
  • d is document data.
  • N is the total number of document data.
  • ⟨S, d⟩ is the inner product of the vector of the facet keyword s and the score vector of the document data d.
  • P_concept(s) is the value obtained by dividing the sum, over all document data d, of the inner product of the facet keyword s's score vector and the document data d's score vector by the total number N of document data.
  • t is a search keyword.
  • t1, t2, …, tk are the scores with which the search keyword t belongs to each cluster.
  • T = (t1, t2, …, tk) is the vector definition of the search keyword t.
  • P_concept(t) is a continuous evaluation of the probability that words conceptually matching the search keyword appear in the document data.
  • ⟨T, d⟩ is the inner product of the score vector of the document data d and the vector of the search keyword t.
  • P_concept(t) is the value obtained by dividing the sum, over all document data d, of this inner product by the total number N of document data.
  • P_concept(s ∩ t) is the probability that a word conceptually matching the search keyword and a word conceptually matching the facet keyword s both appear.
  • the facet keyword obtained is a facet keyword having a strong correlation with the search keyword B.
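Under the definitions above, corr_concept can be sketched as follows. Equations 21 to 25 are not reproduced in this excerpt, so the joint term P_concept(s ∩ t) below (the mean of the per-document product of the two inner products) is an assumption chosen to be consistent with the surrounding description, and the function names are invented.

```python
def p_concept(word_vec, doc_vecs):
    """P_concept: the sum over all document data of the inner product with the
    word's cluster-score vector, divided by the total number N of documents."""
    n = len(doc_vecs)
    return sum(sum(a * b for a, b in zip(word_vec, d)) for d in doc_vecs) / n

def corr_concept(s_vec, t_vec, doc_vecs):
    """corr_concept(s, t) = P_concept(s ∩ t) / (P_concept(s) * P_concept(t)).
    The joint term multiplies the two per-document scores (an assumption)."""
    n = len(doc_vecs)
    p_st = sum(
        sum(a * b for a, b in zip(s_vec, d)) * sum(a * b for a, b in zip(t_vec, d))
        for d in doc_vecs
    ) / n
    return p_st / (p_concept(s_vec, doc_vecs) * p_concept(t_vec, doc_vecs))

# Two documents that belong cleanly to cluster 1 and cluster 2, respectively.
docs = [(1.0, 0.0), (0.0, 1.0)]
print(corr_concept((1.0, 0.0), (1.0, 0.0), docs))  # same cluster: correlated
print(corr_concept((1.0, 0.0), (0.0, 1.0), docs))  # different clusters
```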
  • FIG. 6C shows a flow for creating an index used for the search method of the first aspect of FIG. 6A.
  • In step 621, index creation is started.
  • In the following description, an analysis processing unit (806 in FIG. 8A) or an indexing unit (807 in FIG. 8A) is used, but the processing units are not limited to these.
  • the analysis processing unit reads the document data (631) from the storage device into the memory.
  • The analysis processing unit detects the language of the document data from an attribute specified in the document data or from the character code used in the document data, using the dictionary (632), which includes dictionary and facet definitions.
  • the analysis processing unit performs morphological analysis using the dictionary (632) for the specified language, and detects all words or phrases in the document data.
  • the indexing unit detects the cluster or class to which the document data belongs. The detected cluster or class information is stored in the cluster database (633) in association with the document data.
  • the indexing unit obtains a score (first vector) in which each document data belongs to a cluster or a class for each of all document data. The obtained score is stored in the document data score database (634).
  • the indexing unit indexes the document data using the detected word or phrase as an index. The index is stored, for example, in an index database (635).
  • the indexing unit also stores the score accompanying the index as meta information of the document data.
  • the indexing unit may further store the weight of the detected word or phrase accompanying the index as metadata.
  • In step 628, the above steps 622 to 627 are repeated for all the document data (631), and the index creation is completed.
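Steps 622 to 627 can be condensed into a sketch. Everything here is illustrative: `build_index`, the regex tokenizer standing in for morphological analysis, and `cluster_score` standing in for the word-by-cluster score table are invented names, and the first vector is computed as a simple average of word scores, which the specification does not prescribe.

```python
import re
from collections import defaultdict

def build_index(documents, cluster_score):
    """One pass over the document data: extract words, build an inverted
    index, and compute a first vector per document."""
    index = defaultdict(set)   # word -> document ids (index database 635)
    doc_scores = {}            # document id -> first vector (score database 634)
    for doc_id, text in documents.items():
        words = re.findall(r"\w+", text.lower())  # crude tokenizer stand-in
        for w in words:
            index[w].add(doc_id)
        vecs = [cluster_score(w) for w in words]
        k = len(vecs[0])
        doc_scores[doc_id] = tuple(
            sum(v[i] for v in vecs) / len(vecs) for i in range(k)
        )
    return index, doc_scores

# Toy score table: "video" belongs to cluster 1, everything else to cluster 2.
score_table = lambda w: (1.0, 0.0) if w == "video" else (0.0, 1.0)
index, doc_scores = build_index({"d1": "video codec", "d2": "gear shaft"}, score_table)
print(index["video"], doc_scores["d1"])
```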
  • In step 641, the search unit starts correlation detection.
  • In step 642, the search unit receives a search keyword t input by a user or created by a computer, and stores it in a memory or a storage device.
  • The search unit may extract related keywords associated with the search keyword in response to receiving the search keyword t.
  • The search keyword is included, for example, in the query.
  • In step 643, the search unit reads cluster information from the cluster database (633) and calculates a score (second vector) for each cluster for the search keyword t or the related keywords associated with it.
  • the score for each cluster is a score to which the search keyword or related keyword belongs to each cluster.
  • the search unit uses the index in the index database (635) to obtain a list A of all document data.
  • the list A includes, for example, an identifier (ID) for identifying document data and information on a location where the corresponding word or phrase appears in the document data.
  • By obtaining the list A, its length is known; the length of the list A represents the total number of document data items (631).
  • the retrieval unit reads the score for each cluster of document data on the list A from the document data score database (634).
  • the search unit obtains the degree of concept match as a score from the score for each cluster of the search keyword t and the cluster score of the corresponding document data. The score for the degree of concept match is used in step 647.
  • In step 647, the search unit adds to the search result list B (636) the document data whose concept match score obtained in step 646 is larger than a predetermined value.
  • In step 648, the search unit checks whether steps 645 to 647 have been applied to the whole document list A. If the end of the document list A has not been reached, the process returns to step 645; otherwise, it proceeds to step 649 (FIG. 6E).
  • In step 649, the search unit obtains, for each facet keyword s, a list C[s] of document data including the keyword s, using the index in the index database (635). All keywords defined as facets are used as facet keywords.
  • In step 650, the search unit calculates the correlation value for each facet keyword from the length of list A, the length of list B, the length of list C[s], and the length of the common part (intersection) of list B and list C[s].
  • The correlation value is obtained, for example, according to corr_concept(s, t) of Equation 20 and Equations 21 to 25.
  • In step 651, the search unit ends the correlation detection.
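Step 650 works purely from list lengths. Equations 16 to 19 (corr_regular) are not reproduced in this excerpt, so the sketch below assumes the same structure as Equation 20, with the probabilities estimated from the lengths of lists A, B, and C[s]; the function name is invented.

```python
def corr_from_lists(list_a, list_b, list_c):
    """Correlation value from list lengths:
    P(t) = |B| / |A|, P(s) = |C| / |A|, P(s ∩ t) = |B ∩ C| / |A|,
    corr = P(s ∩ t) / (P(s) * P(t))  (assumed corr_regular-style form)."""
    a, b, c = set(list_a), set(list_b), set(list_c)
    n = len(a)
    return (len(b & c) / n) / ((len(c) / n) * (len(b) / n))

# 1000 documents; 100 search hits; 10 contain the facet keyword, 5 of which
# are also hits. The facet keyword appears 5x more often than chance predicts.
A = range(1000)
B = range(100)
C = list(range(5)) + list(range(995, 1000))
print(corr_from_lists(A, B, C))
```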
  • FIG. 7A shows the concept of the search method according to the second aspect which is an embodiment of the present invention.
  • The search method of the second aspect is a hybrid in which the set of document data including the search keyword B is further combined with the search method of the first aspect of FIG. 6A. That is, the search method according to the second aspect calculates a correlation value from the set of document data obtained as a result of the concept search, the set of document data including facet keyword A, and the set of document data including the search keyword B, and thereby detects facet keywords having a strong correlation with the search keyword and with the document data obtained as a result of the concept search.
  • FIG. 7A shows all document data D (701), the set of document data including the search keyword B (702), the set of document data obtained as a result of the concept search (703), and the set of document data including facet keyword A (704).
  • a set (704) of document data including facet keyword A is represented by one circle (704).
  • As in FIG. 6A, the set of document data including facet keyword A is actually indicated by a plurality of circles according to the number of facet keywords A; that is, if there are N facet keywords, there are N circles (704A to 704N).
  • As the facet keyword A changes, the center (708A, 708B, and 708C) of each circle (704A, 704B, and 704C) moves (not shown).
  • the relationship between regions 1, 2 and 3 shown in FIG. 7A is shown in a Venn diagram.
  • the area 1 is obtained by removing the common part of the set (702), the set (703), and the set (704) from the common part of the set (702) and the set (704).
  • Region 2 is a common part of the set (702), the set (703), and the set (704).
  • the area 3 is obtained by removing the common part of the set (702), the set (703), and the set (704) from the common part of the set (703) and the set (704).
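The three regions above can be written directly as set operations (a sketch with invented document identifiers):

```python
B_set = {"d1", "d2", "d3", "d4"}       # 702: contains search keyword B
concept = {"d2", "d3", "d5", "d6"}     # 703: concept-search result
A_set = {"d3", "d4", "d6", "d7"}       # 704: contains facet keyword A

region2 = B_set & concept & A_set      # common part of all three sets
region1 = (B_set & A_set) - region2    # keyword hit but concept mismatch
region3 = (concept & A_set) - region2  # concept match without the keyword
print(region1, region2, region3)
```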
  • the correlation value of the common part (region 1, region 2 and region 3) of the set (702, 703 and 704) is obtained according to the correlation function corr total (s, t) expressed by the following equation (26).
  • a and n are adjustable parameters.
  • The weights of the contributions from regions 1, 2, and 3 are adjusted with the parameters a and n.
  • corr regular (s, t) is as described in Expression 16 and Expressions 17 to 19.
  • corr concept (s, t) is as described in Equation 20 and Equations 21 to 25.
  • In Equation 26, if the value of a is increased, the contribution from document data that only matches the concept and does not include the keyword becomes larger; if the value of n is increased, the contribution from document data whose concept does not match is suppressed.
  • Area 1 is a set of document data that includes facet keyword A and search keyword B but is not conceptually appropriate. That is, there is no correlation.
  • Region 2 includes facet keyword A and search keyword B, and is conceptually appropriate. That is, the correlation is strong.
  • Region 3 includes facet keyword A but does not include search keyword B, but is conceptually appropriate. That is, there is a correlation.
  • Area 1 is a data set of document data that includes the search keyword B but does not match conceptually.
  • A document data set that does not match conceptually contains few or no words and phrases related to the search keyword. Area 1 therefore includes document data sets that are not preferable as analysis targets.
  • the area 2 includes the search keyword B and matches conceptually.
  • Conceptually matching means including many words or phrases related to the search keyword.
  • region 2 contains a highly desirable document data dataset.
  • Area 3 is a data set of document data that does not include the search keyword B but conceptually matches.
  • Conceptually matching means that many words or phrases related to the search keyword are included as described above.
  • region 3 includes a data set of desirable document data.
  • The flow for creating the index used for the search method of the second aspect of FIG. 7A is basically the same as the flow of FIG. 6C for the first aspect, so its description is omitted here.
  • In step 731, the search unit starts correlation detection.
  • In step 732, the search unit receives a search keyword t input by a user or created by a computer, and stores it in a memory or a storage device.
  • the search unit may obtain a related keyword associated with the search keyword t.
  • the search keyword is included in the query, for example.
  • In step 733, the search unit uses the index in the index database (635) to obtain a list A of all document data. By obtaining the list A, its length is known; the length of the list A represents the total number of document data items (631).
  • the search unit obtains a list B of document data including the search keyword t using the index in the index database (635). By obtaining the list B, the length of the list B is known. The length of the list B represents the number of document data including the search keyword t.
  • the search unit obtains a list C [s] of document data including the keyword s for each facet keyword using the index in the index database (635).
  • the search server reads the cluster information from the cluster database (633), and calculates a score (second vector) for each cluster for the search keyword t or a related keyword associated with the search keyword.
  • the retrieval unit reads the score for each cluster of document data on the list A from the document data score database (634).
  • The inner products ⟨T, d⟩ and ⟨S, d⟩ are obtained for the search keyword t and the facet keyword s with respect to each document data item d.
  • The definitions of the vectors are expressed by Equations 27 to 29 below.
  • t1, t2, …, tk are the scores with which t belongs to each cluster.
  • In step 739, the search unit checks whether steps 737 to 738 have been applied to the whole document list A. If the end of the document list A has not been reached, the search unit returns to step 737; otherwise, it proceeds to step 740 (FIG. 7C).
  • In step 740, the search unit calculates the correlation value for each facet keyword from the accumulated inner products.
  • In step 741, the search unit ends the correlation detection.
  • FIG. 8A shows a system diagram of a computer including a document data processing unit and an indexing unit according to an embodiment of the present invention.
  • A system according to an embodiment of the present invention includes a computer (801) for index creation (hereinafter also referred to as the “index creation computer”) and one or more servers (802a to 802n) connected to the index creation computer (801) via a network.
  • the index creation computer (801) includes a crawler (805), a document data processing unit (806), an indexing unit (807), a cache (808), and a thumbnail processing unit (809).
  • the crawler (805) collects document data (810), for example, a Web page, from each server (802a to 802n).
  • the crawler (805) is also called a robot or spider.
  • the crawler (805) stores the collected document data (810) in, for example, a storage device (not shown).
  • the crawler also stores document data (810) in the cache (808).
  • the document data processing unit (806) includes an analysis processing unit and a clustering unit.
  • the analysis processing unit processes natural language analysis.
  • the clustering unit clusters or classifies the document data.
  • the indexing unit (807) creates a text index, a facet index, and a thumbnail index of the document data (810). Each of these indexes is stored in an index database (835). Each index is used by the search runtime (811).
  • the search runtime may be on the indexing computer (801) or on another server. If the search runtime (811) is on another server, the index database (835) is copied on the other server. Alternatively, the index database may be located on a shared disk in the storage area network (SAN) so that the indexing computer (801) and other servers can access the index database (835) from both.
  • The indexing unit (807) also stores cluster information, document data scores (first vectors), and index data in a cluster database (833), a document data score database (834), and an index database (835), respectively.
  • the thumbnail processing unit (809) creates a thumbnail for displaying the document data as an icon on the screen based on the metadata of the document data (810) stored in the cache.
  • the metadata is data for specifying the type and content of a document, for example.
  • the search server (803) receives the query from the user terminal (804), searches the document data (810), and transmits the search result to the user terminal (804).
  • FIG. 8B shows a system diagram of a search server including a search unit according to an embodiment of the present invention.
  • the search server (803) includes a search unit (821).
  • the search server (803) also includes a search result display unit (823) when the search server (803) also has a user terminal.
  • the search server (803) transmits the search result to the user terminal (804)
  • the search server (803) includes a search result transmission unit (822).
  • the retrieval unit (821) retrieves document data using the cluster information, the score of the document data, and the index data from the cluster database (833), the document data score database (834), and the index database (835), respectively. I do.
  • the search unit (821) also stores document data having a score of the degree of concept match greater than a predetermined value in the storage device (836) for the search result list.
  • FIG. 9 shows a block diagram of computer hardware included in each of the systems shown in FIGS. 8A and 8B in the embodiment of the present invention.
  • the computer (901) includes a CPU (902) and a main memory (903), which are connected to a bus (904).
  • The CPU (902) is preferably based on a 32-bit or 64-bit architecture; for example, Intel's Xeon(TM), Core(TM), Atom(TM), Pentium(R), and Celeron(TM) series, or AMD's Phenom(TM), Athlon(TM), Turion(TM), and Sempron(TM) series can be used.
  • a display (906) such as a TFT monitor is connected to the bus (904) via a display controller (905).
  • The display (906) is used to display, via a graphic interface, information about a computer system connected to the network through a communication line and information about software running on that computer system, for the purpose of managing the computer system.
  • a hard disk or silicon disk (908) and a CD-ROM, DVD drive or BD drive (909) are also connected to the bus (904) via an IDE or S-ATA controller (907).
  • the hard disk (908) stores an operating system, application programs and data so that they can be loaded into the main memory (903).
  • the CD-ROM, DVD or BD drive (909) is used for additionally introducing a program from the CD-ROM, DVD-ROM or BD to the hard disk or silicon disk (908) as necessary. Further, a keyboard (911) and a mouse (912) are connected to the bus (904) via a keyboard / mouse controller (910).
  • the communication interface (914) follows, for example, the Ethernet (registered trademark) protocol.
  • The communication interface (914) is connected to the bus (904) via the communication controller (913), plays the role of physically connecting the computer (901) to the communication line (915), and provides a network interface layer to the TCP/IP communication protocol of the communication function of the computer system's operating system.
  • the communication line may be a wired LAN environment or a wireless LAN environment based on a wireless LAN connection standard such as IEEE802.11a / b / g / n.
  • FIG. 10 shows a keyword search method and search results according to the search method of the first aspect of the present invention and the search method of the second aspect of the present invention.
  • the document data consists of 200,000 cases.
  • The number displayed in association with each detected keyword is the correlation value obtained according to the correlation function; it measures how many times more often the keyword appears than its expected value. For example, suppose there are 1 million pieces of document data in total and the keyword appears in only one of them. If, after the set of document data is narrowed down by the search, the keyword appears once in every 1,000 items, the correlation value is 1000. The correlation value can thus reach into the hundreds or thousands.
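The arithmetic behind that example is simply a ratio of observed to expected frequency (a sketch; the counts are the ones from the example):

```python
total_docs = 1_000_000    # all document data
hits_in_total = 1         # the keyword appears only once overall
narrowed_docs = 1_000     # size of the set narrowed down by the search
hits_in_narrowed = 1      # the keyword appears once in the narrowed set

expected_rate = hits_in_total / total_docs
observed_rate = hits_in_narrowed / narrowed_docs
correlation = observed_rate / expected_rate
print(correlation)  # roughly 1000
```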
  • Table (1001) shows the search results obtained by the keyword search method which is the prior art.
  • a search result by the search method of the first aspect of the present invention is shown in a table (1002).
  • Table (1003) shows the search results obtained by the search method according to the second aspect of the present invention.
  • In the keyword search method (1001), when the single search keyword “video” is used on the 200,000 document data items, 3675 hits are found (1001A); when the search keywords are “video, recording”, 693 hits are obtained (1001B); and when the search keywords are “video, recording, high density”, there are only 12 hits (1001C).
  • As keywords are added, the number of document data items retrieved drops sharply, and document data sufficient for analysis cannot be obtained.
  • In the search method of the first aspect of the present invention, when the single search keyword “video” is used, 21401 hits are obtained (1002A), and when the search keywords are “video, recording, high density”, 11004 hits are obtained (1002B).
  • the keyword search and the result of the first aspect (concept search) of the present invention are combined.
  • the search result strongly reflects the result of the keyword search (1003A).
  • the result of the first aspect is strongly reflected (1003B).
  • correlations can be appropriately found in a wider context than the keyword search method.
  • In the search method of the first aspect, a set of document data is cut out according to a threshold on the concept match score, so whether a document belongs to the set is a clear 0 (false) or 1 (true), and the number of document data items (hit count) belonging to the set is therefore determinate.
  • In the search method of the second aspect, this boundary is not clearly defined: the contribution from every document data item is a real number between 0 and 1 that affects the correlation value, so the number of document data items (hit count) belonging to the set is ambiguous. The number of document data items belonging to the set is therefore indicated as “(all)” in the search results of the second aspect (1003A and 1003B).


Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-279619 2009-12-09
JP2009279619 2009-12-09

Publications (1)

Publication Number Publication Date
WO2011070832A1 true WO2011070832A1 (ja) 2011-06-16


Country Status (7)

Country Link
US (2) US8380714B2 (ko)
JP (1) JP5448105B2 (ko)
KR (1) KR101419623B1 (ko)
CN (1) CN102640152B (ko)
DE (1) DE112010004087T5 (ko)
GB (1) GB2488925A (ko)
WO (1) WO2011070832A1 (ko)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10372731B1 (en) 2015-11-25 2019-08-06 Minereye Ltd. Method of generating a data object identifier and system thereof
WO2019244276A1 (ja) * 2018-06-20 2019-12-26 楽天株式会社 検索システム、検索方法、及びプログラム
WO2019244277A1 (ja) * 2018-06-20 2019-12-26 楽天株式会社 検索システム、検索方法、及びプログラム
US10922271B2 (en) 2018-10-08 2021-02-16 Minereye Ltd. Methods and systems for clustering files
WO2021065058A1 (ja) * 2019-09-30 2021-04-08 沖電気工業株式会社 概念構造抽出装置、記憶媒体及び方法
US11256821B2 (en) 2015-10-14 2022-02-22 Minereye Ltd. Method of identifying and tracking sensitive data and system thereof

Families Citing this family (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539359B2 (en) * 2009-02-11 2013-09-17 Jeffrey A. Rapaport Social network driven indexing system for instantly clustering people with concurrent focus on same topic into on-topic chat rooms and/or for generating on-topic search results tailored to user preferences regarding topic
DE112010004087T5 (de) 2009-12-09 2012-10-18 International Business Machines Corporation Verfahren, Computersystem und Computerprogramm zum Durchsuchen von Dokumentdaten unter Verwendung eines Suchbegriffs
US20120042263A1 (en) 2010-08-10 2012-02-16 Seymour Rapaport Social-topical adaptive networking (stan) system allowing for cooperative inter-coupling with external social networking systems and other content sources
KR101950529B1 (ko) * 2011-02-24 2019-02-20 렉시스넥시스, 어 디비젼 오브 리드 엘서비어 인크. 전자 문서를 검색하는 방법 및 전자 문서 검색을 그래픽적으로 나타내는 방법
US8676937B2 (en) * 2011-05-12 2014-03-18 Jeffrey Alan Rapaport Social-topical adaptive networking (STAN) system allowing for group based contextual transaction offers and acceptances and hot topic watchdogging
US9298816B2 (en) 2011-07-22 2016-03-29 Open Text S.A. Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation
US8832057B2 (en) * 2011-12-02 2014-09-09 Yahoo! Inc. Results returned for list-seeking queries
TWI467411B (zh) * 2011-12-20 2015-01-01 Ind Tech Res Inst 文件處理方法與系統
US9672493B2 (en) * 2012-01-19 2017-06-06 International Business Machines Corporation Systems and methods for detecting and managing recurring electronic communications
US20130212093A1 (en) 2012-02-15 2013-08-15 International Business Machines Corporation Generating visualizations of a display group of tags representing content instances in objects satisfying a search criteria
JP5567049B2 (ja) * 2012-02-29 2014-08-06 株式会社Ubic 文書分別システム及び文書分別方法並びに文書分別プログラム
CN102662952B (zh) * 2012-03-02 2015-04-15 成都康赛信息技术有限公司 一种基于层次的中文文本并行数据挖掘方法
HUE030528T2 (en) * 2012-03-15 2017-05-29 Cortical Io Gmbh Process, equipment and product for semantic processing of texts
US8832108B1 (en) * 2012-03-28 2014-09-09 Emc Corporation Method and system for classifying documents that have different scales
US9069768B1 (en) 2012-03-28 2015-06-30 Emc Corporation Method and system for creating subgroups of documents using optical character recognition data
US8843494B1 (en) 2012-03-28 2014-09-23 Emc Corporation Method and system for using keywords to merge document clusters
US9396540B1 (en) 2012-03-28 2016-07-19 Emc Corporation Method and system for identifying anchors for fields using optical character recognition data
US9360982B2 (en) 2012-05-01 2016-06-07 International Business Machines Corporation Generating visualizations of facet values for facets defined over a collection of objects
US20130311409A1 (en) * 2012-05-18 2013-11-21 Veetle, Inc. Web-Based Education System
US9304738B1 (en) * 2012-06-14 2016-04-05 Goolge Inc. Systems and methods for selecting content using weighted terms
US9442959B2 (en) * 2012-06-28 2016-09-13 Adobe Systems Incorporated Image search refinement using facets
CN103870973B (zh) * 2012-12-13 2017-12-19 阿里巴巴集团控股有限公司 基于电子信息的关键词提取的信息推送、搜索方法及装置
CN103902535B (zh) * 2012-12-24 2019-02-22 腾讯科技(深圳)有限公司 获取联想词的方法、装置及系统
US8983930B2 (en) * 2013-03-11 2015-03-17 Wal-Mart Stores, Inc. Facet group ranking for search results
US9165053B2 (en) * 2013-03-15 2015-10-20 Xerox Corporation Multi-source contextual information item grouping for document analysis
CN103324664B (zh) * 2013-04-27 2016-08-10 国家电网公司 一种基于傅里叶变换的文档相似判别方法
WO2015016133A1 (ja) * 2013-07-30 2015-02-05 日本電信電話株式会社 情報管理装置及び情報管理方法
US9588978B2 (en) 2013-09-30 2017-03-07 International Business Machines Corporation Merging metadata for database storage regions based on overlapping range values
US9451329B2 (en) * 2013-10-08 2016-09-20 Spotify Ab Systems, methods, and computer program products for providing contextually-aware video recommendation
US8792909B1 (en) * 2013-12-04 2014-07-29 4 Info, Inc. Systems and methods for statistically associating mobile devices to households
CN103744835B (zh) * 2014-01-02 2016-12-07 上海大学 一种基于主题模型的文本关键词提取方法
US9519687B2 (en) 2014-06-16 2016-12-13 International Business Machines Corporation Minimizing index maintenance costs for database storage regions using hybrid zone maps and indices
WO2016086159A2 (en) * 2014-11-26 2016-06-02 Vobis, Inc. Systems and methods to determine and utilize conceptual relatedness between natural language sources
US10042887B2 (en) 2014-12-05 2018-08-07 International Business Machines Corporation Query optimization with zone map selectivity modeling
US11182350B2 (en) 2014-12-09 2021-11-23 International Business Machines Corporation Intelligent XML file fragmentation
US9928232B2 (en) * 2015-02-27 2018-03-27 Microsoft Technology Licensing, Llc Topically aware word suggestions
CN104765862A (zh) * 2015-04-22 2015-07-08 百度在线网络技术(北京)有限公司 文档检索的方法和装置
US10102272B2 (en) * 2015-07-12 2018-10-16 Aravind Musuluri System and method for ranking documents
US9589237B1 (en) 2015-11-17 2017-03-07 Spotify Ab Systems, methods and computer products for recommending media suitable for a designated activity
US10049208B2 (en) * 2015-12-03 2018-08-14 Bank Of America Corporation Intrusion assessment system
TWI571756B (zh) 2015-12-11 2017-02-21 財團法人工業技術研究院 用以分析瀏覽記錄及其文件之方法及其系統
US9875258B1 (en) * 2015-12-17 2018-01-23 A9.Com, Inc. Generating search strings and refinements from an image
CN105654125A (zh) * 2015-12-29 2016-06-08 山东大学 一种视频相似度的计算方法
US10860646B2 (en) 2016-08-18 2020-12-08 Spotify Ab Systems, methods, and computer-readable products for track selection
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US9715495B1 (en) 2016-12-15 2017-07-25 Quid, Inc. Topic-influenced document relationship graphs
JP6738436B2 (ja) * 2016-12-20 2020-08-12 日本電信電話株式会社 音声認識結果リランキング装置、音声認識結果リランキング方法、プログラム
KR101880275B1 (ko) 2017-01-09 2018-08-16 김선중 생물학적 체계 정보 검색 시스템 및 방법
KR101991923B1 (ko) * 2017-03-21 2019-06-21 김선중 키워드 계층 구조를 이용한 생물학적 체계정보 검색 장치 및 방법
US10585928B2 (en) * 2017-04-13 2020-03-10 International Business Machines Corporation Large scale facet counting on sliced counting lists
US11074280B2 (en) * 2017-05-18 2021-07-27 Aiqudo, Inc Cluster based search and recommendation method to rapidly on-board commands in personal assistants
CN108280225B (zh) * 2018-02-12 2021-05-28 北京吉高软件有限公司 一种语义检索方法及检索系统
US20200073890A1 (en) * 2018-08-22 2020-03-05 Three10 Solutions, Inc. Intelligent search platforms
CN110032639B (zh) 2018-12-27 2023-10-31 中国银联股份有限公司 将语义文本数据与标签匹配的方法、装置及存储介质
JP7192507B2 (ja) * 2019-01-09 2022-12-20 富士フイルムビジネスイノベーション株式会社 情報処理装置、及び情報処理プログラム
JP7343311B2 (ja) * 2019-06-11 2023-09-12 ファナック株式会社 文書検索装置及び文書検索方法
TWI715236B (zh) * 2019-10-04 2021-01-01 中華電信股份有限公司 語音主題分類之系統與方法
KR102300352B1 (ko) * 2019-10-14 2021-09-09 (주)디앤아이파비스 중요도 스코어를 바탕으로 특허문서의 유사도를 판단하기 위한 방법, 장치 및 시스템
KR102383965B1 (ko) * 2019-10-14 2022-05-11 (주)디앤아이파비스 유사도 점수 및 비유사도 점수를 바탕으로 특허문서의 유사도를 판단하기 위한 방법, 장치 및 시스템
KR102085217B1 (ko) * 2019-10-14 2020-03-04 (주)디앤아이파비스 특허문서의 유사도 판단 방법, 장치 및 시스템
WO2021146694A1 (en) * 2020-01-17 2021-07-22 nference, inc. Systems and methods for mapping a term to a vector representation in a semantic space
US11475222B2 (en) 2020-02-21 2022-10-18 International Business Machines Corporation Automatically extending a domain taxonomy to the level of granularity present in glossaries in documents
US11436413B2 (en) * 2020-02-28 2022-09-06 Intuit Inc. Modified machine learning model and method for coherent key phrase extraction
US11531708B2 (en) * 2020-06-09 2022-12-20 International Business Machines Corporation System and method for question answering with derived glossary clusters
US11734626B2 (en) * 2020-07-06 2023-08-22 International Business Machines Corporation Cognitive analysis of a project description
CN112100213B (zh) * 2020-09-07 2022-10-21 中国人民解放军海军工程大学 船舶设备技术数据搜索排序方法
US20220121881A1 (en) * 2020-10-19 2022-04-21 Fulcrum Global Technologies Inc. Systems and methods for enabling relevant data to be extracted from a plurality of documents
US11734318B1 (en) 2021-11-08 2023-08-22 Servicenow, Inc. Superindexing systems and methods

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02235176A (ja) * 1989-03-09 1990-09-18 Ricoh Co Ltd 概念検索装置
JP2004287781A (ja) * 2003-03-20 2004-10-14 Mitsubishi Electric Corp 重要度算出装置
JP2005050135A (ja) * 2003-07-29 2005-02-24 Nippon Telegr & Teleph Corp <Ntt> 情報検索システムおよび情報検索方法と、プログラムおよび記録媒体

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001092831A (ja) 1999-09-21 2001-04-06 Toshiba Corp 文書検索装置及び文書検索方法
EP1156430A2 (en) * 2000-05-17 2001-11-21 Matsushita Electric Industrial Co., Ltd. Information retrieval system
JP3573688B2 (ja) * 2000-06-28 2004-10-06 松下電器産業株式会社 類似文書検索装置及び関連キーワード抽出装置
JP4342575B2 (ja) * 2007-06-25 2009-10-14 株式会社東芝 キーワード提示のための装置、方法、及びプログラム
US8024324B2 (en) * 2008-06-30 2011-09-20 International Business Machines Corporation Information retrieval with unified search using multiple facets
DE112010004087T5 (de) 2009-12-09 2012-10-18 International Business Machines Corporation Verfahren, Computersystem und Computerprogramm zum Durchsuchen von Dokumentdaten unter Verwendung eines Suchbegriffs

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02235176A (ja) * 1989-03-09 1990-09-18 Ricoh Co Ltd 概念検索装置
JP2004287781A (ja) * 2003-03-20 2004-10-14 Mitsubishi Electric Corp 重要度算出装置
JP2005050135A (ja) * 2003-07-29 2005-02-24 Nippon Telegr & Teleph Corp <Ntt> 情報検索システムおよび情報検索方法と、プログラムおよび記録媒体

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROMITSU NISHIZAKI ET AL.: "A Retrieval Method of Broadcast News Using Voice input Keywords", IEICE TECHNICAL REPORT, vol. 99, no. 523, 20 December 1999 (1999-12-20), pages 91 - 96 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11256821B2 (en) 2015-10-14 2022-02-22 Minereye Ltd. Method of identifying and tracking sensitive data and system thereof
US10372731B1 (en) 2015-11-25 2019-08-06 Minereye Ltd. Method of generating a data object identifier and system thereof
WO2019244276A1 (ja) * 2018-06-20 2019-12-26 楽天株式会社 検索システム、検索方法、及びプログラム
WO2019244277A1 (ja) * 2018-06-20 2019-12-26 楽天株式会社 検索システム、検索方法、及びプログラム
JP6637221B1 (ja) * 2018-06-20 2020-01-29 楽天株式会社 検索システム、検索方法、及びプログラム
JP6639743B1 (ja) * 2018-06-20 2020-02-05 楽天株式会社 検索システム、検索方法、及びプログラム
US11899722B2 (en) 2018-06-20 2024-02-13 Rakuten Group, Inc. Search system, search method, and program
US10922271B2 (en) 2018-10-08 2021-02-16 Minereye Ltd. Methods and systems for clustering files
WO2021065058A1 (ja) * 2019-09-30 2021-04-08 沖電気工業株式会社 概念構造抽出装置、記憶媒体及び方法

Also Published As

Publication number Publication date
CN102640152A (zh) 2012-08-15
JP5448105B2 (ja) 2014-03-19
JPWO2011070832A1 (ja) 2013-04-22
US20120330977A1 (en) 2012-12-27
CN102640152B (zh) 2014-10-15
KR20120113736A (ko) 2012-10-15
US8380714B2 (en) 2013-02-19
DE112010004087T5 (de) 2012-10-18
US20110137921A1 (en) 2011-06-09
US9122747B2 (en) 2015-09-01
GB201209093D0 (en) 2012-07-04
GB2488925A (en) 2012-09-12
GB2488925A9 (en) 2016-05-11
KR101419623B1 (ko) 2014-07-15

Similar Documents

Publication Publication Date Title
JP5448105B2 (ja) 検索キーワードから文書データを検索する方法、並びにそのコンピュータ・システム及びコンピュータ・プログラム
JP5284990B2 (ja) キーワードの時系列解析のための処理方法、並びにその処理システム及びコンピュータ・プログラム
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
US9589208B2 (en) Retrieval of similar images to a query image
Ceri et al. Web information retrieval
KR101203345B1 (ko) 요약을 이용하여 디스플레이 페이지를 분류하는 방법 및시스템
EP1426882A2 (en) Information storage and retrieval
EP1508105A2 (en) System and method for automatically discovering a hierarchy of concepts from a corpus of documents
EP2255299A2 (en) Method and subsystem for information acquisition and aggregation to facilitate ontology and language-model generation within a content-search-service system
CN106599072B (zh) 一种文本聚类方法及装置
KR20160149050A (ko) 텍스트 마이닝을 활용한 순수 기업 선정 장치 및 방법
KR20180129001A (ko) 다언어 특질 투영된 개체 공간 기반 개체 요약본 생성 방법 및 시스템
Sun et al. Identifying, indexing, and ranking chemical formulae and chemical names in digital documents
Rousseau Graph-of-words: mining and retrieving text with networks of features
CN113516202A (zh) Cbl特征提取与去噪的网页精准分类方法
Shah Review of indexing techniques applied in information retrieval
CN116414939B (zh) 基于多维度数据的文章生成方法
Makkonen Semantic classes in topic detection and tracking
El Barbary et al. Egyptian Informatics Journal
WO2014003543A1 (en) Method, system and computer program for generating a query representation of a document, and querying a document retrieval system using said query representation
Kumar et al. A Comprehensive Assessment of Modern Information Retrieval Tools
Aparicio-Carrasco Unsupervised classification of text documents
Pettersson Contextual Advertising Online

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080054742.2

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10835754

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011545111

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 1209093

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20100910

WWE Wipo information: entry into national phase

Ref document number: 1209093.2

Country of ref document: GB

WWE Wipo information: entry into national phase

Ref document number: 112010004087

Country of ref document: DE

Ref document number: 1120100040877

Country of ref document: DE

ENP Entry into the national phase

Ref document number: 20127016208

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 10835754

Country of ref document: EP

Kind code of ref document: A1