WO2003046765A1 - Method for automatically extracting related words - Google Patents

Method for automatically extracting related words

Info

Publication number
WO2003046765A1
WO2003046765A1 PCT/JP2002/012504 JP0212504W WO03046765A1 WO 2003046765 A1 WO2003046765 A1 WO 2003046765A1 JP 0212504 W JP0212504 W JP 0212504W WO 03046765 A1 WO03046765 A1 WO 03046765A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
word
important
database
list
Prior art date
Application number
PCT/JP2002/012504
Other languages
French (fr)
Japanese (ja)
Inventor
Genichiro Sueki
Hiroaki Fujiki
Naoko Yoshino
Kazuko Adachi
Original Assignee
Mitsubishi Space Software Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Space Software Co., Ltd.
Publication of WO2003046765A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Definitions

  • the present invention relates to an automatic related word extraction method for automatically extracting words closely related to a word specified by a user based on statistical information of words included in a database.
  • the present invention relates to an automatic related word extraction method, and a related word automatic extraction device, that enable extraction of technical terms appearing in a specific field designated by a user, as well as new words and buzzwords, that are not listed in general existing thesaurus dictionaries.
  • the conventional related word automatic extraction device has an existing thesaurus dictionary as an internal component; it simply looks up the word specified by the user in that thesaurus dictionary and displays the result as the related word extraction result.
  • conventional related word automatic extraction devices have the disadvantage that technical terms, new words, and buzzwords that are not listed in existing thesaurus dictionaries cannot be extracted, regardless of their importance.
  • even among methods that extract related words automatically from the statistical information of a database without using an existing thesaurus dictionary, the conventional methods generally use only, for example, the occurrence frequency of words that appear individually.
  • the present invention has been made to solve the above-mentioned problems of the prior art. Its purpose is to provide a related word automatic extraction method and a related word automatic extraction device that can automatically extract technical terms appearing in a specific field specified by a user, as well as new words and buzzwords, that are not listed in a general existing thesaurus dictionary, and that can extract with high accuracy the important words closely related to the words specified by the user.
  • the first invention is characterized by using an automatic related word extraction method that uses a group of documents in a field designated by a user as a database, selects important words, that is, words of high importance, from the documents in the database, and calculates the degree of relevance between important words using statistical information on the important words or on pairs of important words.
  • the importance refers to the degree to which a word represents the characteristics of the content of the document, or the characteristics of the genre of the document.
  • in the second invention, when document groups of a plurality of fields are stored in the database, the related words of each field can be extracted automatically.
  • related words specific to a field can be extracted, for example a word that is a related word in one field but not in another field.
  • users can set their own fields regardless of the fields of existing thesaurus dictionaries, so related words can be extracted according to the level of the field that was set.
  • in the third invention, in addition to the configuration of the first or second invention, the database can be updated and added to at any time, and the differential data is reflected successively when related words are extracted automatically.
  • in a fourth aspect of the present invention, in addition to the configuration of any one of claims 1 to 3, the document group in the database is one in which document header information has been used to determine whether documents are the same, and in which, when a plurality of the same documents were included, one document was kept and the other same documents were removed.
  • the important words are compound words created from the morphemes obtained by dividing the documents in the database by part of speech.
  • the important words are words of a part of speech that is expected to represent the characteristics of each document in the database.
  • the words to be excluded from the important words are held as an exclusion list, and the words in the exclusion list are excluded from the important words after the important words are extracted.
  • important words having the same meaning are held as a same word list, and when the important words are extracted, the statistical information on the words in the same word list is stored together. According to this, in addition to the effect of any one of claims 1 to 7, it is possible to improve the extraction accuracy of important words.
  • the statistical information is the total number of occurrences of an important word in the database and the proportion of documents in the database that contain the important word.
  • the statistical information uses, in addition to the frequency with which an important word contained in a document in the database occurs on its own, the frequency with which a plurality of important words occur within a certain range.
  • the meaning can be more accurately determined by a plurality of pairs of important words, and as a result, related word extraction accuracy can be improved.
  • surface expressions included in the documents in the database are extracted automatically, and the upper/lower hierarchical relationship of important words constructed automatically from the surface expressions is used. According to this, in addition to the effect of claim 9, noise caused by a plurality of unrelated important words that happen to appear together can be removed, and the extraction accuracy can be improved.
  • a plurality of different search condition expressions are created, and the plurality of different search condition expressions are generated.
  • the device comprises a database unit that stores a document group in a field designated by a user; an important word analysis unit that extracts and selects important words contained in the database unit; a counting unit that obtains statistical information on the important words selected by the important word analysis unit and information on the hierarchical relationship of the important words; and a related word extraction unit that calculates the degree of relevance between important words using the count list generated by the counting unit. The series of processes uses the related word automatic extraction method according to claim 1.
  • the user can accurately extract related words desired by the user, such as technical terms, new words, and buzzwords, without being aware of the internal structure of the related word automatic extraction device.
  • the fourteenth invention automatically extracts a plurality of important words using not only the number of occurrences of the important words contained in the documents in the database but also the number of occurrences of a plurality of important words within a certain range.
  • documents in the database are read one by one, an important word is searched for in the document, and a search is made for another important word within a predetermined range from the important word found. When an important word is found within the range, the important word pair is stored in the count list: the pair is looked up in the count list created so far, and if the same pair has already been counted, the count list is updated by adding 1 to its occurrence count. If the pair is not found in the count list, its count is set to 1 and it is saved in the count list.
  • a fifteenth invention is directed to an important word upper/lower hierarchical relationship extraction program that automatically extracts surface expressions included in the documents in a database and uses the upper/lower hierarchical relationship of important words constructed automatically from those surface expressions.
  • FIG. 1 is a block diagram of a related word automatic extraction device according to an embodiment of the present invention.
  • FIG. 2 is a conceptual diagram of an important word list used in the related word automatic extraction device according to the embodiment.
  • FIG. 3 is a conceptual diagram of a count list used in the automatic related word extracting apparatus according to the embodiment.
  • FIG. 4 is a conceptual diagram of a relevance judgment list created based on the count list of FIG. 3 and the important word list of FIG. 2.
  • FIG. 5 is a flowchart showing a procedure for extracting a plurality of important words within a certain range in the method for automatically extracting related words according to the embodiment.
  • FIG. 6 is a flowchart showing a procedure for extracting upper and lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment.
  • FIG. 1 is a block diagram of a related word automatic extraction device according to an embodiment of the present invention.
  • the automatic related word extraction device includes a database unit 1 that stores documents in a field designated by a user; an important word analysis unit 2 that extracts and selects important words contained in the database unit 1; a counting unit 3 that obtains statistical information on the important words selected by the important word analysis unit 2 and information on the hierarchical relationship of the important words; and a related word extraction unit 4 that calculates the degree of relevance between important words using the count list generated by the counting unit 3.
  • the device selects important words, that is, words of high importance, from the documents in the database unit 1, and calculates the degree of relevance between important words using statistical information on the important words or on pairs of important words.
  • the database unit 1 includes a same document determination function unit 11, which identifies the same documents in the input document group and, when a plurality of the same documents are included, keeps one document and removes the others, and a database 12, which stores the documents from which the same documents have been removed by the same document determination function unit 11.
  • for example, when the documents in the database 12 are patent documents, the “name of the applicant”, “name of the invention” and “names of the inventors” are extracted from the header of each patent document, and the following conditions are checked: (1) the names of the applicants are the same; (2) the names of the inventions are the same; (3) the number of inventors is the same and the names of the inventors are all the same (in any order). Documents that satisfy all of conditions (1) to (3) are regarded as the same document.
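The same-document check described above can be sketched in Python as follows. This is only an illustration under assumptions: the header is taken to be already parsed into a dictionary, and the field names `applicant`, `title` and `inventors` as well as both function names are hypothetical, not taken from the patent.

```python
from typing import Any, Dict, List, Tuple

Header = Dict[str, Any]  # hypothetical parsed header: applicant, title, inventors


def header_key(doc: Header) -> Tuple[str, str, Tuple[str, ...]]:
    """Build a comparison key from conditions (1) to (3)."""
    # Sorting the inventor names makes the comparison independent of order,
    # while still requiring the same number of inventors and the same names.
    inventors = tuple(sorted(doc["inventors"]))
    return (doc["applicant"], doc["title"], inventors)


def remove_same_documents(docs: List[Header]) -> List[Header]:
    """Keep the first document of each group of same documents."""
    seen = set()
    kept = []
    for doc in docs:
        key = header_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = [
    {"applicant": "Acme", "title": "Widget", "inventors": ["Sato", "Tanaka"]},
    {"applicant": "Acme", "title": "Widget", "inventors": ["Tanaka", "Sato"]},
    {"applicant": "Acme", "title": "Widget", "inventors": ["Sato"]},
]
unique = remove_same_documents(docs)
print(len(unique))  # the second document duplicates the first, so 2 remain
```

Which copy to keep is not specified in the text; keeping the first one encountered is an arbitrary choice made for this sketch.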
  • the important word analysis unit 2 includes a morphological analysis unit 21 and an important word extraction unit 22.
  • the morphological analysis unit 21 divides the documents in the database into parts of speech by morphological analysis and acquires part-of-speech information.
  • the important word extraction unit 22 creates compound words by applying compound word processing, for example joining consecutive nouns, to the morphemes that the morphological analysis unit 21 has divided by part of speech.
  • each compound word is stored as an important word in the important word list together with its part-of-speech information and statistical information.
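As a rough illustration of the compound word processing mentioned above (joining consecutive nouns), the following Python sketch assumes that morphological analysis has already produced a list of (surface form, part-of-speech tag) pairs. The tag name `NOUN`, the space used as a joiner, and the function name are assumptions made for the example; the patent does not name a particular analyzer or tag set.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); tag names are assumed


def build_compound_words(tokens: List[Token], noun_tag: str = "NOUN",
                         joiner: str = " ") -> List[str]:
    """Join every run of consecutive nouns into one candidate word."""
    compounds: List[str] = []
    run: List[str] = []
    for surface, pos in tokens:
        if pos == noun_tag:
            run.append(surface)          # extend the current run of nouns
        elif run:
            compounds.append(joiner.join(run))
            run = []
    if run:                              # flush a run that ends the document
        compounds.append(joiner.join(run))
    return compounds


tokens = [("related", "ADJ"), ("word", "NOUN"), ("extraction", "NOUN"),
          ("uses", "VERB"), ("a", "DET"), ("document", "NOUN"),
          ("database", "NOUN")]
print(build_compound_words(tokens))  # ['word extraction', 'document database']
```

In this sketch a noun that stands alone is returned as a run of length one; for Japanese text the joiner would be an empty string.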
  • Important words are limited to compound words created by the above method
  • the part of speech of words that are considered to characterize the content of each document in the database 12 such as common nouns other than compound words, proper nouns, undefined words, etc.
  • This exclusion list may include, in addition to words that are excluded on an exact match for each morpheme, words that are excluded on a partial match.
  • important words with the same meaning are stored in a same word list, and when important words are extracted, the statistical information on the words in this same word list is saved together, which can improve the accuracy of important word extraction.
  • Figure 2 is a conceptual diagram of the important word list.
  • the “statistical information” stored in the important word list consists of the number of occurrences 25 of the important word 23 in the database and the proportion of documents 24 in the database that contain the important word. This information is the basis of the various statistics used later in the counting unit 3 and the related word extraction unit 41.
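The important word list statistics can be sketched as below. The sketch assumes each document has already been reduced to a list of candidate important words; the function name, the dictionary layout of the result, and the choice to apply the same word list before the exclusion list are assumptions for illustration only, and only exact-match exclusion is shown.

```python
from collections import Counter
from typing import Dict, List, Set


def important_word_stats(docs: List[List[str]],
                         exclusion: Set[str],
                         same_words: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return occurrence count and document proportion per important word."""
    occurrences: Counter = Counter()   # total occurrences in the database
    doc_counts: Counter = Counter()    # number of documents containing the word
    for words in docs:
        # Map each word to its representative in the same word list, if any.
        canonical = [same_words.get(w, w) for w in words]
        kept = [w for w in canonical if w not in exclusion]
        occurrences.update(kept)
        doc_counts.update(set(kept))   # count each document at most once
    n_docs = len(docs)
    return {w: {"occurrences": occurrences[w],
                "doc_ratio": doc_counts[w] / n_docs}
            for w in occurrences}


docs = [["search engine", "database", "method"],
        ["retrieval engine", "database", "database"],
        ["thesaurus"]]
stats = important_word_stats(
    docs,
    exclusion={"method"},
    same_words={"retrieval engine": "search engine"},
)
print(stats["database"])  # 3 occurrences, contained in 2 of 3 documents
```

Because "retrieval engine" is mapped to "search engine", the two are counted together, which is the effect the same word list is described as having.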
  • a plurality of different search condition expressions, one for each important word, are created and set separately on the different processors of a massively parallel computer having a plurality of processors, and a full-text search of the document group stored in the database 12 is performed with the different search condition expressions simultaneously and in parallel.
  • the results obtained in this way can be used.
  • the number of results that match each search condition expression is the number of documents that include each important word in the database 12.
  • the accuracy of the statistical information can be maintained by performing the full-text search each time the important word analysis unit 2 performs the processing.
  • the massively parallel computer incorporates thousands to tens of thousands of processors (hereinafter collectively referred to as a pipeline), so that a plurality of different search condition expressions can be set in the pipeline at the same time. These many processors operate simultaneously and match the plurality of search conditions against the database, thereby performing a full-text search. If the matching finds a document that satisfies a search condition, the document is regarded as a hit.
  • the massively parallel computer is desirably a device such as a full-text search engine (for example, FDF (registered trademark) 4 TextFinder manufactured by Paracel Corporation).
  • the counting unit 3 includes an extracting unit 31 for extracting a plurality of important words within a certain range, and an extracting unit 32 for extracting a hierarchical relationship between important words.
  • the user selects in advance either the extraction unit 31 for a plurality of important words within a certain range or the extraction unit 32 for the upper/lower hierarchical relationship of important words, and only the selected processing is performed.
  • the extraction unit 31 for a plurality of important words within a certain range takes an important word extracted by the important word analysis unit 2 as a reference and, when another important word exists within a predefined range from the reference, defines them as a plurality of important words; it counts the number of occurrences of the plurality of important words and saves the counts as a count list.
  • the procedure for extracting multiple important words is shown in the flowchart of FIG. 5, and the details will be described later.
  • the extraction unit 32 for the upper/lower hierarchical relationship of important words defines in advance surface expressions in which the relationship between upper and lower words is expressed explicitly, and extracts the surface expressions that contain important words extracted by the important word analysis unit 2.
  • the important words in an extracted surface expression are defined as upper and lower important words, and their occurrence counts are stored as a count list.
  • the procedure for extracting the hierarchical relationship of key words is shown in the flowchart of Fig. 6, and the details will be described later.
  • the related word extracting unit 4 includes a related word extracting unit 41.
  • the related word extraction unit 41 determines related words based on the count list created by the counting unit 3. For example, to determine the dissimilarity between two words, a judgment index such as Information Radius (Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press (ISBN 0-262-13360-1)) can be used.
  • when the extraction unit 31 for a plurality of important words within a certain range is selected, pairs of important words that share a common important word within the certain range can also be determined to be related words; when the extraction unit 32 for the upper/lower hierarchical relationship of important words is selected, pairs of important words that share common lower important words can likewise be determined to be related words.
  • FIG. 3 is a conceptual diagram of the count list, in which the ID 33 of important word 1, the ID 34 of important word 2, and the number of occurrences 35 of the pair of important word 1 and important word 2 form the items of the list.
  • FIG. 4 is a conceptual diagram of a relevance judgment list created based on the count list of FIG. 3 and the important word list of FIG. 2.
  • each column and each row is an important word extracted by the important word analysis unit 2, and one of the important word pairs extracted by the counting unit 3 is arranged in a column and the other is arranged in a row. For example, for key word pairs that exist within a certain range in Fig. 5, key word A is placed in a column, and key word B is placed in a row.
  • the upper important words are arranged in columns and the lower important words are arranged in rows.
  • the number of each cell indicates the appearance probability. For example, in column c, row A, “probability that key word A and key word c appear within a certain range” or “probability that key word A is an upper word and key word c is a lower word”.
  • as an example of related word determination, the case of using the Information Radius index to determine the dissimilarity between two words is described.
  • the statistic is the “dissimilarity between two words” calculated from these occurrence probabilities, and it is calculated for all pairs of the important words denoted by uppercase letters (A and B, A and C, A and D, ..., B and C, B and D, ..., C and D, ...).
  • for example, the dissimilarity of A and D is calculated from the occurrence probabilities of a, b, c, d, ... for A and the occurrence probabilities of a, b, c, d, ... for D.
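The dissimilarity computation can be illustrated with a short Python sketch. It uses the definition of information radius given in the cited Manning and Schütze textbook, IRad(p, q) = D(p ‖ m) + D(q ‖ m) with m = (p + q) / 2, where D is the Kullback-Leibler divergence. Whether the patent normalizes each row of the relevance judgment list before applying the index is not stated, so the normalization step, the base-2 logarithm and the function names are assumptions.

```python
import math
from typing import List, Sequence


def normalize(row: Sequence[float]) -> List[float]:
    """Scale a row of the relevance judgment list so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def information_radius(p: Sequence[float], q: Sequence[float]) -> float:
    """IRad(p, q) = D(p || m) + D(q || m), where m is the average of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)


# Rows for important words A and D over the columns a, b, c (made-up values).
row_a = normalize([2, 2, 0])
row_d = normalize([0, 3, 3])
print(information_radius(row_a, row_d))  # 1.0: the rows overlap only in column b
print(information_radius(row_a, row_a))  # 0.0: identical distributions
```

With base-2 logarithms the value ranges from 0 for identical distributions to 2 for distributions with no column in common, so smaller values indicate more closely related words.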
  • FIG. 5 is a flowchart showing a procedure for counting the number of simultaneous appearances of a plurality of important words existing within a certain range in the related word automatic extraction method according to the embodiment of the present invention.
  • the documents in the database are read one by one (step S1), and the key words extracted by the key word analysis unit 2 are searched from the documents (step S2).
  • the important words to be searched here are not limited to those extracted by the important word analysis unit 2, but may be words included in a user-defined important word list defined by the user in some cases.
  • the user-defined important word list may include, in addition to words for which an exact match is the search condition, words that are treated as found on a partial match.
  • the total number of occurrences in the database, the proportion of documents in the database that contain the important word, and the number of characters may be applied as filters on the important words to be searched for, as necessary. By applying these filters, the important words can be narrowed down further, and as a result the accuracy of the related words finally extracted can be improved.
  • when an important word is found (YES in step S3), a search is made for another important word (called important word B) within a predetermined range from the important word found (called important word A) (step S4).
  • "within a certain range" means, for example, within one sentence (from the beginning of the sentence to the period ".") or within the two nearest positions before and after; it is not limited to these, and a range that is expected to represent the characteristics of each document is specified.
  • when an important word B is found within the certain range from important word A (YES in step S5), the pair of important word A and important word B is stored sequentially in the count list.
  • the pair of important word A and important word B is looked up in the count list created so far (step S6), and when the same pair already exists in the count list (YES in step S7), the count list is updated by adding one to its occurrence count (step S8).
  • if the pair does not exist in the count list (NO in step S7), the count of the pair of important word A and important word B is set to 1 and the pair is newly stored in the count list (step S9).
  • the processing from step S1 to step S9 is performed for a plurality of documents designated in advance in the database (step S10).
  • the importance of the pair of important word A and important word B is then determined (step S11).
  • in step S11, for example, the Dice coefficient or mutual information can be used.
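The counting procedure of FIG. 5 can be sketched in Python as follows, taking one sentence as the "certain range". The sketch makes several simplifying assumptions that are not in the patent: sentences are split on the period character, important words are matched as plain substrings, and the importance scores in step S11 are computed from sentence-level counts. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def count_cooccurrences(documents: Iterable[str], important_words: Set[str]):
    """Steps S1 to S10: count important word pairs that share a sentence."""
    pair_counts: Counter = Counter()   # the count list
    word_counts: Counter = Counter()   # sentences containing each word
    n_ranges = 0
    for doc in documents:                              # S1: read one document
        for sentence in doc.split("."):                # one sentence = one range
            if not sentence.strip():
                continue
            n_ranges += 1
            # S2 to S5: important words found in this range (substring match).
            found = sorted(w for w in important_words if w in sentence)
            word_counts.update(found)
            for pair in combinations(found, 2):
                # S6 to S9: a Counter starts a new pair at 0, so adding 1
                # covers both "store with count 1" and "add 1 to the count".
                pair_counts[pair] += 1
    return pair_counts, word_counts, n_ranges


def dice(pair: Pair, pair_counts: Counter, word_counts: Counter) -> float:
    """Dice coefficient; call only for pairs present in the count list."""
    a, b = pair
    return 2 * pair_counts[pair] / (word_counts[a] + word_counts[b])


def pmi(pair: Pair, pair_counts: Counter, word_counts: Counter, n: int) -> float:
    """Pointwise mutual information in bits, for pairs in the count list."""
    a, b = pair
    return math.log2(pair_counts[pair] * n / (word_counts[a] * word_counts[b]))


docs = ["The database stores documents. The thesaurus lists related words.",
        "A thesaurus is built from the database. The database is updated."]
words = {"database", "thesaurus", "documents"}
pairs, singles, n = count_cooccurrences(docs, words)
print(dice(("database", "thesaurus"), pairs, singles))  # 2 * 1 / (3 + 2) = 0.4
```

Pairs are stored with their two words in alphabetical order, so ("database", "thesaurus") and its reverse share one entry in the count list.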
  • FIG. 6 is a flowchart showing a procedure for extracting upper and lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment of the present invention.
  • the documents in the database are read one by one (step S21), and surface expressions listed in a surface expression list created in advance are extracted from the document (step S22).
  • the surface expressions written in the surface expression list are ones in which the relationship between the upper word and the lower words is expressed explicitly.
  • for example, in the surface expression "D such as A, B, C", the upper word is D and the lower words are A, B, and C.
  • when a surface expression is extracted in step S22 (YES in step S23), a search is made as to whether important words extracted by the important word analysis unit 2 are included in the upper word part and the lower word part of the surface expression (step S24).
  • the important words to be searched are not limited to those extracted by the important word analysis unit 2, and may be words included in a user-defined important word list defined in advance by a user in some cases.
  • the user-defined important word list may include words that are to be searched if they partially match, in addition to words for which a perfect match is a search condition.
  • the upper and lower important word pairs found are stored sequentially in the count list.
  • measures for judging the importance of the upper and lower important words include a comparison of the proportions of documents in the database 12 that contain the upper and lower important words and a comparison of the morphemes of the upper and lower important words.
  • the upper and lower key word pairs that are always excluded are retained as upper and lower key word pair exclusion lists. The function of excluding upper and lower key word pairs in the upper and lower key word pair exclusion list may be applied as necessary.
  • the upper and lower important word pair is looked up in the count list created so far (step S26), and if the same pair already exists in the count list (YES in step S27), the count list is updated by adding 1 to its occurrence count (step S28).
  • if the pair does not exist in the count list (NO in step S27), the count of the upper and lower important word pair is set to 1 and the pair is newly stored in the count list (step S29).
  • the processing from step S21 to step S29 is performed for a plurality of documents specified in advance in the database (step S30).
  • the upper/lower hierarchical relationship of the important words is constructed based on the statistical information in the count list and the important word list created in steps S21 to S30 (step S31).
  • a threshold may be set for all occurrences of the upper / lower keyword pairs in the database, and the upper / lower keyword pairs below the threshold may be excluded as necessary.
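The procedure of FIG. 6 can be sketched in Python for the single surface expression "D such as A, B, C". The regular expression, the restriction to single-word terms, the `min_count` parameter name and the function name are all assumptions made for this example; the patent's surface expression list may contain other patterns.

```python
import re
from collections import Counter
from typing import Iterable, Set

# One assumed surface expression: "<upper> such as <lower>, <lower> and <lower>".
SURFACE = re.compile(r"(\w+) such as ([^.;]+)")


def count_hierarchy_pairs(documents: Iterable[str],
                          important_words: Set[str],
                          min_count: int = 1) -> Counter:
    """Steps S21 to S30, followed by the optional occurrence threshold."""
    counts: Counter = Counter()                      # (upper, lower) count list
    for doc in documents:                            # S21: read one document
        for match in SURFACE.finditer(doc):          # S22: extract expressions
            upper = match.group(1)
            lowers = [t.strip()
                      for t in re.split(r",|\band\b", match.group(2))]
            if upper not in important_words:         # S24: upper word part
                continue
            for lower in lowers:
                if lower in important_words:         # S24: lower word part
                    counts[(upper, lower)] += 1      # S26 to S29
    # Exclude pairs whose occurrence count is below the threshold.
    return Counter({p: c for p, c in counts.items() if c >= min_count})


docs = ["We tested databases such as Oracle, MySQL and Postgres.",
        "Use databases such as MySQL."]
words = {"databases", "Oracle", "MySQL", "Postgres"}
counts = count_hierarchy_pairs(docs, words)
print(counts[("databases", "MySQL")])                 # 2
print(count_hierarchy_pairs(docs, words, min_count=2))
```

With `min_count=2` only the pair ("databases", "MySQL") survives, which mirrors the use of a threshold to drop upper/lower pairs that occur too rarely.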
  • Industrial applicability: the present invention can be used effectively as a related word automatic extraction method that automatically extracts words closely related to a word specified by a user based on statistical information of words included in a database and that enables extraction of technical terms appearing in a specific field specified by the user, as well as new words and buzzwords, that are not listed in general existing thesaurus dictionaries, and as a related word automatic extraction device that implements the method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A document group in a field specified by a user is stored in a database (1). An important word analysis unit (2) selects important words of high importance from the document group in the database (1). A counting unit (3) creates a count list as statistical information on an important word or an important word pair. Based on this count list, a related word extraction unit (4) judges the degree of relevance between important words.

Description

Specification
Method for automatically extracting related words
Technical field
The present invention relates to a related word automatic extraction method that automatically extracts words closely related to a word specified by a user, based on statistical information on the words contained in a database. In particular, it relates to a related word automatic extraction method, and a related word automatic extraction device, that can extract technical terms appearing in a specific field designated by the user, as well as new words and buzzwords, that are not listed in general existing thesaurus dictionaries.
Background art
A conventional related word automatic extraction device generally has an existing thesaurus dictionary as an internal component, and merely looks up the word specified by the user in that thesaurus dictionary and displays the result as the related word extraction result.
However, conventional related word automatic extraction devices have the disadvantage that technical terms, new words, and buzzwords that are not listed in existing thesaurus dictionaries cannot be extracted, regardless of their importance.
In addition, when related words were needed for several fields, a separate thesaurus dictionary had to be prepared for each field, which was also wasteful in terms of cost.
Furthermore, even among methods that extract related words automatically from the statistical information of a database without using an existing thesaurus dictionary, conventional methods generally use only, for example, the occurrence frequency of words that appear individually.
Therefore, even when a document database containing technical terms, new words, and buzzwords is used, the extraction accuracy of the related word extraction method is inadequate, and it has been difficult to extract the precise related words that the user wants.
Disclosure of the invention
The present invention has been made to solve the above-mentioned problems of the prior art. Its purpose is to provide a related word automatic extraction method and a related word automatic extraction device that can automatically extract technical terms appearing in a specific field specified by a user, as well as new words and buzzwords, that are not listed in general existing thesaurus dictionaries, and that can extract with high accuracy the important words closely related to a word specified by the user.
To solve this problem, the first invention is characterized by using a related word automatic extraction method that uses a group of documents in a field designated by a user as a database, selects important words, that is, words of high importance, from the documents in the database, and calculates the degree of relevance between important words using statistical information on the important words or on pairs of important words. Here, importance refers to the degree to which a word represents the characteristics of the content of the document, or the characteristics of the genre of the document.
According to this, it becomes possible to provide a method for automatically extracting technical terms appearing in a specific field specified by a user, as well as new words and buzzwords, that are not listed in general thesaurus dictionaries, and a device that uses the method.
The second invention is characterized in that, in addition to the configuration of the first invention, when document groups of a plurality of fields are accumulated in the database, the related words of each field can be extracted automatically.
According to this, in addition to the effect of claim 1, it becomes possible to extract field-specific related words, for example a word that is a related word of a given word in one field but not in another. In addition, since the user can define fields independently of the fields of existing thesaurus dictionaries, related words that match the level of the defined field can be extracted.
The third invention is characterized in that, in addition to the configuration of the first or second invention, the database can be updated or added to at any time, and the difference data is reflected successively when related words are automatically extracted.
According to this, in addition to the effect of the first or second invention, it becomes possible to extract the latest related words, including new words and buzzwords, that always reflect the latest information in the database.
The fourth invention is characterized in that, in addition to the configuration of any one of claims 1 to 3, the document group in the database is one in which the header information of the documents has been used to determine whether documents are identical and, when a plurality of identical documents were included, one document has been kept and the other identical documents have been removed.
According to this, in addition to the effect of any one of claims 1 to 3, the unwanted bias in the statistical information that arises when a particular document has many identical copies can be removed, and as a result the accuracy of related word extraction can be improved.

The fifth invention is characterized in that, in addition to the configuration of any one of claims 1 to 4, the important words are compound words created from the morphemes obtained by dividing the documents in the database into part-of-speech units.
According to this, in addition to the effect of any one of claims 1 to 4, the abstraction of words caused by division can be avoided, and the accuracy of the related words finally extracted can be improved.
The sixth invention is characterized in that, in addition to the configuration of any one of claims 1 to 5, the important words are words of the parts of speech that are predicted to represent the characteristics of each document in the database.
According to this, in addition to the effect of any one of claims 1 to 5, omissions among the extracted important words can be reduced.

The seventh invention is characterized in that, in addition to the configuration of any one of claims 1 to 6, words to be excluded from the important words are held as an exclusion list, and after the important words are extracted, the words in the exclusion list are excluded from the important words.
According to this, in addition to the effect of any one of claims 1 to 6, unnecessary words can be eliminated.
The eighth invention is characterized in that, in addition to the configuration of any one of claims 1 to 7, important words having the same meaning are held as a same-word list, and when the important words are extracted, the statistical information of the words in the same-word list is stored together.

According to this, in addition to the effect of any one of claims 1 to 7, the extraction accuracy of the important words can be improved.
The ninth invention is characterized in that, in addition to the configuration of any one of claims 1 to 8, the statistical information is the total number of occurrences in the database and the proportion of documents in the database that contain the important word.
According to this, in addition to the effect of any one of claims 1 to 8, the extraction accuracy can be improved.
The tenth invention is characterized in that, in addition to the configuration of claim 9, the statistical information uses not only the number of single occurrences of the important words contained in the documents in the database but also the number of occurrences of multiple important words within a certain range.
According to this, in addition to the effect of claim 9, meaning can be assigned more accurately by means of pairs of multiple important words, and as a result the accuracy of related word extraction can be improved.
The eleventh invention is characterized in that, in addition to the configuration of claim 9, besides the statistical information, surface expressions contained in the documents in the database are automatically extracted, and the upper/lower hierarchical relationships of important words automatically constructed from those surface expressions are used.

According to this, in addition to the effect of claim 9, noise caused by the accidental co-occurrence of multiple mutually unrelated important words can be removed, and as a result the accuracy of related word extraction can be improved.
The twelfth invention is characterized in that, in addition to the configuration of any one of claims 1 to 11, when the statistical information is calculated, a plurality of different search condition expressions are created; these search condition expressions are set separately on the plurality of different processors of a massively parallel computer; the document group accumulated in the database is searched in full text with the plurality of different search condition expressions concurrently; and the results that match the search condition expressions are used.
According to this, in addition to the effect of any one of claims 1 to 11, a plurality of different search condition expressions are created when the statistical information is calculated, so that accurate statistical information corresponding to the latest database can be used every time the automatic related word extraction method is applied, and as a result the accuracy of related word extraction can be improved.

The thirteenth invention is characterized in that, in addition to the configuration of claim 1, it comprises the database unit according to claim 1, which stores a group of documents in a field designated by the user; an important word analysis unit that extracts and selects the important words contained in the database unit; a counting unit that acquires statistical information on the important words selected by the important word analysis unit and information on the upper/lower hierarchical relationships of the important words; and a related word extraction unit that calculates the degree of relevance between important words using the count list generated by the counting unit, the series of processes using the automatic related word extraction method according to claim 1.
According to this, the user can accurately extract the desired related words, such as technical terms, new words, and buzzwords, without being aware of the internal structure of the automatic related word extraction device.
The fourteenth invention is a multiple important word extraction program that automatically extracts multiple important words by using, in addition to the number of single occurrences of the important words contained in the documents in a database, the number of occurrences of multiple important words within a certain range. The program reads the documents in the database one at a time; searches each document for important words; searches for another important word within a predefined range of each important word found; when an important word within that range is found, stores the pair of important words successively in a count list; searches the count list created so far for the pair; if the same pair already exists in the count list, adds 1 to its occurrence count and updates the count list; if it does not exist, newly stores the pair in the count list with a count of 1; performs this processing for a plurality of documents designated in advance in the database; and determines the importance of the pairs of important words based on the resulting count list.
According to this, meaning can be assigned rationally by means of pairs of multiple important words, and as a result the accuracy of related word extraction can be improved.
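The counting procedure of the fourteenth invention can be sketched as follows. This is an illustration only, not the program of the invention: the function name, the use of plain substring matching, the treatment of a pair as unordered, and the reading of "a certain range" as "within the same sentence and at most two important words apart" are all assumptions made for the example.

```python
import re
from collections import Counter

def count_keyword_pairs(documents, keywords, window=2):
    """Count pairs of important words found close together in one sentence.

    Assumption: `keywords` is a non-empty list of strings, and "within a
    certain range" means at most `window` important words apart.
    """
    counts = Counter()
    # Longest alternatives first, so a compound word wins over its parts.
    alternatives = sorted(map(re.escape, keywords), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives))
    for doc in documents:                          # one document at a time
        for sentence in re.split(r"[。.]", doc):   # a sentence ends at a full stop
            found = pattern.findall(sentence)      # important words, in order
            for i, first in enumerate(found):
                for second in found[i + 1 : i + 1 + window]:
                    if first != second:
                        # A new pair starts at 1; an existing pair gains 1.
                        counts[tuple(sorted((first, second)))] += 1
    return counts
```

A `Counter` covers both branches of the procedure at once: a pair that is not yet in the count list is stored with a count of 1, and a pair that is already present has 1 added to its count.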
The fifteenth invention is an important word hierarchical relationship extraction program that automatically extracts surface expressions contained in the documents in a database and uses the upper/lower hierarchical relationships of important words automatically constructed from those surface expressions. The program reads the documents in the database one at a time; extracts from each document the surface expressions written in a surface expression list prepared in advance; searches whether the upper-word part and the lower-word part of each extracted surface expression contain important words extracted by the important word analysis unit 2; when important words are found in both the upper-word part and the lower-word part, stores the pair of upper and lower important words successively in a count list; if the same pair already exists in the count list, adds 1 to its occurrence count and updates the count list; if it does not exist, newly stores the pair in the count list with a count of 1; performs this processing for a plurality of documents designated in advance in the database; and constructs the upper/lower hierarchical relationships of the important words based on the resulting count list.
According to this, noise caused by the accidental co-occurrence of multiple mutually unrelated important words can be removed rationally.

Brief description of the drawings
FIG. 1 is a block diagram of an automatic related word extraction device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of the important word list used in the automatic related word extraction device according to the embodiment.
FIG. 3 is a conceptual diagram of the count list used in the automatic related word extraction device according to the embodiment.
FIG. 4 is a conceptual diagram of the relevance determination list created from the count list of FIG. 3 and the important word list of FIG. 2.
FIG. 5 is a flowchart showing the procedure for extracting multiple important words within a certain range in the automatic related word extraction method according to the embodiment.
FIG. 6 is a flowchart showing the procedure for extracting the upper/lower hierarchical relationships of important words in the automatic related word extraction method according to the embodiment.

Best mode for carrying out the invention
The present invention is described in detail below on the basis of the illustrated embodiment.
FIG. 1 is a block diagram of an automatic related word extraction device according to an embodiment of the present invention.
This automatic related word extraction device comprises a database unit 1 that stores a group of documents in a field designated by the user; an important word analysis unit 2 that extracts and selects the important words contained in the database unit 1; a counting unit 3 that acquires statistical information on the important words selected by the important word analysis unit 2 and information on the upper/lower hierarchical relationships of the important words; and a related word extraction unit 4 that calculates the degree of relevance between important words using the count list generated by the counting unit 3. The device selects important words, that is, words of high importance, from the documents in the database unit 1 and calculates the degree of relevance between important words using statistical information on the important words or on pairs of important words.

The database unit 1 consists of an identical document determination function unit 11, which identifies identical documents in the input document group and, when a plurality of identical documents are included, keeps one document and removes the other identical documents, and a database 12, which stores the documents remaining after the identical document determination function unit 11 has removed the identical documents.
The identical document determination function unit 11 is described in detail below.
For example, assuming that the documents in the database 12 are patent documents, the "name of the applicant", the "title of the invention", and the "names of the inventors" are extracted from the header part of each patent document, and it is determined whether (1) the "name of the applicant" is identical, (2) the "title of the invention" is identical, and (3) the number of inventors is the same and every "name of the inventor" matches (in any order). All documents that satisfy conditions (1) to (3) are regarded as identical documents.
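Conditions (1) to (3) amount to comparing a key built from the header fields. The following is a minimal sketch under stated assumptions: each document is represented as a dictionary, and the field names `applicant`, `title`, and `inventors` are hypothetical names chosen for the example.

```python
def remove_identical_documents(documents):
    """Keep the first document of each group that shares applicant, title, and inventors."""
    seen = set()
    kept = []
    for doc in documents:
        # Sorting the inventor names makes the comparison independent of the
        # listing order while still requiring the same number of inventors.
        key = (doc["applicant"], doc["title"], tuple(sorted(doc["inventors"])))
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Keeping the first document encountered is one possible choice; the description only requires that a single document of each identical group remain.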
The important word analysis unit 2 consists of a morphological analysis unit 21 and an important word extraction unit 22.
The morphological analysis unit 21 divides the documents in the database into part-of-speech units by morphological analysis and acquires the part-of-speech information.
The important word extraction unit 22 creates compound words by applying compound word processing, for example joining consecutive nouns, to the morphemes that the morphological analysis unit 21 has divided into part-of-speech units, and stores each compound word as an important word in the important word list together with its part-of-speech information and statistical information. Creating compound words avoids the abstraction of words caused by division and improves the accuracy of the related words finally extracted.

Important words are not limited to the compound words created by the above method. The parts of speech of words considered to characterize the content of a document, for example common nouns other than compound words, proper nouns, and undefined words, are specified for each genre of documents in the database 12.
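The compound word processing mentioned above, joining consecutive nouns, can be illustrated as follows. The sketch assumes that a morphological analyzer has already produced `(surface, part_of_speech)` pairs; the tag name `"noun"`, the function name, and the `joiner` parameter are hypothetical.

```python
def build_compound_words(tagged_morphemes, joiner=""):
    """Join each run of consecutive nouns into one compound word.

    `joiner` is "" for languages written without spaces; a language such as
    English would use " " instead.
    """
    compounds = []
    run = []
    for surface, pos in tagged_morphemes:
        if pos == "noun":
            run.append(surface)          # extend the current run of nouns
        else:
            if run:
                compounds.append(joiner.join(run))
            run = []                     # any other part of speech ends the run
    if run:                              # flush a run that reaches the end
        compounds.append(joiner.join(run))
    return compounds
```

In this sketch a noun that stands alone is returned as a one-element "compound", so the result can be used directly as a list of candidate important words.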
A function may also be added that holds words which, depending on the case, must always be excluded as an exclusion list and, after the important words have been extracted, excludes the words in the exclusion list from the important words. Specifically, for each genre of documents in the database, words that cannot characterize the content of the document may be registered in the exclusion list, for example "inventor" and "comparative example" in the case of patent documents.
In addition to words whose exclusion condition is an exact match for each morpheme, the exclusion list may contain words that are excluded whenever they match partially.

Furthermore, the extraction accuracy of the important words can be improved by holding important words that have the same meaning as a same-word list and, when the important words are extracted, storing the statistical information of the words in this list together.
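As a rough illustration of the exclusion list and the same-word list, the sketch below filters a table of occurrence counts and merges the counts of words that share a meaning. The function name, the split of the exclusion list into two sets, and the representation of the same-word list as a mapping from each variant to a representative word are assumptions made for the example.

```python
from collections import Counter

def filter_and_merge(counts, exact_exclusions, partial_exclusions, same_words):
    """Drop excluded words, then pool the counts of words with the same meaning."""
    merged = Counter()
    for word, n in counts.items():
        if word in exact_exclusions:
            continue                                  # exact-match exclusion
        if any(part in word for part in partial_exclusions):
            continue                                  # partial-match exclusion
        # Words in the same-word list are counted under one representative.
        merged[same_words.get(word, word)] += n
    return merged
```

For instance, with `same_words = {"PC": "personal computer"}`, the occurrences of "PC" are added to those of "personal computer".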
FIG. 2 is a conceptual diagram of the important word list.
Here, the "statistical information" to be stored in the important word list consists of the total number of occurrences 25 of the important word 23 in the database and the proportion of the number of documents 24 in the database that contain the important word. This information is the basis of the various statistics used later in the counting unit 3 and the related word extraction unit 41.
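The two statistics can be computed as in the following sketch, which assumes that each document is a plain string and that an occurrence is a non-overlapping substring match; the function name is hypothetical.

```python
def keyword_statistics(documents, keyword):
    """Return (total occurrences, proportion of documents containing the keyword)."""
    total = sum(doc.count(keyword) for doc in documents)
    containing = sum(1 for doc in documents if keyword in doc)
    ratio = containing / len(documents) if documents else 0.0
    return total, ratio
```

A word that occurs many times in a few documents and a word that occurs once in every document can have the same total, which is why both values are kept.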
To obtain the number of documents in the database 12 that contain an important word, a plurality of different search condition expressions corresponding to the respective important words can be created and set separately on the plurality of different processors of a massively parallel computer; the document group accumulated in the database 12 is then searched in full text with these search condition expressions concurrently, and the results that match the search condition expressions are used. The number of results that match each search condition expression is the number of documents in the database 12 that contain the corresponding important word. Performing this full-text search every time the important word analysis unit 2 runs keeps the statistical information accurate.
The massively parallel computer contains several thousand to several tens of thousands of processors (hereinafter collectively called the pipeline), which allows a plurality of different search condition expressions to be set in the pipeline at the same time. By operating this large number of processors simultaneously, it executes a full-text search that matches the plurality of different search condition expressions against the database. When the matching finds a document that satisfies a search condition expression, that document is regarded as a hit.
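The idea of evaluating many search condition expressions against the same document group at once can be imitated on ordinary hardware. The sketch below is not the massively parallel computer described above: it substitutes a standard thread pool for the pipeline and reduces each search condition expression to a single substring, both of which are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_hits(documents, keyword):
    """Number of documents that match one (simplified) search condition."""
    return sum(1 for doc in documents if keyword in doc)

def document_counts(documents, keywords, workers=4):
    """Evaluate one search condition per important word concurrently.

    `keywords` is assumed to be a list, since it is iterated twice.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda kw: count_hits(documents, kw), keywords)
        return dict(zip(keywords, hits))
```

Each worker plays the role of one processor holding one search condition expression, and the hit count returned for each expression is the number of documents that contain the corresponding important word.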
The massively parallel computer is preferably a device such as a full-text search engine (for example, FDF (registered trademark) 4T TextFinder manufactured by Paracel), but a workstation or other device with equivalent functions and performance may also be used.

The counting unit 3 consists of an extraction unit 31 for multiple important words within a certain range and an extraction unit 32 for the upper/lower hierarchical relationships of important words.
In the automatic related word extraction method, the user selects in advance either the processing of the extraction unit 31 for multiple important words within a certain range or the processing of the extraction unit 32 for the upper/lower hierarchical relationships of important words, and only the selected processing is performed.
The extraction unit 31 for multiple important words within a certain range takes each important word extracted by the important word analysis unit 2 as a reference, defines the case in which another important word exists within a predefined range of the reference as multiple important words, counts the number of occurrences of such multiple important words, and stores the result as a count list. The procedure for extracting multiple important words is shown in the flowchart of FIG. 5 and is described in detail later.
The extraction unit 32 for the upper/lower hierarchical relationships of important words uses surface expressions, defined in advance, in which the relationship between an upper word and a lower word is expressed clearly, and extracts the surface expressions that contain important words extracted by the important word analysis unit 2. The important words in each extracted surface expression are treated as an upper important word and a lower important word, their occurrences are counted, and the result is stored as a count list. The procedure for extracting the upper/lower hierarchical relationships of important words is shown in the flowchart of FIG. 6 and is described in detail later.
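The processing of the extraction unit 32 can be sketched with regular expressions standing in for the surface expression list. The two English patterns below ("X such as Y" and "Y is a kind of X") are hypothetical examples and are not taken from the specification, as are the function name and the use of substring matching to find important words.

```python
import re
from collections import Counter

# Hypothetical surface expression list: every pattern names an upper-word
# part and a lower-word part.
SURFACE_EXPRESSIONS = [
    re.compile(r"(?P<upper>[\w ]+?) such as (?P<lower>[\w ]+)"),
    re.compile(r"(?P<lower>[\w ]+?) is a kind of (?P<upper>[\w ]+)"),
]

def count_hierarchy_pairs(documents, keywords):
    """Count (upper important word, lower important word) pairs."""
    counts = Counter()
    for doc in documents:                        # one document at a time
        for pattern in SURFACE_EXPRESSIONS:
            for match in pattern.finditer(doc):
                uppers = [k for k in keywords if k in match.group("upper")]
                lowers = [k for k in keywords if k in match.group("lower")]
                # A pair is stored only when both parts contain an important word.
                for upper in uppers:
                    for lower in lowers:
                        counts[(upper, lower)] += 1
    return counts
```

Unlike a pair of co-occurring words, the pair counted here is ordered, because the surface expression states which word is the upper word.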
In FIG. 1, the related word extraction unit 4 consists of a related word extraction unit 41. The related word extraction unit 41 performs related word determination on the basis of the count list created by the counting unit 3.

For related word determination, a determination index such as the Information Radius, which measures the dissimilarity of two words (Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, ISBN 0-262-13360-1), can be used. The determination is not limited to this, however. For example, when the extraction unit 31 for multiple important words within a certain range has been selected, pairs of important words that share a common important word within the certain range can be determined to be related words; when the extraction unit 32 for the upper/lower hierarchical relationships of important words has been selected, pairs of important words that share a common lower important word can be determined to be related words.
FIG. 3 is a conceptual diagram of the count list. The count list is created with the ID 33 of important word 1, the ID 34 of important word 2, and the number of occurrences 35 of the pair of important word 1 and important word 2 as its list items.
FIG. 4 is a conceptual diagram of the relevance determination list created from the count list of FIG. 3 and the important word list of FIG. 2.
The words A, B, C, D, ... arranged in the columns of FIG. 4 are the set of related word determination target words (important words) 42, which are the subject of related word determination, and the words a, b, c, d, ... arranged in the rows are the related word determination use words (important words) 43, which are used for the determination. Basically, the words in both the columns and the rows are important words extracted by the important word analysis unit 2; one word of each important word pair extracted by the counting unit 3 is placed in a column and the other in a row. For example, for an important word pair within a certain range as in FIG. 5, important word A is placed in a column and important word B in a row. For an upper/lower important word pair as in FIG. 6, the upper important word is placed in a column and the lower important word in a row. In the relevance determination list of FIG. 4, the number in each cell represents an occurrence probability. For example, the cell at row c, column A represents "the probability that important word A and important word c appear within a certain range" or "the probability that important word A is an upper word and important word c is a lower word".

As an example of related word determination, the case in which the Information Radius is used as the determination index for the dissimilarity of two words is described below.
The statistic is the "dissimilarity of two words" calculated from these occurrence probabilities, and it is calculated for every pair of the upper-case letters arranged in the columns (A and B, A and C, A and D, ..., B and C, B and D, ..., C and D, ...). Taking the determination of the relevance of important word A and important word D as an example, the difference between the occurrence probabilities of a, b, c, d, ... for A and the occurrence probabilities of a, b, c, d, ... for D is calculated as the dissimilarity. If the occurrence probabilities are the same in every row (row a column A = row a column D, row b column A = row b column D, row c column A = row c column D, row d column A = row d column D, ...), the dissimilarity is 0; that is, the similarity of A and D is at its maximum, and therefore the relevance of important word A and important word D is at its maximum. Conversely, if there is not a single word a, b, c, d, ... whose occurrence probability is non-zero for both, the dissimilarity is at its maximum, that is, the relevance is at its minimum. In this way the statistic is calculated for every pair of upper-case letters, and only the pairs at or below a certain threshold are determined to be mutually related words (related words).
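The calculation can be sketched as follows, using the definition of the Information Radius given in the cited textbook, IRad(p, q) = D(p || m) + D(q || m) with m = (p + q) / 2, where D is the Kullback-Leibler divergence. The sketch assumes that each column of the relevance determination list has been normalized into a probability distribution over the row words; the function names and the choice of base-2 logarithms are assumptions made for the example.

```python
import math
from itertools import combinations

def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), with m the average of p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:                       # terms with zero probability contribute 0
            total += pi * math.log2(pi / mi)
        if qi > 0:
            total += qi * math.log2(qi / mi)
    return total

def related_pairs(columns, threshold):
    """Return the pairs of column words whose dissimilarity is at or below the threshold."""
    return [(a, b) for a, b in combinations(sorted(columns), 2)
            if information_radius(columns[a], columns[b]) <= threshold]
```

With base-2 logarithms the value is 0 for identical columns and 2 for columns that share no row word with non-zero probability, which matches the two extreme cases described above.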
FIG. 5 is a flowchart showing the procedure for counting the number of simultaneous occurrences of multiple important words within a certain range in the automatic related word extraction method according to the embodiment of the present invention.
First, the documents in the database are read one at a time (step S1), and each document is searched for the important words extracted by the important word analysis unit 2 (step S2).
The important words to be searched for here are not limited to those extracted by the important word analysis unit 2; depending on the case, they may be words contained in a user-defined important word list defined in advance by the user. In addition to words whose search condition is an exact match, the user-defined important word list may contain words that are searched for whenever they match partially.
さらに、 探索すべき言葉の重要度の判定尺度として、 データベース中の全 出現回数、 データベース内にその重要語が含まれる文書数の割合や文字数を 必要に応じて探索対象重要語のフィルターに適用してもよい。 これらの各種 フィル夕一を適用することにより、 重要語を更に絞り込むことができ、 その 結果最終的に抽出される関連語の精度を向上させることができる。  Furthermore, as criteria for determining the importance of the word to be searched, the total number of occurrences in the database, the ratio of the number of documents in which the key word is included in the database, and the number of characters are applied to the filter of the key word to be searched as necessary. You may. By applying these various filters, important words can be further narrowed down, and as a result, the accuracy of related words finally extracted can be improved.
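A minimal sketch of such filtering is shown below. The text names the three measures (total occurrences, document ratio, character count) but not how they are thresholded, so the cut-off directions and values, the `KeywordStats` record, and the sample words are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class KeywordStats:
    word: str
    total_count: int   # total number of occurrences in the database
    doc_ratio: float   # fraction of documents that contain the word

def filter_keywords(stats, min_count=2, max_doc_ratio=0.5, min_chars=2):
    """Keep keywords that pass all three illustrative filters.

    The thresholds are hypothetical: rare words, words found in most
    documents, and very short words are dropped.
    """
    return [
        s.word
        for s in stats
        if s.total_count >= min_count
        and s.doc_ratio <= max_doc_ratio
        and len(s.word) >= min_chars
    ]

# Hypothetical statistics for four candidate important words.
stats = [
    KeywordStats("太陽電池", 40, 0.10),
    KeywordStats("方法", 900, 0.95),    # appears in almost every document
    KeywordStats("膜", 30, 0.05),       # a single character
    KeywordStats("量子ドット", 1, 0.01),  # appears only once
]
```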
When an important word is found (YES in step S3), a search is made for another important word (called important word B) within a predefined range of the important word that was found (called important word A) (step S4).
"Within a certain range" is defined, for example, as within one sentence (from the beginning of the sentence to the full stop "。") and within two important words before or after. The definition is not limited to this, however: the range specified is whatever is expected to represent the characteristics of each document in the database. When an important word B within the certain range of important word A is found (YES in step S5), the pair of important word A and important word B is stored in the count list as it is found.
The count list created so far is searched for the pair of important word A and important word B (step S6). If the same pair already exists in the count list (YES in step S7), 1 is added to its occurrence count and the count list is updated (step S8).
If the pair does not exist in the count list (NO in step S7), the pair of important word A and important word B is newly stored in the count list with a count of 1 (step S9).
The processing of steps S1 to S9 above is performed on a plurality of documents in the database specified in advance (step S10).
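Steps S1 to S10 can be sketched roughly as below, assuming the example range given in the text (same sentence, at most two important words apart). The function names, the sentence splitting on "。" or ".", the use of unordered pairs, and the plain substring search are simplifications chosen for the example, not details taken from the patent.

```python
import re

def find_keywords(sentence, keywords):
    """Return the keywords found in the sentence, ordered by position."""
    hits = []
    for kw in keywords:
        if not kw:
            continue  # skip empty strings to avoid an endless search
        start = sentence.find(kw)
        while start != -1:
            hits.append((start, kw))
            start = sentence.find(kw, start + len(kw))
    hits.sort()
    return [kw for _, kw in hits]

def count_pairs(documents, keywords, window=2):
    """Count keyword pairs that co-occur within `window` keywords in one sentence."""
    count_list = {}
    for doc in documents:                                # S1: read one document at a time
        for sentence in re.split(r"[。.]", doc):
            found = find_keywords(sentence, keywords)    # S2-S3: search for important words
            for i, a in enumerate(found):
                for b in found[i + 1 : i + 1 + window]:  # S4-S5: neighbours within range
                    if a == b:
                        continue
                    pair = tuple(sorted((a, b)))         # treat (A, B) and (B, A) as one pair
                    if pair in count_list:               # S6-S7: already in the count list?
                        count_list[pair] += 1            # S8: add 1 to the count
                    else:
                        count_list[pair] = 1             # S9: store with a count of 1
    return count_list                                    # S10: all documents processed
```

For instance, `count_pairs(["solar cell and silicon wafer use laser. laser cuts silicon."], ["solar cell", "silicon", "laser"])` counts the pair ("laser", "silicon") twice and the other two pairs once each.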
Then, based on the count list created in steps S1 to S10 and the statistical information in the important word list, the importance of each pair of important word A and important word B is determined (step S11). In step S11, for example, the Dice coefficient or mutual information can be used.
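The two measures named for step S11 can be written out as follows. These are the textbook definitions of the Dice coefficient and pointwise mutual information; how the patent normalizes the counts (for example, what `total` should be) is not stated in this passage, so the parameter meanings here are assumptions.

```python
import math

def dice(n_ab, n_a, n_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    if n_a + n_b == 0:
        return 0.0
    return 2.0 * n_ab / (n_a + n_b)

def pmi(n_ab, n_a, n_b, total):
    """Pointwise mutual information, log2(P(A, B) / (P(A) * P(B))).

    `total` is the number of counting units (for example sentences);
    which unit to use is an assumption of this sketch.
    """
    if n_ab == 0 or n_a == 0 or n_b == 0:
        return float("-inf")
    return math.log2(n_ab * total / (n_a * n_b))
```

Here `n_ab` would come from the count list of pairs, and `n_a` and `n_b` from the single-word statistics in the important word list.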
FIG. 6 is a flowchart showing the procedure for extracting the upper/lower hierarchical relationships of important words in the related word automatic extraction method according to the embodiment of the present invention.
First, the documents in the database are read one at a time (step S21), and the surface expressions written in a surface expression list prepared in advance are extracted from each document (step S22).
The surface expressions to be written in the surface expression list are those in which the relationship between a broader term and narrower terms is expressed clearly. For example, in the expression "A、B、C等のD" ("D such as A, B, and C", where A to D are each important words), the broader term is D and the narrower terms are A, B, and C.
Next, when a surface expression has been extracted in step S22 (YES in step S23), a search is made as to whether the broader-term part and the narrower-term part of the surface expression contain important words extracted by the important word analysis unit 2 (step S24).
The important words to be searched for here are not limited to those extracted by the important word analysis unit 2; in some cases they may be words contained in a user-defined important word list defined in advance by the user. In addition to words whose search condition is an exact match, the user-defined important word list may contain words that are treated as search targets when they match only partially.
When this search finds important words in both the broader-term part and the narrower-term part (YES in step S25), the upper/lower important word pairs that were found are stored in the count list as they are found. At this point, the following may be applied as needed as measures of the importance of an upper/lower important word pair: a comparison of the proportions of documents in database 12 that contain the upper important word and the lower important word; a comparison of the morphemes of the upper important word and the lower important word; and a function that holds the upper/lower important word pairs that must always be excluded as an upper/lower important word pair exclusion list and excludes the pairs on that list.
The count list created so far is searched for the upper/lower important word pair (step S26). If the same pair already exists in the count list (YES in step S27), 1 is added to its occurrence count and the count list is updated (step S28).
If the pair does not exist in the count list (NO in step S27), the upper/lower important word pair is newly stored in the count list with a count of 1 (step S29).
The processing of steps S21 to S29 above is performed on a plurality of documents in the database specified in advance (step S30).
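Steps S21 to S30 might look like the sketch below for the single surface expression given as an example, "A、B、C等のD". A real surface expression list would hold many patterns; the single `MARKER`, the substring-based keyword check, and the function names are assumptions of this example, and the optional importance checks and exclusion list are left out.

```python
MARKER = "等の"  # "such as": narrower terms precede it, the broader term follows

def extract_pairs(sentence, keywords):
    """Return (upper, lower) keyword pairs from one 'A、B、C等のD' expression."""
    if MARKER not in sentence:                              # S22-S23: no surface expression
        return []
    lower_part, upper_part = sentence.split(MARKER, 1)
    uppers = [k for k in keywords if k in upper_part]       # S24: keywords in broader-term part
    lowers = [k for k in keywords if k in lower_part]       # S24: keywords in narrower-term part
    return [(u, l) for u in uppers for l in lowers if u != l]  # S25: both parts matched

def count_hierarchy_pairs(documents, keywords):
    """Count how often each (upper, lower) pair is seen across the documents."""
    count_list = {}
    for doc in documents:                                   # S21: read one document at a time
        for sentence in doc.split("。"):
            for pair in extract_pairs(sentence, keywords):
                # S26-S29: add 1 if the pair exists, otherwise store it with a count of 1
                count_list[pair] = count_list.get(pair, 0) + 1
    return count_list                                       # S30: all documents processed
```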
Then, based on the count list created in steps S21 to S30 and the statistical information in the important word list, the upper/lower hierarchical relationships of the important words are constructed (step S31).
Specifically, suppose for example that upper important words A and B sharing a common lower important word C have been extracted, and that the pair of upper important word A and lower important word B has also been extracted. Viewed as a whole, the only pairs in a direct upper/lower relationship are the A (upper) - B (lower) pair and the B (upper) - C (lower) pair; the A (upper) - C (lower) pair is merely redundant. The redundant A-C pair is therefore excluded when the upper/lower hierarchical relationships of the important words are constructed.
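Removing such redundant pairs amounts to what graph theory calls a transitive reduction; that term and the code below are an illustrative reading of the example, not wording from the patent. The sketch drops any (upper, lower) pair whose lower word can also be reached from the upper word through at least one intermediate word.

```python
def remove_redundant_pairs(pairs):
    """Drop (upper, lower) pairs that are implied by a longer chain of pairs."""
    children = {}
    for upper, lower in pairs:
        children.setdefault(upper, set()).add(lower)

    def reachable(start, target, seen):
        # Depth-first search over lower words; `seen` guards against cycles.
        for nxt in children.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reachable(nxt, target, seen | {nxt}):
                return True
        return False

    kept = set()
    for upper, lower in pairs:
        # The pair is redundant if another child of `upper` leads down to `lower`.
        indirect = any(
            mid != lower and reachable(mid, lower, {upper, mid})
            for mid in children[upper]
        )
        if not indirect:
            kept.add((upper, lower))
    return kept
```

Applied to the example in the text, `remove_redundant_pairs({("A", "B"), ("B", "C"), ("A", "C")})` keeps A-B and B-C and drops A-C.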
In addition, when constructing the upper/lower hierarchical relationships, a threshold may be set on the total number of occurrences of each upper/lower important word pair in the database, and pairs below the threshold may be excluded as needed.

Industrial applicability

According to the present invention, in a related word automatic extraction method that automatically extracts words closely related to a word specified by the user, based on statistical information on the words contained in a database, the invention can be used effectively as a related word automatic extraction device. The device implements a method that can extract technical terms appearing in the specific field specified by the user, as well as new words and buzzwords, none of which are listed in general existing thesaurus dictionaries.

Claims

1. A related word automatic extraction method characterized by using a group of documents in a field specified by a user as a database, selecting important words, which are words of high importance, from the documents in the database, and extracting related words by calculating the degree of relevance between important words using statistical information, on the words contained in the database, for each important word or pair of important words.
2. The related word automatic extraction method according to claim 1, characterized in that, when document groups of a plurality of fields are stored in the database, related words can be extracted automatically for each field.
3. The related word automatic extraction method according to claim 1 or 2, characterized in that the database can be updated and added to at any time, and the difference data is reflected successively when related words are extracted automatically.
4. The related word automatic extraction method according to any one of claims 1 to 3, characterized in that the document group in the database is one for which the header information of the documents has been used to determine whether documents are identical and, where a plurality of identical documents were included, one document has been kept and the other identical documents removed.
5. The related word automatic extraction method according to any one of claims 1 to 4, wherein the important word is a compound word created from the morphemes obtained by dividing the documents in the database into part-of-speech units.
6. The related word automatic extraction method according to any one of claims 1 to 5, wherein the important word is of a part of speech that is expected to represent the characteristics of each document in the database.
7. The related word automatic extraction method according to any one of claims 1 to 6, wherein words to be excluded from the important words are held as an exclusion list and, after the important words are extracted, the words in the exclusion list are excluded from the important words.
8. The related word automatic extraction method according to any one of claims 1 to 7, wherein important words having the same meaning are held as a same-word list and, when important words are extracted, the statistical information of the words in the same-word list is stored together.
9. The related word automatic extraction method according to any one of claims 1 to 8, wherein the statistical information is the total number of occurrences in the database and the proportion of documents in the database that contain the important word.
10. The related word automatic extraction method according to claim 9, characterized in that the statistical information uses, in addition to the number of single occurrences of the important words contained in the documents in the database, the number of occurrences of a plurality of important words within a certain range.
11. The related word automatic extraction method according to claim 9, characterized in that, in addition to the statistical information, surface expressions contained in the documents in the database are extracted automatically and the upper/lower hierarchical relationships of important words constructed automatically from the surface expressions are used.
12. The related word automatic extraction method according to any one of claims 1 to 11, characterized in that, when the statistical information is calculated, a plurality of different search condition expressions are created, the plurality of different search condition expressions are set separately on the plurality of different processors of a massively parallel computer having a plurality of different processors, the document group stored in the database is full-text searched simultaneously and in parallel with the plurality of different search condition expressions, and the results matching the search condition expressions are used.
13. A related word automatic extraction device characterized by comprising: the database unit according to claim 1, which stores a group of documents in a field specified by a user; an important word analysis unit that extracts and selects the important words contained in the database unit; a counting unit that obtains statistical information on the important words selected by the important word analysis unit and upper/lower hierarchical relationship information on the important words; and a related word extraction unit that calculates the degree of relevance between important words using the count list generated by the counting unit, wherein the related word automatic extraction method according to claim 1 is used for the series of processes.
14. A multiple important word extraction program that automatically extracts multiple important words using, in addition to the number of single occurrences of the important words contained in the documents in a database, the number of occurrences of a plurality of important words within a certain range, the program being characterized by: reading the documents in the database one at a time; searching each document for important words; searching for whether another important word exists within a predefined range of an important word that was found; when an important word within the certain range of the important word is found, storing the pair of important words in a count list as it is found; searching the count list created so far for the pair of important words; if the same pair of important words already exists in the count list, adding 1 to its occurrence count and updating the count list; if it does not exist in the count list, newly storing the pair of important words in the count list with a count of 1; performing these processes on a plurality of documents in the database specified in advance; and determining the importance of each pair of important words based on the count list thus created.
15. An important word upper/lower hierarchical relationship extraction program that automatically extracts surface expressions contained in the documents in a database and uses the upper/lower hierarchical relationships of important words constructed automatically from the surface expressions, the program being characterized by: reading the documents in the database one at a time; extracting from each document the surface expressions written in a surface expression list prepared in advance; searching for whether the broader-term part and the narrower-term part of an extracted surface expression contain important words extracted by the important word analysis unit 2; when important words are found in both the broader-term part and the narrower-term part, storing the upper/lower important word pair that was found in a count list as it is found; if the same pair of important words already exists in the count list, adding 1 to its occurrence count and updating the count list; if it does not exist in the count list, newly storing the upper/lower important word pair in the count list with a count of 1; performing these processes on a plurality of documents in the database specified in advance; and constructing the upper/lower hierarchical relationships of the important words based on the count list thus created.
PCT/JP2002/012504 2001-11-30 2002-11-29 Method for automatically extracting related words WO2003046765A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2001/367472 2001-11-30
JP2001367472A JP3553543B2 (en) 2001-11-30 2001-11-30 Related word automatic extraction device, multiple important word extraction program, and upper and lower hierarchy relation extraction program for important words

Publications (1)

Publication Number Publication Date
WO2003046765A1 true WO2003046765A1 (en) 2003-06-05

Family

ID=19177212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2002/012504 WO2003046765A1 (en) 2001-11-30 2002-11-29 Method for automatically extracting related words

Country Status (2)

Country Link
JP (1) JP3553543B2 (en)
WO (1) WO2003046765A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786991A (en) * 2016-02-18 2016-07-20 中国科学院自动化研究所 Chinese emotion new word recognition method and system in combination with user emotion expression ways

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009081620A1 (en) * 2007-12-26 2009-07-02 T-Terminology, Ltd. Dictionary system
KR101071700B1 (en) 2009-11-04 2011-10-11 동국대학교 산학협력단 Method and apparatus for measuring subject and related terms of document using ontology
JP5208193B2 (en) * 2010-12-28 2013-06-12 ヤフー株式会社 Related word graph creation device, related word graph creation method, related word providing device, related word providing method, and program
JP5117590B2 (en) * 2011-03-23 2013-01-16 株式会社東芝 Document processing apparatus and program
US9183600B2 (en) 2013-01-10 2015-11-10 International Business Machines Corporation Technology prediction
JP6079361B2 (en) * 2013-03-27 2017-02-15 富士通株式会社 Document management apparatus, document management method, and document management program
JP6280859B2 (en) * 2014-11-20 2018-02-14 日本電信電話株式会社 Behavior network information extraction apparatus, behavior network information extraction method, and behavior network information extraction program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5297039A (en) * 1991-01-30 1994-03-22 Mitsubishi Denki Kabushiki Kaisha Text search system for locating on the basis of keyword matching and keyword relationship matching
JPH11203311A (en) * 1998-01-13 1999-07-30 Fujitsu Ltd Device for extracting related word and method therefor and computer readable recording medium for recording related word extraction program
JPH11328182A (en) * 1998-05-20 1999-11-30 Ricoh Co Ltd Device and method for automatic extraction of related word and information storage medium
JP2000222427A (en) * 1999-02-02 2000-08-11 Mitsubishi Electric Corp Related word extracting device, related word extracting method and recording medium with related word extraction program recorded therein

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAZUNORI SATO ET AL.: "Bunsho no jido bunrui ni okeru bun'ya kanrengo jisho no kosatsu", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS KENKYU HOKOKU, vol. 100, no. 439, 10 November 2000 (2000-11-10), pages 5 - 10, XP002961743 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786991A (en) * 2016-02-18 2016-07-20 中国科学院自动化研究所 Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN105786991B (en) * 2016-02-18 2019-03-15 中国科学院自动化研究所 In conjunction with the Chinese emotion new word identification method and system of user feeling expression way

Also Published As

Publication number Publication date
JP3553543B2 (en) 2004-08-11
JP2003167894A (en) 2003-06-13

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase