JP4900947B2

JP4900947B2 - Abbreviation extraction method, abbreviation extraction apparatus, and program

Info

Publication number: JP4900947B2
Application number: JP2007042935A
Authority: JP
Inventors: 裕一郎関口; 吉秀佐藤; 晴美川島; 英範奥田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-02-22
Filing date: 2007-02-22
Publication date: 2012-03-21
Anticipated expiration: 2027-02-22
Also published as: JP2008204399A

Description

The present invention relates to an abbreviation extraction method, an abbreviation extraction apparatus, and a program, and more particularly to a method, apparatus, and program that automatically extract abbreviations of a given phrase in a situation where a large number of documents can be obtained.

With the development of information media such as the Internet, anyone can easily publish information. As a result, documents by many different authors are published on networks in large quantities and can be easily obtained.

In a document set written by many authors, when the same phrase is written using abbreviations that each author has shortened in a different way, the same subject appears under different notations. This lowers the accuracy of analysis processing, such as information retrieval and information summarization, performed on that document set.

Automatically determining the relationship between such multiple abbreviations and the original phrase can therefore improve the accuracy of analysis processing on the document set.

Many techniques have been proposed for obtaining variant notations of an input phrase by analyzing how that phrase is used in a large number of documents.

In one conventional technique, an input phrase is divided by part of speech through morphological analysis. For each morpheme in the resulting group whose part of speech is noun-like or adjective-like, only a leading portion is extracted according to a predetermined pattern definition, and the extracted head portions of the morphemes are combined to produce abbreviations of the input phrase (see, for example, Patent Document 1).

In another known technique, an input document is morphologically analyzed. For each resulting morpheme, its part-of-speech information and semantic attribute information are used to judge, according to a predetermined pattern definition, whether it can be connected to the preceding and following morphemes, and a portion of the input document made up of connectable morphemes is acquired as an abbreviation used in the document (see, for example, Patent Document 2).
Patent Document 1: JP-A-11-110408
Patent Document 2: JP-A-6-161996

Both conventional techniques morphologically analyze the input phrase and extract abbreviations according to predetermined patterns based on part-of-speech information. Consequently, extraction accuracy drops for input phrases that contain colloquial expressions or newly coined words, which morphological analysis does not segment correctly.

In addition, abbreviations formed by a manner of shortening that does not match any predetermined pattern cannot be extracted.

An object of the present invention is to provide an abbreviation extraction method, an abbreviation extraction apparatus, and a program that can extract, with high accuracy, the various abbreviations obtained by shortening an input phrase, even when an abbreviation takes a form that is difficult to anticipate in advance.

The present invention is an abbreviation extraction method that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase, the method comprising: an abbreviation candidate creation step of obtaining, as abbreviation candidates, all combinations formed by taking from the characters of the input phrase fewer characters than the input phrase contains, without changing the relative order of the characters, and storing the candidates in a storage device; a document set acquisition step of acquiring from an external document database, for each of the abbreviation candidates created in the abbreviation candidate creation step, a set of documents containing the abbreviation candidate and a set of documents containing the input phrase, and storing them in a storage device; and an abbreviation score calculation step of quantifying the degree of similarity between the set of documents containing the abbreviation candidate and the set of documents containing the input phrase, thereby calculating an abbreviation score that indicates how likely the abbreviation candidate is to be an abbreviation of the input phrase, and storing the score in a storage device.

According to the present invention, abbreviations of an input phrase that are formed by a manner of shortening independent of morpheme structure can be extracted accurately by using a database containing a large number of documents.

Furthermore, according to the present invention, abbreviations formed by a previously unseen manner of shortening can be extracted accurately.

The best mode for carrying out the invention is described in the following embodiments.

FIG. 1 is a block diagram showing an abbreviation extraction apparatus 100 according to Embodiment 1 of the present invention.

A document database DB1, a phrase input device W1, and an abbreviation output device W2 are connected to the abbreviation extraction apparatus 100.

The document database DB1 stores a large number of documents. For example, it is a database built by successively collecting and recording documents published on the Web. For an information source whose documents are updated continually, such as a diary site on the Internet, an updated document in the site may be treated as a newly created document and collected. An existing web page search engine may also be used in place of the document database.

The phrase input device W1 is a device, such as a keyboard, used to input the phrase to be processed (the input phrase) to the abbreviation extraction apparatus 100. The abbreviation output device W2 is a device, such as a display or a printer, that outputs the abbreviations produced by the abbreviation extraction apparatus 100.

The abbreviation extraction apparatus 100 includes an abbreviation candidate creation unit 10, an abbreviation score calculation unit 20, and an abbreviation determination unit 30.

The abbreviation candidate creation unit 10 receives the input phrase from the phrase input device W1 and creates an abbreviation candidate by selecting, from the characters of the input phrase, any number of characters smaller than the length of the input phrase while preserving their relative order. It performs this for every possible combination, creating a plurality of abbreviation candidates, and records them in an abbreviation candidate buffer (not shown) of the abbreviation score calculation unit 20. Alternatively, candidates may be created for only a subset of the possible combinations rather than for all of them.

The input phrase is also recorded in an input phrase buffer (not shown) provided in the abbreviation score calculation unit 20.

For example, when the input phrase is 「通商産業省」 (Ministry of International Trade and Industry), the following 30 abbreviation candidates are obtained and recorded in the abbreviation candidate buffer provided in the abbreviation score calculation unit 20: 「通」, 「商」, 「産」, 「業」, 「省」, 「通商」, 「通産」, 「通業」, 「通省」, 「商産」, 「商業」, 「商省」, 「産業」, 「産省」, 「業省」, 「通商産」, 「通商業」, 「通商省」, 「通産業」, 「通産省」, 「通業省」, 「商産業」, 「商産省」, 「商業省」, 「産業省」, 「通商産業」, 「通商産省」, 「通商業省」, 「通産業省」, and 「商産業省」.
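The candidate creation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `abbreviation_candidates` is hypothetical, and the removal of duplicate strings (which only matters when the input phrase repeats a character) is an assumption.

```python
from itertools import combinations


def abbreviation_candidates(phrase: str) -> list[str]:
    """Return every order-preserving selection of fewer characters than `phrase` has."""
    # A dict keeps insertion order and silently drops duplicate strings.
    candidates: dict[str, None] = {}
    for length in range(1, len(phrase)):
        # combinations() yields index tuples in increasing order,
        # so the relative order of the characters is never changed.
        for positions in combinations(range(len(phrase)), length):
            candidates["".join(phrase[p] for p in positions)] = None
    return list(candidates)


candidates = abbreviation_candidates("通商産業省")
print(len(candidates))
print(candidates)
```

For a phrase of n distinct characters this produces 2^n - 2 candidates, so restricting the enumeration to a subset of combinations, as the text allows, becomes important for long phrases.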

An abbreviation candidate that is a plausible abbreviation of the input phrase is used in contexts similar to those in which the input phrase is used. It is therefore assumed that the set of documents containing a plausible abbreviation candidate is composed of words similar to those in the set of documents containing the input phrase.

Based on this assumption, for each of the abbreviation candidates obtained by the abbreviation candidate creation unit 10, a set of documents containing the candidate is retrieved from the document database DB1. The "features of the document set" so obtained are compared with the "features of the document set" containing the input phrase, likewise obtained from the document database DB1, and the degree of similarity between them is computed as an "abbreviation score" indicating how abbreviation-like the candidate is.

The "features of a document set" are taken to be a frequency-annotated word set, in which every word contained in the document set is assigned its number of occurrences in the set, and this set is represented as a vector (a word-group vector).

Equation (1) yields an abbreviation score closer to 1 the more similar the word usage is in the two frequency-annotated word sets (that of the abbreviation candidate and that of the input phrase), and a score close to 0 otherwise.

Substituting into Equation (1) the frequency-annotated word set L_{w_c}(w) obtained from the document set containing an abbreviation candidate (for example, 「通産」) and the frequency-annotated word set L_{w_i}(w) obtained from the document set containing the input phrase (for example, 「通商産業省」) gives the abbreviation score S(w_c) for that candidate.

The abbreviation score calculation unit 20 searches the document database DB1 to acquire, for each of the abbreviation candidates obtained by the abbreviation candidate creation unit 10, the set of documents that use the candidate. It likewise uses the document database DB1 to acquire the set of documents that use the input phrase. By analyzing the two sets, it calculates for each candidate (for example, 「通産」) an abbreviation score S(w_c) indicating how likely the candidate is to be an abbreviation of the input phrase (for example, 「通商産業省」). The candidate most suitable as an abbreviation is then determined by comparing the calculated abbreviation scores S(w_c).

FIG. 2 is a flowchart showing the operation of the abbreviation score calculation unit 20.

When processing starts, the abbreviation score calculation unit 20 takes one abbreviation candidate from the abbreviation candidate buffer (S1).

Next, all documents that contain the abbreviation candidate but not the input phrase are acquired from the document database DB1. The number of such documents is stored as Num_c. The acquired documents are then morphologically analyzed and divided into morphemes, and for each morpheme in the resulting set, the number of acquired documents in which it appears is counted. This produces the frequency-annotated set of co-occurring words of the abbreviation candidate, which is a set of pairs of a morpheme and the number of documents in which it appears (S2).

Next, the meaning of the notation L_{w_c}(w) used in Equation (1) is explained.

For example, suppose that all documents containing the abbreviation candidate 「通産」 but not the input phrase 「通商産業省」 are acquired from the document database DB1, and that this yields a set of 1000 documents. The document count Num_c is then 1000. Suppose further that analyzing these 1000 documents shows that the morpheme 「日本」 (Japan) appears in 900 documents and the morpheme 「景気」 (economy) appears in 400 documents. Pairing each morpheme with the number of documents in which it appears, as in (「日本」, 900), (「景気」, 400), and so on, and collecting these pairs for every morpheme that appears in the document set gives the frequency-annotated set of co-occurring words.

To express the frequency-annotated set of co-occurring words of an abbreviation candidate w_c mathematically, the function L_{w_c}(w_k) = c_k is defined. This function returns the number of documents c_k in the document set in which a morpheme w_k appears; for example, L_{通産}(日本) = 900. For a morpheme that does not appear in the document set (the documents containing the candidate 「通産」 but not the input phrase 「通商産業省」), L_{w_c} returns 0. Hereinafter, when the frequency-annotated set of co-occurring words of some phrase w_x is written L_{w_x}(w), a function of the same kind is assumed to be defined.
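As a rough sketch of how such a frequency-annotated set could be tallied, the Python below counts, for each morpheme, the number of documents in which it appears. It assumes the documents have already been split into morphemes by some morphological analyzer; the function name `cooccurrence_counts` and the three toy documents are illustrative and do not come from the patent.

```python
from collections import Counter


def cooccurrence_counts(tokenized_docs: list[list[str]]) -> Counter:
    """Map each morpheme to the number of documents in which it appears."""
    counts: Counter = Counter()
    for tokens in tokenized_docs:
        # set() ensures a morpheme is counted once per document,
        # however many times it occurs inside that document.
        counts.update(set(tokens))
    return counts


# Toy stand-in for the documents retrieved for one abbreviation candidate.
docs = [
    ["日本", "景気", "日本"],
    ["日本", "政策"],
    ["景気", "回復"],
]
num_c = len(docs)                 # corresponds to Num_c
l_wc = cooccurrence_counts(docs)  # corresponds to L_{w_c}
print(num_c, l_wc["日本"], l_wc["景気"], l_wc["輸出"])
```

A `Counter` returns 0 for a key it has never seen, which matches the convention that L_{w_c} is 0 for morphemes absent from the document set.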

To simplify processing, the morphemes included in the co-occurring words of an abbreviation candidate may be restricted to certain parts of speech. In many cases, restricting them to nouns simplifies processing while limiting the loss of accuracy. Because morphological analysis errors often cause some nouns to be tagged as "unknown words", both nouns and unknown words may be included in the co-occurring words of the abbreviation candidate.

Likewise, to simplify processing, morphological analysis may be applied to only a predetermined fixed number C_c of documents, and the document count obtained for each word may be multiplied by Num_c / C_c, treating the result as equivalent to analyzing and tallying all documents that contain the abbreviation candidate. That is, the number of documents in which a word appears among the C_c documents can be regarded as the number of documents in which it appears among the Num_c documents multiplied by C_c / Num_c. Multiplying the word's document count among the C_c documents by Num_c / C_c therefore gives approximately the number of documents in which the word appears among the Num_c documents.
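The scaling step can be illustrated as follows. This is only a sketch: the text does not say how the C_c documents are chosen, so the uniform random sample used here is an assumption, as are the function name `estimate_counts` and its `seed` parameter.

```python
import random
from collections import Counter


def estimate_counts(tokenized_docs, sample_size, seed=0):
    """Tally morphemes on C_c sampled documents and scale the counts by Num_c / C_c."""
    num_c = len(tokenized_docs)
    c_c = min(sample_size, num_c)
    if c_c == 0:
        return {}
    # Assumption: the C_c documents are drawn uniformly at random.
    sample = random.Random(seed).sample(tokenized_docs, c_c)
    sampled = Counter()
    for tokens in sample:
        sampled.update(set(tokens))  # document frequency within the sample
    scale = num_c / c_c
    return {word: count * scale for word, count in sampled.items()}


docs = [["日本", "景気"]] * 10
print(estimate_counts(docs, sample_size=5))
```

As a worked figure, with Num_c = 1000 and C_c = 100, a word found in 40 of the 100 analyzed documents is estimated to appear in 40 × 1000 / 100 = 400 documents.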

It is determined whether the abbreviation candidate is a part of the input phrase (S3). For example, for the input phrase 「通商産業省」, the candidate 「産業」 is a part of the input phrase because it appears in it as a contiguous string, whereas 「通産」 is not a part of the input phrase.

If S3 determines that the abbreviation candidate is not a part of the input phrase, all documents that contain the input phrase but not the abbreviation candidate are acquired from the document database DB1. In the same manner as in S2, the document count Num_i and the frequency-annotated set of co-occurring words of the input phrase, L_{w_i}(w_k), which is a set of pairs of a morpheme w_k contained in the documents and the number of acquired documents c_k in which w_k appears, are recorded (S4).

The abbreviation score calculation unit 20 treats two frequency-annotated sets as vectors: the set of co-occurring words of the abbreviation candidate, which tallies the words appearing in the document set containing the candidate together with their counts, and the set of co-occurring words of the input phrase, which tallies the words appearing in the document set containing the input phrase together with their counts. It takes the cosine similarity of these two vectors as the abbreviation score of the candidate.
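A cosine similarity between two such frequency-annotated sets might be computed as in the sketch below. The function name `cosine_score` and the decision to return 0.0 when either set is empty are assumptions; the text does not describe that edge case.

```python
import math


def cosine_score(l_wc: dict, l_wi: dict) -> float:
    """Cosine similarity of two word-to-count mappings viewed as sparse vectors."""
    # Only words present in both mappings contribute to the dot product.
    dot = sum(count * l_wi.get(word, 0) for word, count in l_wc.items())
    norm_c = math.sqrt(sum(v * v for v in l_wc.values()))
    norm_i = math.sqrt(sum(v * v for v in l_wi.values()))
    if norm_c == 0 or norm_i == 0:
        return 0.0  # assumption: an empty document set scores 0
    return dot / (norm_c * norm_i)


print(cosine_score({"日本": 900, "景気": 400}, {"日本": 450, "景気": 200}))
print(cosine_score({"日本": 900}, {"野球": 300}))
```

Because document counts are never negative, the score stays between 0 and 1, and it depends only on the proportions of the counts, not on how many documents each set contains.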

When the abbreviation candidate being processed is a part of the notation of the input phrase, as 「熟語」 is of 「四文字熟語」, the abbreviation score calculation unit 20 does not compare the document set containing the candidate with the document set containing the input phrase. Instead, it compares the document set that contains the candidate but not the input phrase with the document set containing the input phrase. This removes the effect on extraction accuracy of documents containing the input phrase being included in both the candidate's document set and the input phrase's document set.

If S3 determines that the abbreviation candidate is a part of the input phrase, all documents containing the input phrase are acquired from the document database DB1 and processed in the same way as in S2. The resulting document count Num_i and the frequency-annotated set of co-occurring words of the input phrase, L_{w_i}(w_k), which is a set of pairs of each morpheme w_k contained in the documents and the number of documents c_k in which it appears, are recorded (S5).
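The choice of document sets in S2 through S5 can be summarized in a short sketch. It assumes, purely for illustration, that the document database is an in-memory list of strings and that "contains" means plain substring matching; the function name `select_document_sets` and the sample documents are hypothetical.

```python
def select_document_sets(documents, phrase, candidate):
    """Return (candidate_docs, phrase_docs) following the S2-S5 branching."""
    # S2: documents that contain the candidate but not the input phrase.
    candidate_docs = [d for d in documents if candidate in d and phrase not in d]
    if candidate in phrase:
        # S3 yes -> S5: every document that contains the input phrase.
        phrase_docs = [d for d in documents if phrase in d]
    else:
        # S3 no -> S4: documents with the input phrase but without the candidate.
        phrase_docs = [d for d in documents if phrase in d and candidate not in d]
    return candidate_docs, phrase_docs


documents = [
    "通産の発表",
    "通商産業省の発表",
    "通商産業省（通産）の発表",
    "産業の動向",
]
print(select_document_sets(documents, "通商産業省", "通産"))
print(select_document_sets(documents, "通商産業省", "産業"))
```

With 「通産」 the two sets are disjoint, whereas with 「産業」 the exclusion is applied only on the candidate side, since every document containing the input phrase necessarily contains that candidate.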

Using the frequency-annotated set of co-occurring words of the abbreviation candidate, L_{w_c}(w), and that of the input phrase, L_{w_i}(w), the abbreviation score S(w_c) is calculated by Equation (1) (S6).
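The image of Equation (1) is not reproduced in this text. Since the score is described as the cosine similarity of the two frequency-annotated sets viewed as vectors, one reconstruction consistent with that description would be the following, where the sums run over all morphemes w. The exact form in the original publication is an assumption here.

```latex
S(w_c) = \frac{\sum_{w} L_{w_c}(w)\, L_{w_i}(w)}
              {\sqrt{\sum_{w} L_{w_c}(w)^{2}}\;\sqrt{\sum_{w} L_{w_i}(w)^{2}}}
\tag{1}
```

Because every L value is a non-negative document count, this expression lies between 0 and 1, which agrees with the stated behavior that the score approaches 1 for similar word usage and 0 otherwise.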

The abbreviation candidate w_c and its calculated abbreviation score S(w_c) are output as a pair to an abbreviation score buffer provided in the abbreviation determination unit 30 (S7).

FIG. 3 is a diagram showing an example of the abbreviation candidates accumulated in the abbreviation score buffer and their abbreviation scores.

It is then determined whether any abbreviation candidates remain in the abbreviation candidate buffer. If so, the process returns to S1 and continues. If not, the processing of the abbreviation score calculation unit 20 ends (S8).

When the abbreviation score calculation unit 20 finishes, the abbreviation determination unit 30 retrieves the pairs of abbreviation candidate and abbreviation score from the abbreviation score buffer, sorts them in descending order of abbreviation score, and outputs to the abbreviation output device W2 only those pairs whose abbreviation score is greater than or equal to a predetermined value.
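The determination step amounts to a sort followed by a threshold filter, as in this minimal sketch. The function name `select_abbreviations`, the threshold value, and the scores shown are made up for illustration and are not results from the patent.

```python
def select_abbreviations(scored, threshold):
    """Sort (candidate, score) pairs by descending score and keep those at or above threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(candidate, score) for candidate, score in ranked if score >= threshold]


# Illustrative scores only.
score_buffer = [("産業", 0.21), ("通産省", 0.88), ("商業", 0.10), ("通産", 0.74)]
print(select_abbreviations(score_buffer, threshold=0.5))
```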

FIG. 4 is a block diagram showing an abbreviation extraction apparatus 200 according to Embodiment 2 of the present invention.

Connected to the abbreviation extraction apparatus 200 are a foreign word dictionary database DB2, the document database DB1 in which a large number of documents are stored, the phrase input device W1 used to input an input phrase to the abbreviation extraction apparatus 200, and the abbreviation output device W2 that outputs the abbreviation information produced by the abbreviation extraction apparatus 200.

The foreign word dictionary database DB2 stores a large number of pairs of the Japanese notation of a foreign word and the corresponding English notation, so that the English notation can be obtained from the Japanese notation. For example, entering the Japanese notation 「ベースボール」 into the foreign word dictionary database DB2 returns the corresponding English notation "baseball".

The abbreviation extraction apparatus 200 includes an abbreviation candidate creation unit 11, an abbreviation score calculation unit 21, and an abbreviation determination unit 31.

The abbreviation candidate creation unit 11 identifies one or more foreign words in the input phrase. For every combination of replacing or not replacing each foreign word with the initial letter of its English notation, it creates abbreviation candidates by using a part of the resulting phrase, and stores them in the abbreviation candidate buffer of the abbreviation score calculation unit 21.

FIG. 5 is a flowchart showing the operation of the abbreviation candidate creation unit 11 in the abbreviation extraction apparatus 200.

On receiving the input phrase from the phrase input device W1, the abbreviation candidate creation unit 11 morphologically analyzes the input phrase, divides it into morphemes, and stores them in a divided phrase buffer. For example, when the input phrase is 「パーソナルコンピュータ専門誌」 (personal computer specialty magazine), it is divided into the three morphemes 「パーソナル」, 「コンピュータ」, and 「専門誌」, which are stored in the divided phrase buffer as an array of three strings (S21).

FIG. 6 is a diagram showing an example of the morphemes stored in the divided phrase buffer.

Next, for each morpheme stored in the divided phrase buffer, the foreign word dictionary database DB2 is consulted to check whether a corresponding English notation exists. If one exists, its initial letter is associated with the morpheme, and the results are stored as an array in an initial-annotated phrase buffer. Consider the case where the divided phrase buffer holds 「パーソナル」, 「コンピュータ」, and 「専門誌」. If the English notation "Personal" is obtained for 「パーソナル」 and "Computer" for 「コンピュータ」, the two-level array [「パーソナル」, "P"], [「コンピュータ」, "C"], [「専門誌」, ""] is stored in the initial-annotated phrase buffer (S22). Here "" indicates that there is no corresponding initial.
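Step S22 can be sketched with an ordinary dictionary standing in for the foreign word dictionary database DB2. The mapping `FOREIGN_WORDS` and the function name `attach_initials` are hypothetical; a real DB2 would be far larger and might be queried differently.

```python
# Stand-in for DB2: Japanese notation -> English notation (illustrative entries).
FOREIGN_WORDS = {
    "パーソナル": "Personal",
    "コンピュータ": "Computer",
}


def attach_initials(morphemes, dictionary):
    """Pair each morpheme with the initial of its English notation, or "" if none exists."""
    pairs = []
    for morpheme in morphemes:
        english = dictionary.get(morpheme, "")
        pairs.append((morpheme, english[:1]))  # slicing "" safely yields ""
    return pairs


divided = ["パーソナル", "コンピュータ", "専門誌"]
print(attach_initials(divided, FOREIGN_WORDS))
```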

FIG. 7 is a diagram showing an example of the child arrays contained in the two-level array stored in the initial-annotated phrase buffer.

For each child array in the two-level array in the initial-annotated phrase buffer, taken in order from the front, one of its two elements (the morpheme or the initial of its English notation) is selected, and the selections are concatenated to create a "converted input phrase". Each row in FIG. 7 is one child array of the two-level array; in the example of FIG. 7, 「パーソナル」 and "P" form one child array. A converted input phrase is thus a phrase in which foreign-word portions of the input phrase are replaced with the initials of the corresponding English notation.

The creation of converted input phrases is performed for every possible combination, and the resulting converted input phrases are accumulated in turn in a converted input phrase buffer. For example, when the initial-annotated phrase buffer holds [「パーソナル」, "P"], [「コンピュータ」, "C"], [「専門誌」, ""], the converted input phrases are 「パーソナルコンピュータ専門誌」, 「Pコンピュータ専門誌」, 「パーソナルC専門誌」, and 「PC専門誌」, and these are accumulated in the converted input phrase buffer (S23).
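Enumerating the converted input phrases in S23 is a Cartesian product over the per-morpheme choices, which the sketch below illustrates. The function name `converted_input_phrases` is hypothetical, and the order in which the phrases are produced is an artifact of `itertools.product`, not something the text specifies.

```python
from itertools import product


def converted_input_phrases(pairs):
    """Build every phrase obtained by keeping each morpheme or swapping in its initial."""
    # A morpheme with no initial ("") offers only one choice: itself.
    options = [(morpheme, initial) if initial else (morpheme,) for morpheme, initial in pairs]
    return ["".join(choice) for choice in product(*options)]


pairs = [("パーソナル", "P"), ("コンピュータ", "C"), ("専門誌", "")]
for phrase in converted_input_phrases(pairs):
    print(phrase)
```

With k replaceable foreign words this yields 2^k converted input phrases, each of which is then expanded into character-level candidates in S24.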

The creation of converted input phrases may instead be performed for only a subset of the possible combinations rather than for all of them.

One conversion input phrase is taken from the conversion input phrase buffer and processed in the same way as in the abbreviation candidate creation unit 10 of Embodiment 1 to create a plurality of abbreviation candidates, which are recorded in the abbreviation candidate buffer of the abbreviation score calculation unit 21 (S24).

変換入力語句バッファ中に、変換入力語句が残っているかどうかを判断し、残っていれば、Ｓ２４に戻り、処理を継続する。残っていなければ、略語候補作成部１１の処理を終了する。 It is determined whether or not a conversion input word / phrase remains in the conversion input word / phrase buffer. If it remains, the process returns to S24 to continue the processing. If not, the process of the abbreviation candidate creation unit 11 is terminated.

For each of the abbreviation candidates created by the abbreviation candidate creation unit 11, the abbreviation score calculation unit 21 searches the document database DB1 to obtain the set of documents that use the candidate, and likewise searches the document database DB1 to obtain the set of documents that use the input phrase. By analyzing and comparing these two sets, it calculates an abbreviation score indicating how plausible the candidate is as an abbreviation of the input phrase.

図８は、略語スコア算出部２１の処理を示すフローチャートである。 FIG. 8 is a flowchart showing the processing of the abbreviation score calculation unit 21.

When the abbreviation candidate creation unit 11 finishes its processing, the abbreviation score calculation unit 21 executes steps S1 to S6 in the same way as the abbreviation score calculation unit 20 of Embodiment 1.

When the abbreviation score S(w_c) of the abbreviation candidate w_c is obtained in S6, the abbreviation candidate w_c, its abbreviation score S(w_c), and the total number Num_c of documents containing the abbreviation are output as a set to the abbreviation score buffer provided in the abbreviation determination unit 31 (S7).

略語候補バッファに、略語候補が残っているかどうかを判定し、略語候補が残っていれば、Ｓ１に戻り、処理を継続する。略語候補が残っていなければ、Ｓ３９に進む（Ｓ８）。 It is determined whether an abbreviation candidate remains in the abbreviation candidate buffer. If an abbreviation candidate remains, the process returns to S1 to continue the processing. If no abbreviation candidates remain, the process proceeds to S39 (S8).

If no abbreviation candidates remain in the abbreviation candidate buffer, the document database DB1 is accessed to obtain the number Num_i of documents containing the input phrase, which is recorded in the input phrase document count buffer of the abbreviation score calculation unit 540, and the processing ends (S39).

When the abbreviation score calculation unit 21 finishes its processing, the abbreviation determination unit 31 takes out the pairs of abbreviation candidate and abbreviation score from the abbreviation score buffer and sorts them in descending order of abbreviation score. Among the abbreviation candidates whose abbreviation score is equal to or greater than a predetermined value, only those for which Num_c / Num_i (the total number Num_c of documents containing the candidate divided by the total number Num_i of documents containing the input phrase) falls within a predetermined range are determined to be abbreviations of the input phrase and are output to the abbreviation output device W2.
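The determination rule can be sketched as a two-stage filter. The function name, the tuple layout, and the threshold values in the usage example are illustrative assumptions; the text says only that the threshold and the range are predetermined.

```python
def select_abbreviations(scored, num_i, min_score, ratio_range):
    """Pick abbreviations from (candidate, score, num_c) tuples.

    scored      -- list of (candidate, abbreviation score, Num_c)
    num_i       -- Num_i, number of documents containing the input phrase
    min_score   -- predetermined score threshold (assumed value)
    ratio_range -- (low, high) bounds for Num_c / Num_i (assumed values)
    """
    low, high = ratio_range
    # Sort in descending order of abbreviation score, as in the text.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    selected = []
    for candidate, score, num_c in ranked:
        if score < min_score:
            continue  # score below the predetermined value
        if num_i > 0 and low <= num_c / num_i <= high:
            selected.append(candidate)
    return selected


# Made-up numbers purely for illustration.
scored = [("PC誌", 0.4, 5), ("PC専門誌", 0.9, 300), ("専門", 0.8, 50000)]
print(select_abbreviations(scored, num_i=1000, min_score=0.5, ratio_range=(0.01, 10)))
```

In this made-up example 「専門」 is rejected even though its score is high, because it appears in far more documents than the input phrase.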

図９は、本発明の実施例３である略語抽出装置３００の構成を示すブロック図である。 FIG. 9 is a block diagram illustrating a configuration of an abbreviation extraction apparatus 300 that is Embodiment 3 of the present invention.

略語抽出装置３００は、辞書読込部４０と、略語候補作成部１０と、略語スコア算出部２０と、略語判定部３０と、辞書書込部５０とを有する。略語候補作成部１０と、略語スコア算出部２０と、略語判定部３０とは、実施例１における構成要素と同じものである。 The abbreviation extraction apparatus 300 includes a dictionary reading unit 40, an abbreviation candidate creation unit 10, an abbreviation score calculation unit 20, an abbreviation determination unit 30, and a dictionary writing unit 50. The abbreviation candidate creation unit 10, the abbreviation score calculation unit 20, and the abbreviation determination unit 30 are the same as the components in the first embodiment.

略語抽出装置３００には、文書データベースＤＢ１と、語句辞書データベースＤＢ３とが接続されている。文書データベースＤＢ１は、大量の文書が蓄積されている。 To the abbreviation extraction device 300, a document database DB1 and a phrase dictionary database DB3 are connected. A large amount of documents are accumulated in the document database DB1.

語句辞書データベースＤＢ３は、略語抽出装置３００への入力対象であり、出力対象である語句を格納している。また、語句辞書データベースＤＢ３は、処理済フラグと略語フラグとを付加して、様々な名称を示す語句を蓄積している。上記「処理済フラグ」は、当該語句が処理済であることを示すフラグであり、上記「略語フラグ」は、当該語句が略語であることを示すフラグである。 The phrase dictionary database DB3 stores phrases that are input targets to the abbreviation extraction apparatus 300 and are output targets. The phrase dictionary database DB3 accumulates phrases indicating various names by adding a processed flag and an abbreviation flag. The “processed flag” is a flag indicating that the word has been processed, and the “abbreviation flag” is a flag indicating that the word is an abbreviation.

図１０は、語句辞書データベースＤＢ３に蓄積されている語句と、処理済フラグと、略語フラグとの例を示す図である。 FIG. 10 is a diagram illustrating examples of phrases, processed flags, and abbreviation flags stored in the phrase dictionary database DB3.

When the abbreviation extraction apparatus 300 starts processing, the dictionary reading unit 40 reads from the phrase dictionary database DB3 one phrase whose processed flag is "0" (an unprocessed phrase) and inputs it to the abbreviation candidate creation unit 10. It then changes the processed flag of that phrase in the phrase dictionary database DB3 to "1".

Next, the abbreviation candidate creation unit 10, the abbreviation score calculation unit 20, and the abbreviation determination unit 30 perform the same processing as in Embodiment 1, and output to the dictionary writing unit 50 only the pairs of abbreviation candidate and abbreviation score whose score is equal to or greater than a predetermined value.

The dictionary writing unit 50 receives one or more pairs of abbreviation candidate and abbreviation score from the abbreviation determination unit 30 and, for each abbreviation candidate, sets the abbreviation flag to "1" and the processed flag to "1" and writes the candidate to the phrase dictionary database DB3.

When no phrase whose processed flag is "0" remains in the phrase dictionary database DB3, the abbreviation extraction apparatus 300 ends the processing.
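The processing loop of Embodiment 3 can be sketched with an in-memory dictionary standing in for the phrase dictionary database DB3. The field names `processed` and `abbreviation` and the callback `extract_abbreviations` (which stands in for units 10, 20, and 30 together) are assumptions made for this illustration.

```python
def process_dictionary(entries, extract_abbreviations):
    """Run the read / extract / write-back loop over a phrase dictionary.

    entries -- dict: phrase -> {"processed": 0 or 1, "abbreviation": 0 or 1}
    extract_abbreviations -- hypothetical callable returning the accepted
                             abbreviations of a phrase
    """
    while True:
        # Dictionary reading unit: find phrases whose processed flag is 0.
        pending = [p for p, flags in entries.items() if flags["processed"] == 0]
        if not pending:
            break  # nothing left to process
        phrase = pending[0]
        entries[phrase]["processed"] = 1
        # Dictionary writing unit: register each abbreviation with both flags set.
        for abbreviation in extract_abbreviations(phrase):
            entries[abbreviation] = {"processed": 1, "abbreviation": 1}
    return entries


entries = {"パーソナルコンピュータ": {"processed": 0, "abbreviation": 0}}
print(process_dictionary(entries, lambda phrase: ["パソコン"]))
```

Because written abbreviations already carry a processed flag of 1, they are not fed back in as input phrases, so the loop ends once every original entry has been handled.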

In other words, the above embodiments are examples of an abbreviation extraction method that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase, the method comprising: an abbreviation candidate creation step of creating a plurality of abbreviation candidates by taking out and combining, from the characters contained in the input phrase, fewer characters than the number of characters constituting the input phrase, and storing them in a storage device; a document set acquisition step of acquiring, for each of the abbreviation candidates created in the abbreviation candidate creation step, the set of documents containing the abbreviation candidate and the set of documents containing the input phrase from an external document database, and storing them in a storage device; and an abbreviation score calculation step of calculating an abbreviation score, which indicates how plausible the abbreviation candidate is as an abbreviation of the input phrase, by quantifying the degree of similarity between the set of documents containing the abbreviation candidate and the set of documents containing the input phrase, and storing the score in a storage device.

In this case, the abbreviation candidate creation step replaces a foreign word part contained in the input phrase with alphabetic notation, and creates abbreviation candidates that use alphabetic notation by taking out and combining, from the characters of the resulting phrase, fewer characters than the number of characters constituting the input phrase.
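Taking out fewer characters and combining them can be sketched as enumerating subsequences. This sketch assumes, as the claims state, that the relative order of the characters is preserved; it bounds the length by the phrase passed in, and the `min_length` parameter is a hypothetical addition. The number of candidates grows exponentially with phrase length, so a practical system would presumably restrict the enumeration.

```python
from itertools import combinations


def abbreviation_candidates(phrase, min_length=1):
    """Return every order-preserving selection of fewer characters
    than the phrase itself contains."""
    candidates = set()
    for length in range(min_length, len(phrase)):
        # Index tuples from combinations() are increasing, so the
        # original character order is never changed.
        for indices in combinations(range(len(phrase)), length):
            candidates.add("".join(phrase[i] for i in indices))
    return candidates


candidates = abbreviation_candidates("PC専門誌")
print("PC誌" in candidates)      # an order-preserving selection
print("PC専門誌" in candidates)  # the full phrase is never a candidate
```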

Further, when calculating the abbreviation score of an abbreviation candidate, the abbreviation score calculation step includes: a step of quantifying, if the abbreviation candidate is a part of the input phrase, the degree of similarity of document content between the set of documents that contain the abbreviation candidate but not the input phrase and the set of documents that contain the input phrase; and a step of quantifying, if the abbreviation candidate is not a part of the input phrase, the degree of similarity of document content between the set of documents that contain the abbreviation candidate and the set of documents that contain the input phrase.
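The case distinction above can be sketched as follows, treating "a part of the input phrase" as a contiguous substring and documents as plain strings; both are simplifying assumptions, and the function name is hypothetical. Presumably the exclusion exists because every document containing the input phrase would otherwise also count as containing the candidate.

```python
def candidate_document_set(documents, candidate, input_phrase):
    """Select the documents to compare against the input phrase's documents."""
    if candidate in input_phrase:
        # Candidate is part of the input phrase: keep only documents that
        # contain the candidate without containing the input phrase.
        return [d for d in documents if candidate in d and input_phrase not in d]
    # Otherwise every document containing the candidate is used.
    return [d for d in documents if candidate in d]


docs = [
    "パーソナルコンピュータ専門誌を買った",
    "専門誌の特集を読んだ",
    "PC専門誌の最新号",
]
print(candidate_document_set(docs, "専門誌", "パーソナルコンピュータ専門誌"))
print(candidate_document_set(docs, "PC専門誌", "パーソナルコンピュータ専門誌"))
```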

Furthermore, when calculating the abbreviation score of an abbreviation candidate, the abbreviation score calculation step quantifies, based on the appearance frequency of each phrase, the degree of agreement between the phrases in the frequency-attached set of co-occurring phrases of the abbreviation candidate, obtained by tallying the phrases that appear in the set of documents containing the abbreviation candidate, and the phrases in the frequency-attached set of co-occurring phrases of the input phrase, obtained by tallying the phrases that appear in the set of documents containing the input phrase.
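To make frequency-based agreement concrete, the sketch below uses cosine similarity between two frequency tables. This is a swapped-in stand-in chosen for illustration only; it is not the patent's formula (1), which is not reproduced in this section.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two phrase-frequency tables.

    NOTE: an illustrative stand-in, not the scoring formula of the patent.
    """
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty table agrees with nothing
    return dot / (norm_a * norm_b)


candidate_freq = Counter({"雑誌": 3, "発売": 1})
input_freq = Counter({"雑誌": 6, "発売": 2})
print(cosine_similarity(candidate_freq, input_freq))
```

Tables with the same relative frequencies score close to 1, and tables with no phrases in common score 0.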

Moreover, the abbreviation score calculation step defines the frequency-attached set of co-occurring phrases of the abbreviation candidate, obtained by tallying the phrases that appear in the set of documents containing the abbreviation candidate w_c, as a function Lw_c(w_k) = c_k that, for any phrase w_k, returns the number of occurrences c_k of w_k in the set of documents containing w_c. It likewise defines the frequency-attached set of co-occurring phrases of the input phrase, obtained by tallying the phrases that appear in the set of documents containing the input phrase w_i, as a function Lw_i(w_l) = c_l that, for any phrase w_l, returns the number of occurrences c_l of w_l in the set of documents containing w_i. Using Lw_c and Lw_i, the abbreviation score S(w_c) of the abbreviation candidate w_c is calculated by the above formula (1).
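The functions Lw_c and Lw_i behave like lookup tables that return 0 for phrases that never occur, which `collections.Counter` models directly. The sketch assumes the documents have already been split into phrases (for example by morphological analysis); the function name `frequency_function` is hypothetical, and formula (1) itself is not implemented here.

```python
from collections import Counter


def frequency_function(tokenized_documents):
    """Tally how often each phrase occurs across a set of documents.

    The returned Counter can be indexed with any phrase and yields 0
    for phrases that never occur, like Lw_c(w_k) = c_k in the text.
    """
    counts = Counter()
    for tokens in tokenized_documents:
        counts.update(tokens)
    return counts


# Pre-tokenized documents that contain the abbreviation candidate (made up).
docs_with_candidate = [["PC", "雑誌", "発売"], ["PC", "雑誌"]]
Lw_c = frequency_function(docs_with_candidate)
print(Lw_c["雑誌"])  # 2
print(Lw_c["新聞"])  # 0
```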

そして、上記略語スコア算出ステップの後に、上記略語スコアが、予め定められた値以上である略語候補のみを出力し、上記入力語句の略語として確からしい略語語句のみを出力するステップを有する。 Then, after the abbreviation score calculating step, there is a step of outputting only abbreviation candidates whose abbreviation score is equal to or greater than a predetermined value and outputting only probable abbreviations as abbreviations of the input phrases.

Moreover, after the abbreviation score calculation step, there is a step of outputting only those abbreviation candidates for which the value obtained by dividing the total number of documents containing the abbreviation candidate by the total number of documents containing the input phrase falls within a predetermined range, so that only abbreviations frequently used as abbreviations of the input phrase are output.

The above embodiments are also examples of an abbreviation extraction apparatus that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase, the apparatus comprising: abbreviation candidate creation means for creating a plurality of abbreviation candidates by taking out and combining, from the characters contained in the input phrase, fewer characters than the number of characters constituting the input phrase, and storing them in a storage device; document set acquisition means for acquiring, for each of the abbreviation candidates created by the abbreviation candidate creation means, the set of documents containing the abbreviation candidate and the set of documents containing the input phrase from an external document database, and storing them in a storage device; and abbreviation score calculation means for calculating an abbreviation score, which indicates how plausible the abbreviation candidate is as an abbreviation of the input phrase, by quantifying the degree of similarity between the set of documents containing the abbreviation candidate and the set of documents containing the input phrase, and storing the score in a storage device.

The above embodiments can also be regarded as an invention of a program. That is, the above embodiments are examples of a program that causes a computer to execute the abbreviation extraction method according to any one of claims 1 to 7.

FIG. 1 is a block diagram showing the abbreviation extraction apparatus 100 according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the abbreviation score calculation unit 20.
FIG. 3 is a diagram showing an example of the relationship between the abbreviation candidates stored in the abbreviation score buffer and their abbreviation scores.
FIG. 4 is a block diagram showing the abbreviation extraction apparatus 200 according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing the operation of the abbreviation candidate creation unit 11 in the abbreviation extraction apparatus 200.
FIG. 6 is a diagram showing an example of the morphemes stored in the divided word buffer.
FIG. 7 is a diagram showing an example of each child array contained in the double array stored in the initial phrase buffer.
FIG. 8 is a flowchart showing the processing of the abbreviation score calculation unit 21.
FIG. 9 is a block diagram showing the configuration of the abbreviation extraction apparatus 300 according to Embodiment 3 of the present invention.
FIG. 10 is a diagram showing an example of the phrases, processed flags, and abbreviation flags stored in the phrase dictionary database DB3.

Explanation of symbols

100 ... abbreviation extraction apparatus,
10 ... abbreviation candidate creation unit,
20 ... abbreviation score calculation unit,
30 ... abbreviation determination unit,
W1 ... phrase input device,
W2 ... abbreviation output device,
DB1 ... document database,
200 ... abbreviation extraction apparatus,
11 ... abbreviation candidate creation unit,
21 ... abbreviation score calculation unit,
31 ... abbreviation determination unit,
DB2 ... foreign word dictionary database,
300 ... abbreviation extraction apparatus,
40 ... dictionary reading unit,
50 ... dictionary writing unit,
DB3 ... phrase dictionary database.

Claims

In an abbreviation extraction method that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase,
An abbreviation candidate creation step of obtaining, as abbreviation candidates, all combinations in which fewer characters than the number of characters constituting the input phrase are taken out from all the characters constituting the input phrase without changing the relative order of the characters in the input phrase, and storing them in a storage device;
For each of the plurality of abbreviation candidates created in the abbreviation candidate creation step, a set of documents including the abbreviation candidates and a set of documents including the input words / phrases are acquired from an external document database and stored in a storage device. A document set acquisition step for storing;
By calculating the similarity between the set of documents including the abbreviation candidate and the set of documents including the input phrase, an abbreviation score indicating the probability as the abbreviation of the abbreviation candidate for the input phrase is calculated. An abbreviation score calculation step stored in the storage device;
An abbreviation extraction method characterized by comprising:

In an abbreviation extraction method that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase,
A division step of dividing the input word / phrase into morphemes and storing them in a storage device;
A foreign word initial acquisition step of, for each morpheme, acquiring the initial of the foreign word notation and storing it in a storage device if a foreign word notation corresponding to the morpheme exists in a foreign word database that stores pairs of a native language notation and the corresponding foreign word notation;
A conversion input phrase acquisition step of obtaining conversion input phrases by combining, for all the morphemes constituting the input phrase and without changing the order of the morphemes, either each morpheme or the initial of the foreign word notation corresponding to that morpheme, and storing them in a storage device;
An abbreviation candidate creation step of obtaining, as abbreviation candidates, all combinations in which fewer characters than the number of characters constituting the conversion input phrase are taken out from all the characters constituting the conversion input phrase without changing the relative order of the characters in the conversion input phrase, and storing them in a storage device;
For each of the plurality of abbreviation candidates created in the abbreviation candidate creation step, a set of documents including the abbreviation candidates and a set of documents including the input words / phrases are acquired from an external document database and stored in a storage device. A document set acquisition step for storing;
By calculating the similarity between the set of documents including the abbreviation candidate and the set of documents including the input phrase, an abbreviation score indicating the probability as the abbreviation of the abbreviation candidate for the input phrase is calculated. An abbreviation score calculation step stored in the storage device;
An abbreviation extraction method characterized by comprising:

In claim 1,
The abbreviation score calculation step includes:
When calculating the abbreviation score of the abbreviation candidate, if the abbreviation candidate is a part of the input phrase, a set of documents including the abbreviation candidate and not including the input phrase, and a set of documents including the input phrase Quantifying the degree of similarity, which is the degree to which the document contents are similar to each other;
When the abbreviation candidate is not a part of the input word / phrase, the degree of similarity, which is the degree of similarity of the document contents, between the set of documents including the candidate abbreviation and the set of documents including the input word / phrase is quantified. Steps and;
An abbreviation extraction method comprising the steps of:

In claim 1,
The abbreviation score calculation step is a step of, when calculating the abbreviation score of the abbreviation candidate, quantifying, based on the appearance frequency of each phrase, the degree of agreement between the phrases included in the frequency-attached set of co-occurring phrases of the abbreviation candidate, obtained by tallying the phrases that appear in the set of documents including the abbreviation candidate, and the phrases included in the frequency-attached set of co-occurring phrases of the input phrase, obtained by tallying the phrases that appear in the set of documents including the input phrase. An abbreviation extraction method characterized by the above.

In claim 1 or claim 4,
The abbreviation score calculation step includes:
Defines the frequency-attached set of co-occurring phrases of the abbreviation candidate, obtained by tallying the phrases that appear in the set of documents including the abbreviation candidate w_c, as a function Lw_c(w_k) = c_k that, when any phrase w_k is substituted, returns the number of occurrences c_k of the phrase w_k in the set of documents including the abbreviation candidate w_c,
Defines the frequency-attached set of co-occurring phrases of the input phrase, obtained by tallying the phrases that appear in the set of documents including the input phrase w_i, as a function Lw_i(w_l) = c_l that, when any phrase w_l is substituted, returns the number of occurrences c_l of the phrase w_l in the set of documents including the input phrase w_i,
And is a step of calculating the abbreviation score S(w_c) of the abbreviation candidate w_c by the following formula (1) using Lw_c and Lw_i. An abbreviation extraction method characterized by the above.

In claim 1,
An abbreviation extraction method characterized by comprising, after the abbreviation score calculation step, a step of outputting only abbreviation candidates whose abbreviation score is equal to or greater than a predetermined value, so that only phrases that are plausible as abbreviations of the input phrase are output.

In claim 1,
An abbreviation extraction method characterized by comprising, after the abbreviation score calculation step, a step of outputting only abbreviation candidates for which the value obtained by dividing the total number of documents including the abbreviation candidate by the total number of documents including the input phrase falls within a predetermined range, so that only phrases frequently used as abbreviations of the input phrase are output.

In an abbreviation extraction device that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase,
Abbreviation candidate creation means for obtaining, as abbreviation candidates, all combinations in which fewer characters than the number of characters constituting the input phrase are taken out from all the characters constituting the input phrase without changing the relative order of the characters in the input phrase, and storing them in a storage device;
For each of the plurality of abbreviation candidates created by the abbreviation candidate creation means, a set of documents including the abbreviation candidates and a set of documents including the input words / phrases are acquired from an external document database and stored in a storage device. A document set acquisition means to perform;
By calculating the similarity between the set of documents including the abbreviation candidate and the set of documents including the input phrase, an abbreviation score indicating the probability as the abbreviation of the abbreviation candidate for the input phrase is calculated. Abbreviation score calculation means stored in the storage device;
An abbreviation extraction device characterized by comprising:

In an abbreviation extraction device that analyzes a plurality of documents and automatically extracts abbreviations of an input phrase,
Dividing means for dividing the input words into morphemes and storing them in a storage device;
Foreign word initial acquisition means for, for each morpheme, acquiring the initial of the foreign word notation and storing it in a storage device if a foreign word notation corresponding to the morpheme exists in a foreign word database that stores pairs of a native language notation and the corresponding foreign word notation;
Conversion input phrase acquisition means for obtaining conversion input phrases by combining, for all the morphemes constituting the input phrase and without changing the order of the morphemes, either each morpheme or the initial of the foreign word notation corresponding to that morpheme, and storing them in a storage device;
Abbreviation candidate creation means for obtaining, as abbreviation candidates, all combinations in which fewer characters than the number of characters constituting the conversion input phrase are taken out from all the characters constituting the conversion input phrase without changing the relative order of the characters in the conversion input phrase, and storing them in a storage device;
For each of the plurality of abbreviation candidates created by the abbreviation candidate creation means, a set of documents including the abbreviation candidates and a set of documents including the input words / phrases are acquired from an external document database and stored in a storage device. A document set acquisition means for storing;
By calculating the similarity between the set of documents including the abbreviation candidate and the set of documents including the input phrase, an abbreviation score indicating the probability as the abbreviation of the abbreviation candidate for the input phrase is calculated. Abbreviation score calculation means stored in the storage device;
An abbreviation extraction device characterized by comprising:

A program that causes a computer to execute the abbreviation extraction method according to any one of claims 1 to 7.