JP2007213157A

JP2007213157A - Example sentence retrieval device and example sentence retrieval method

Info

Publication number: JP2007213157A
Application number: JP2006030103A
Authority: JP
Inventors: Masateru Rikitoku; 正輝力徳
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2006-02-07
Filing date: 2006-02-07
Publication date: 2007-08-23

Abstract

PROBLEM TO BE SOLVED: To address the problem that retrieving example sentences usable for document preparation is inefficient.

SOLUTION: A pattern extraction part 40 is provided with a text conversion part 42 that reads a document set designated by a user from among the document sets stored in a document storage part 70 and converts the document set into text data; a conversion part 44 that segments the text data into individual sentences; an extraction execution part 46 that extracts word string patterns using a sequence pattern mining algorithm; and a pattern information writing part 48 that stores each extracted pattern in a pattern information storage part 80 in association with its frequency of appearance and the sentences that include it. A retrieval part 60 is provided with a retrieval execution part 62 that detects patterns from a keyword input by the user; a pattern output part 64 that displays a list of the detected patterns and their frequencies; and an example sentence output part 66 that displays example sentences including the pattern selected by the user.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to search technology, and in particular to an example sentence search device for looking up usage examples of words and phrases, and to an example sentence search method applied to it.

When writing text, for instance in a foreign language, consulting example sentences for a word or phrase is an effective way to produce a correct document. In recent years it has become common to store documents as electronic files, and hardware performance, such as the processing speed of information processing devices and the storage capacity for electronic data, has improved. Taking advantage of this, example sentence search techniques that retrieve example sentences by keyword from the many documents stored on a server or the like have been studied. For example, as search tools based on KWIC (Keyword In Context) indexing, software is available that generates, from an entered word or phrase, a word list and a KWIC concordance showing the contexts in which the word was used.

Example sentence search can also be performed with a search engine provided by a website. In this case the user enters a keyword into the search engine, browses the character strings surrounding the keyword shown in the search results, and checks how the keyword is used. Systems are also available that perform matching on the patterns of character strings around the keyword, extract and display substrings corresponding to frequent phrases, and retrieve documents containing such a substring as example sentences (see, for example, Non-Patent Document 1).
Non-Patent Document 1: Hiromoto Fujimoto et al., "Text Mining Tool from Local Corpus: PortableKiwi", Proceedings of the 11th Annual Meeting of the Association for Natural Language Processing

However, when a user wants to investigate frequent phrases using, for example, a KWIC concordance, the user has to pick out the phrases personally from the large number of concordance lines generated from the keywords contained in the phrase. Moreover, although partial information about usage can be obtained from the text around the keyword, it can be difficult to grasp the sentence as a whole unless the concordance is displayed sentence by sentence. In a system that extracts frequent phrases using a search engine, the search target is web pages, so the search cannot be specialized to a field, the displayed results become enormous, and search efficiency is poor. Furthermore, when a phrase matching system is used, a simple keyword search considers only the context that follows the keyword, so the necessary information may not be obtained.

The present invention has been made in view of these circumstances, and its object is to provide a technique that allows a user to efficiently obtain useful information about example sentences.

One aspect of the present invention relates to an example sentence search device. This example sentence search device comprises: a pattern extraction unit that extracts word string patterns, according to a predetermined rule, from a document set designated by a user; a pattern information storage unit that stores pattern information associating each word string pattern extracted by the pattern extraction unit with sentences that are contained in documents belonging to the document set and that include the word string pattern; a search execution unit that accepts input of a search keyword and detects, from the pattern information stored in the pattern information storage unit, word string patterns that include either the search keyword or a word or phrase related to the search keyword; and a pattern information output unit that outputs at least part of the word string patterns detected by the search execution unit and the sentences associated with them.

Here, a "word string pattern" is a set of words that consists of a predetermined number of words and includes order information. Two word sets that are identical as contiguous sequences may be treated as the same word string pattern, or two word sets in which the same words appear in the same order, though not contiguously, may be treated as the same word string pattern. Accordingly, the "predetermined rule" may be any condition relating to extraction, such as the number of words, or range of numbers of words, making up a word string pattern to be extracted; whether only contiguous matches count as the same word string pattern or non-contiguous matches are also allowed; and the threshold on the frequency of appearance in the document set, that is, how many times a word string must appear to be extracted as a word string pattern. The rule may also include the technique used for extraction.
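As an informal illustration of the two notions of "same word string pattern" described above, the following Python sketch tests whether a tokenized sentence contains a pattern contiguously, or only in order with other words in between. The function names and the whitespace tokenization are assumptions made for this illustration and do not appear in the specification.

```python
def contains_contiguous(tokens, pattern):
    """True if `pattern` occurs in `tokens` as an unbroken run of words."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))


def contains_in_order(tokens, pattern):
    """True if the words of `pattern` occur in `tokens` in the same order,
    possibly with other words in between."""
    remaining = iter(tokens)
    # `word in remaining` advances the iterator, so the order is enforced.
    return all(word in remaining for word in pattern)


tokens = "The boundary brane is considered .".split()
print(contains_contiguous(tokens, ["is", "considered"]))     # True
print(contains_in_order(tokens, ["brane", "considered"]))    # True
print(contains_contiguous(tokens, ["brane", "considered"]))  # False
```

Which of the two tests is used corresponds to the choice, within the "predetermined rule", between contiguous-only and gap-tolerant patterns.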

A "word or phrase related to the search keyword" may be any word or phrase that can generally be associated with the search keyword, such as a word or phrase in a different language that has the same meaning as the search keyword, a synonym of the search keyword, or a combination of these.

Another aspect of the present invention relates to an example sentence search method. This example sentence search method comprises: a step of accepting input of a search keyword; a step of referring to pattern information stored in advance, which associates word string patterns extracted according to a predetermined rule from a user-designated document set with sentences that are contained in documents belonging to the document set and that include those word string patterns, and detecting word string patterns that include either the search keyword or a word or phrase related to the search keyword; and a step of outputting the sentences associated with at least some of the detected word string patterns.

Yet another aspect of the present invention relates to a recording medium. This recording medium records, in association with one another, a word string pattern extracted from a document set according to a predetermined rule, the frequency of appearance of the word string pattern in the document set, and sentences that are contained in documents belonging to the document set and that include the word string pattern.

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, a device, a system, and the like, are also valid as aspects of the present invention.

According to the present invention, a user can easily check information such as usage examples of a desired word or phrase.

FIG. 1 shows the overall configuration of the example sentence search device according to the present embodiment. The example sentence search device 10 includes a processor 15 that controls the example sentence search device 10 as a whole, extracts word string patterns (hereinafter also simply called patterns) from the accumulated documents, and performs search processing. The example sentence search device 10 further includes an input device 20 through which the user enters instructions, a document storage unit 70 that accumulates and stores document data, a pattern information storage unit 80 that stores information on the extracted patterns, and an output device 30 that outputs search results. The processor 15, the input device 20, the output device 30, the document storage unit 70, and the pattern information storage unit 80 exchange data with one another over a bus 90.

The example sentence search device 10 may also serve as a document creation device or an information processing device. In this case, the processor 15 may additionally execute application software that provides a document creation function or an e-mail creation function, and the input device 20 and the output device 30 handle the input data and output data, respectively, in a manner suited to that application software.

The input device 20 may be any commonly used input device, such as a keyboard, a mouse, or a trackball, or a combination of these. It is the interface through which the user enters, for example, an instruction to extract patterns from a document set stored in the document storage unit 70 or a keyword to search for. The document storage unit 70 and the pattern information storage unit 80 are selected as appropriate, according to the amount of data and the form of the search device, from hardware such as a hard disk, a reader for recording media such as a DVD (Digital Versatile Disk) or a CD (Compact Disk), or memory such as DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory).

The document storage unit 70 may be provided in, for example, a server connected to the example sentence search device 10 via a network. The document storage unit 70 stores a plurality of finished documents that belong to a category the user wants to consult, such as papers in a particular field or e-mails. For example, a user writing a paper in English may accumulate data of English papers in the same field that the user has read in the past, or may obtain and store electronic data of papers previously published in journals of the same field. In the present embodiment, patterns are extracted from a document set made up of documents belonging to the same category, such as papers or abstracts in the same field or documents of the same type, and that set is used as the target of the example sentence search. This makes it possible to search efficiently for the turns of phrase, word usage, set phrases, and the like that are characteristic of the category.

The document storage unit 70 may store data for a plurality of document sets, one per category. In this case the user selects one document set that matches, for example, the category of the document to be written, and performs the pattern extraction described later. Because document data is converted into text data before pattern extraction, the document data stored in the document storage unit 70 must be in a format that this conversion process can handle.

FIG. 2 shows the configuration of the processor 15 in more detail. The processor 15 includes a pattern extraction unit 40, which extracts the patterns contained in the document data stored in the document storage unit 70, or in the document data of the document set selected by the user, and generates a pattern information file; and a search unit 60, which searches for patterns that include, for example, a keyword entered by the user and for example sentences.

The pattern extraction unit 40 includes: a text conversion unit 42 that, following the user's instruction entered on the input device 20, reads the document data stored in the document storage unit 70 and converts it into text data; a conversion unit 44 that converts the text data into a single text file with one sentence per line; an extraction execution unit 46 that extracts frequently occurring patterns from the one-sentence-per-line text file using a predetermined algorithm; and a pattern information writing unit 48 that stores, in the pattern information storage unit 80 as a pattern information file, pattern information associating each extracted pattern with its frequency and with the sentences that include the pattern.

The search unit 60 includes: a search execution unit 62 that, following the search keyword entered by the user on the input device 20, detects patterns that include the search keyword from the data of the pattern information file stored in the pattern information storage unit 80; a pattern output unit 64 that outputs a list of the patterns that include the search keyword, together with their frequencies, to the output device 30; and an example sentence output unit 66 that, following the user's pattern selection entered on the input device 20, outputs the sentences that include the selected pattern to the output device 30 as example sentences.

Each element shown in FIG. 2 as a functional block that performs some processing can be implemented in hardware by a CPU, memory, and other LSIs, and in software by, for example, a program with language processing functions. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms, by hardware alone, by software alone, or by a combination of the two, and they are not limited to any one of these.

The extraction of word string patterns from text data performed by the extraction execution unit 46 is now described. Suppose the following English data is given.
(1) Three types of thick branes e de Sitter and Sitter brane are considered.
(2) The cases of Dirac Proca and Maxwell fields are considered.
(3) Some issues related to quantum anomaly induced effects due to matter are considered.
(4) The example of a five dimensional BF theory with a boundary brane is considered.

In this data, the contiguous word string "are considered." appears with frequency 3, and the non-contiguous word string "brane ~ considered." appears with frequency 2. These can therefore be regarded as set patterns that are used frequently in the category of documents containing these English sentences. The extraction execution unit 46 extracts such set patterns from the document data.
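The two frequencies quoted for sentences (1) to (4) can be checked with a short Python sketch. It assumes whitespace tokenization with the final period split off as its own token, and it counts the number of sentences containing a pattern; the names `tokenize` and `sentence_frequency` are hypothetical and are not part of the specification.

```python
SENTENCES = [
    "Three types of thick branes e de Sitter and Sitter brane are considered.",
    "The cases of Dirac Proca and Maxwell fields are considered.",
    "Some issues related to quantum anomaly induced effects due to matter are considered.",
    "The example of a five dimensional BF theory with a boundary brane is considered.",
]


def tokenize(sentence):
    # Assumption: split on whitespace and treat the period as its own token.
    return sentence.replace(".", " .").split()


def sentence_frequency(pattern, contiguous):
    """Number of sentences in SENTENCES that contain `pattern` (a token list)."""
    count = 0
    for tokens in map(tokenize, SENTENCES):
        if contiguous:
            n = len(pattern)
            hit = any(tokens[i:i + n] == pattern
                      for i in range(len(tokens) - n + 1))
        else:
            remaining = iter(tokens)  # consumed left to right, so order matters
            hit = all(word in remaining for word in pattern)
        count += hit
    return count


print(sentence_frequency(["are", "considered", "."], contiguous=True))     # 3
print(sentence_frequency(["brane", "considered", "."], contiguous=False))  # 2
```

Note that "branes" in sentence (1) does not match the token "brane"; the gapped pattern matches there only because "brane" itself appears later in the same sentence.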

The technique of extracting, from a set of sequences of consecutive items (words, parts of speech, attributes, and so on) as above, the item sequences that occur with a frequency at or above some threshold is called sequence pattern mining. Suppose, for example, that there is a set of item sequences "ACD", "ABC", "CBA", and "AAB". A sequence pattern mining algorithm can obtain from this set the information that the pattern "A*B" has frequency 2 and the pattern "A*C" has frequency 2. The item sequences included in a pattern can be configured to be either contiguous or non-contiguous; the example above includes non-contiguous patterns. Sequence pattern mining has been studied extensively, and any of its techniques can be adopted in the present embodiment. This allows frequent patterns to be extracted in a practical processing time.
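The frequencies in the "ACD", "ABC", "CBA", "AAB" example can be reproduced by brute force, as in the following Python sketch. It is not an efficient mining algorithm; it simply enumerates every in-order pair of items in each sequence and counts, once per sequence, how many sequences contain that pair. Reading "A*B" as "A followed by B with zero or more items in between" is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

sequences = ["ACD", "ABC", "CBA", "AAB"]
min_support = 2  # frequency threshold

support = Counter()
for seq in sequences:
    # combinations() keeps the original order, so each pair is an in-order
    # (possibly gapped) two-item pattern; set() counts it once per sequence.
    support.update(set(combinations(seq, 2)))

frequent = {"*".join(p): n for p, n in support.items() if n >= min_support}
print(frequent)
```

With a threshold of 2, the only two-item patterns that survive are "A*B" (found in "ABC" and "AAB") and "A*C" (found in "ACD" and "ABC"), each with frequency 2.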

For example, the n-gram PrefixSpan algorithm may be adopted as the sequence pattern mining technique (Taku Kudo et al., "Text Mining Using Linguistic Information", National Convention of the Association for Natural Language Processing, NLP-2002, 2002). This technique uses natural language processing tools such as chunking and dependency parsing, and extracts patterns that reflect meaning from semi-structured data.
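To give a feel for how prefix-projection mining works, here is a minimal Python sketch of plain PrefixSpan over sequences of single items. It is a simplified illustration only: it is not the n-gram PrefixSpan variant cited above, it uses no chunking or dependency parsing, and the function name, the `max_len` parameter, and the data layout are assumptions of this sketch.

```python
from collections import defaultdict


def prefixspan(sequences, min_support, max_len=3):
    """Return {pattern tuple: number of sequences containing it in order}."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs: the part
        # of each sequence that follows the first match of `prefix`.
        occurrences = defaultdict(dict)
        for sid, start in projected:
            seq = sequences[sid]
            for pos in range(start, len(seq)):
                # Remember only the first occurrence of an item per sequence.
                occurrences[seq[pos]].setdefault(sid, pos + 1)
        for item, where in occurrences.items():
            if len(where) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(where)
                if len(pattern) < max_len:
                    mine(pattern, list(where.items()))

    mine((), [(sid, 0) for sid in range(len(sequences))])
    return results


patterns = prefixspan(["ACD", "ABC", "CBA", "AAB"], min_support=2)
print(patterns[("A", "B")], patterns[("A", "C")])  # 2 2
```

Because a prefix is extended only while it remains frequent, infrequent branches are pruned early, which is what keeps this family of algorithms practical on larger corpora.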

Next, the operation of the present embodiment with the above configuration is described. FIG. 3 shows the procedure for extracting patterns and related information from the document data stored in the document storage unit 70, generating a pattern information file, and storing it in the pattern information storage unit 80. First, in response to an instruction entered by the user on the input device 20, the text conversion unit 42 reads the plurality of documents stored in the document storage unit 70 and converts them from their stored format into text data (S10). If document sets of several categories are stored in the document storage unit 70, the conversion is applied to the document set designated by the user. Next, the conversion unit 44 converts the text data into a single text file with one sentence per line (S12). The generated text file is stored in the pattern information storage unit 80.
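Step S12 can be pictured with the following Python sketch, which flattens each document's text and splits it into one sentence per line. The splitting rule (break after ".", "!" or "?" followed by whitespace) is a naive assumption for illustration; the specification does not state how sentence boundaries are detected, and abbreviations such as "e.g." would be split incorrectly by this rule.

```python
import re


def to_sentence_lines(documents):
    """Turn a list of document texts into a flat list of sentences."""
    lines = []
    for text in documents:
        flat = " ".join(text.split())  # collapse line breaks and extra spaces
        if flat:
            # Split at whitespace that follows sentence-ending punctuation.
            lines.extend(re.split(r"(?<=[.!?])\s+", flat))
    return lines


docs = ["We consider branes. They are\nthick.", "A model is studied."]
print("\n".join(to_sentence_lines(docs)))
```

Joining the returned list with newline characters gives the single one-sentence-per-line text file that the later steps read.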

Next, the extraction execution unit 46 extracts frequent patterns from the converted text file using a sequence pattern mining algorithm such as n-gram PrefixSpan (S14). The frequency threshold at which a pattern counts as a "frequent pattern", for example 2 or 5 occurrences, is set in the program in advance, and patterns extracted with a frequency at or above that threshold are stored as "frequent patterns". Likewise, the length of the word strings extracted as patterns is set in the program, for example to 2 words or more or 4 words or more. Alternatively, the word string length and the frequency threshold may be varied in combination, for example by lowering the frequency threshold for patterns consisting of long word strings. Such settings can also be made in the program.
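One possible way to combine pattern length and frequency threshold is sketched below in Python. The concrete numbers (a threshold of 5 for short patterns, 2 for patterns of 4 words or more, and a minimum length of 2 words) are taken from the examples in the text only as plausible defaults, and the function names are hypothetical.

```python
def min_frequency(num_words, base=5, long_from=4, long_threshold=2):
    """Frequency threshold for a pattern of `num_words` words.
    Longer patterns get a lower threshold (assumed policy)."""
    return long_threshold if num_words >= long_from else base


def select_frequent(counts, min_words=2):
    """Keep patterns that are long enough and frequent enough."""
    return {p: n for p, n in counts.items()
            if len(p) >= min_words and n >= min_frequency(len(p))}


counts = {
    ("are", "considered"): 7,
    ("we", "consider"): 3,
    ("in", "this", "paper", "we"): 2,
    ("considered",): 9,
}
print(select_frequent(counts))
```

Here the two-word pattern with 3 occurrences is dropped, while the four-word pattern with only 2 occurrences is kept because long patterns use the lower threshold.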

The frequent patterns extracted by the extraction execution unit 46 are stored in the pattern information storage unit 80 by the pattern information writing unit 48 (S16). At this point, each frequent pattern is written to the pattern information storage unit 80 in association with its frequency in the document set from which it was extracted and with identification information for the sentences that include it. The identification information may be, for example, a pointer to the storage area of the relevant sentence within the text file generated in S12 and saved in the pattern information storage unit 80. The pattern information storage unit 80 may store one pattern information file for each document set processed in a single pattern extraction run. The above procedure completes a database that stores the frequent patterns contained in the document set the user wants to consult, together with information such as the sentences that include them. A pattern information file generated earlier and saved in the pattern information storage unit 80 can be reused as is for example sentence searches over the same document set.
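The association written in S16 could be laid out as in the following Python sketch, where each pattern maps to its frequency and to the line numbers of the sentences that contain it in the one-sentence-per-line text file. Using line numbers as the "pointer", JSON as the file format, and plain substring matching are all assumptions of this sketch rather than details given in the specification.

```python
import json


def build_pattern_info(lines, patterns):
    """Map each pattern to its sentence frequency and sentence line numbers."""
    info = {}
    for pattern in patterns:
        # Naive substring test; a real system would match on tokens.
        line_numbers = [i for i, line in enumerate(lines) if pattern in line]
        info[pattern] = {"frequency": len(line_numbers), "lines": line_numbers}
    return info


lines = [
    "The model is considered .",
    "Two cases are considered .",
    "The limit is considered .",
]
info = build_pattern_info(lines, ["is considered", "are considered"])
print(json.dumps(info, indent=2))
```

Storing line numbers rather than copies of the sentences keeps the pattern information small and lets the example sentences be read from the text file only when a pattern is selected.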

FIG. 4 shows the processing procedure when the user performs an example sentence search, for instance while writing a document. First, the user enters the keyword to search for on the input device 20 (S20). If the pattern information storage unit 80 holds more than one pattern information file, the user specifies beforehand which pattern information file to search. The search execution unit 62 then locates the designated pattern information file in the pattern information storage unit 80 and detects the frequent patterns that include the keyword (S22). Next, the pattern output unit 64 outputs data consisting of the detected patterns and their frequencies to the output device 30 (S24).

As needed, the user selects on the input device 20, from among the output patterns, a pattern whose example sentences the user wants to check (S26). The example sentence output unit 66 then refers to the pattern information file in the pattern information storage unit 80, reads from the text file in the pattern information storage unit 80 all sentences that include the pattern, based on the identification information associated with the selected pattern, and outputs them to the output device 30 as example sentences (S26). In this way the user can learn which patterns occur frequently in documents of the category of interest and can check example sentences for each pattern.
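The search flow of FIG. 4 (keyword in, ranked patterns out, then example sentences for a chosen pattern) can be sketched in Python as follows. The in-memory data, the substring test for "pattern includes keyword", and the function names are illustrative assumptions; the specification does not prescribe a data format or matching rule.

```python
# One-sentence-per-line text and its pattern information (assumed layout).
TEXT_LINES = [
    "The model is considered .",
    "Two cases are considered .",
    "The limit is considered .",
    "Here we study branes .",
]
PATTERN_INFO = {
    "is considered": {"frequency": 2, "lines": [0, 2]},
    "are considered": {"frequency": 1, "lines": [1]},
    "we study": {"frequency": 1, "lines": [3]},
}


def find_patterns(keyword):
    """Patterns containing the keyword, most frequent first (S22, S24)."""
    hits = [(p, e["frequency"]) for p, e in PATTERN_INFO.items() if keyword in p]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def example_sentences(pattern):
    """Sentences associated with the selected pattern."""
    return [TEXT_LINES[i] for i in PATTERN_INFO[pattern]["lines"]]


print(find_patterns("consider"))
print(example_sentences("is considered"))
```

Because the search runs over the stored pattern information rather than over the raw documents, only patterns that already passed the frequency threshold can be returned.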

FIG. 5 shows an example of the example sentence search screen displayed on a display device corresponding to the output device 30 in the present embodiment. The example sentence search screen 100 includes: a file selection command 102 for selecting among the pattern information files stored in the pattern information storage unit 80; a keyword input field 104 for entering the keyword to search for; a "Search" button 105 for instructing execution of the search; a database display field 106 that shows the name of the pattern information file selected with the file selection command 102; a pattern/frequency display field 108 that shows the patterns found by the search and their frequencies; and an example sentence display field 116 that shows example sentences including the selected pattern.

When the user selects the file selection command 102 with, for example, a mouse cursor belonging to the input device 20, a pull-down menu lists the names of the pattern information files stored in the pattern information storage unit 80. When the user selects one of them, the name of that pattern information file is shown in the database display field 106, and the file becomes the search target of the search execution unit 62. In the example of FIG. 5, the pattern information file named "paper" is selected.

Next, in S20 of FIG. 4, the user enters a keyword in the keyword input field 104. In the example of FIG. 5, the word "consider" has been entered. When the user confirms the input with the "Search" button 105, the search execution unit 62 detects, in the pattern information file named "paper" stored in the pattern information storage unit 80, all patterns that include "consider" and are stored as frequent patterns. The pattern output unit 64 then shows the detected patterns in the pattern display field 110 of the pattern/frequency display field 108 and the frequency of each pattern in the frequency display field 112. In the example of FIG. 5, patterns that include "consider", such as "we consider the" and "considering the", are listed in the pattern display field 110 in order of frequency.

Next, in S26 of FIG. 4, the user selects a pattern from those shown in the pattern/frequency display field 108, using a mouse cursor or the like. In the example of FIG. 5, the pattern "is considered" is enclosed in a frame 114, indicating that it is selected. The example sentence output unit 66 then reads, from the text file of the documents stored in the pattern information storage unit 80, all the text that includes the selected pattern, sentence by sentence, and shows it in the example sentence display field 116. To make clear where the selected pattern appears within each sentence, the pattern may be enclosed in a frame 118 or shown in bold.
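Marking where the selected pattern occurs in an example sentence can be approximated in plain text, as in this Python sketch. Square brackets stand in for the frame 118 or bold type of the actual screen; the function name and the case-insensitive, whole-word matching are assumptions of this illustration.

```python
import re


def highlight(sentence, pattern, left="[", right="]"):
    """Wrap each whole-word occurrence of `pattern` in marker strings."""
    regex = re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)
    # Re-insert the matched text itself so its original casing is preserved.
    return regex.sub(lambda m: left + m.group(0) + right, sentence)


print(highlight("The limit is considered here.", "is considered"))
# The limit [is considered] here.
```

The word-boundary anchors assume the pattern begins and ends with a word character, so a pattern ending in punctuation would need a different boundary rule.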

The present embodiment was actually applied to two document sets: an English abstract corpus and the proceedings of an international conference. The English abstract corpus contains 65,889 English sentences in total and 8.0 megabytes of data; the international conference proceedings contain 45,115 English sentences in total and 4.6 megabytes of data. For both document sets, the pattern information generation of FIG. 3 completed within several tens of seconds, after which the search processing of FIG. 4 could be performed, confirming that the approach is sufficiently practical.

Tables 1 and 2 show example search results when the present embodiment is applied with a set of paper abstracts as the search target. The pattern extraction conditions were a frequency threshold of 5 and patterns consisting of contiguous word strings of at least 2 and at most 6 words. Table 1 shows the patterns and frequencies displayed in the pattern/frequency display field 108 when "consider" is the search keyword, and Table 2 shows those displayed when "study" is the search keyword. Here, the program normalizes all inflected forms in the documents, such as the past form "considered", to their base form. Whether to normalize may, for example, be decided automatically according to the amount of documents to be searched, or may be made selectable by the user.
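The normalization of inflected forms mentioned above might look like the following Python sketch. The specification does not say how normalization is performed; the small hand-written lookup table here is purely an illustrative assumption, and a practical system would more likely rely on a morphological analyzer or lemmatizer. Lowercasing is also an assumption of this sketch.

```python
# Hypothetical lookup table from inflected forms to base forms.
BASE_FORMS = {
    "considered": "consider",
    "considering": "consider",
    "considers": "consider",
    "studied": "study",
    "studies": "study",
    "studying": "study",
}


def normalize(tokens):
    """Lowercase each token and replace known inflected forms by base forms."""
    return [BASE_FORMS.get(t.lower(), t.lower()) for t in tokens]


print(normalize("The cases are considered".split()))
# ['the', 'cases', 'are', 'consider']
```

After normalization, patterns such as "is considered" and "is considering" are counted under the same normalized word string, which raises frequencies in small document sets at the cost of losing the distinction between forms.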

Tables 1 and 2 show that, in this set of paper abstracts, both "consider" and "study" are frequently used in passive constructions. The two words are sometimes used with almost the same meaning, but by referring to example sentences containing a pattern selected from Table 1 or Table 2, the user can decide which expression to use.

According to the embodiment described above, pattern information is generated from a document set in the category the user wants to consult, and the keyword search is restricted to that pattern information. This makes it easy to exclude rarely used patterns and to obtain the frequent patterns that are standard within that set. The user can therefore efficiently acquire knowledge such as category-dependent usage (for example, two idioms that generally have the same meaning, one of which is hardly used in a given category), grammatical tendencies, category-specific phrasing, fixed expressions, and the contexts in which they are used, and can apply that knowledge to their own documents.

Categories are easy for the user to specify, and a pattern information file is easy to generate for each of them. Examples include relatively broad classes such as papers or e-mail; subdivisions of papers such as physics papers or engineering papers; and further subdivisions of physics papers such as the proceedings of a particular society or the papers published in a particular journal over the past year. Thus, in addition to the category-specific phrasing mentioned above, a flexible search function can be realized that meets fine-grained user needs, such as grasping local or temporary fashions and trends, or searching by content.

In addition, the usage of patterns in the category can easily be surveyed from information such as frequency, making it easy to grasp the nuance and usage tendencies of a keyword. Furthermore, because only the example sentences containing the selected pattern are displayed, sentence by sentence, the user can efficiently obtain just the example sentences that are needed. The user can thus examine detailed examples of frequent patterns and, by imitating them, write accurate text efficiently.

Moreover, because patterns are extracted using a sequential pattern mining algorithm, a search can be performed even for a vague keyword and a list of patterns displayed, from which the desired pattern can be identified. Because the example sentences of each pattern can be consulted during this step, the most suitable pattern is easy to select.

The present invention has been described above on the basis of an embodiment. Those skilled in the art will understand that this embodiment is illustrative, that various modifications can be made to the combinations of its constituent elements and processing steps, and that such modifications are also within the scope of the present invention.

In this embodiment, patterns containing the input keyword are detected from the pattern information file. Alternatively, a dictionary database stored on a hard disk or the like may be searched using the keyword, and patterns may then be searched using the resulting words or phrases as new keywords. This makes it possible, for example, to input a Japanese keyword and check the corresponding English patterns and their example sentences. Besides language conversion dictionaries such as Japanese-English or Japanese-French dictionaries, a synonym dictionary may be used, which enables synonym expansion for vague keywords.
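The dictionary-based variation can be sketched as a two-step lookup: expand the keyword through a dictionary, then search the pattern information with every resulting term. The sketch below is only illustrative; the function names, the in-memory `dictionary` mapping, and the pattern frequencies are assumptions, and a real system would query a dictionary database instead.

```python
def expand_keyword(keyword, dictionary):
    """Return the keyword plus any dictionary entries for it (translations or synonyms)."""
    return [keyword] + [w for w in dictionary.get(keyword, []) if w != keyword]

def search_with_expansion(patterns, keyword, dictionary):
    """Find patterns containing the keyword or any term it expands to.

    `patterns` maps a word string pattern to its frequency.
    Results are ordered by descending frequency, then alphabetically.
    """
    terms = expand_keyword(keyword, dictionary)
    hits = {}
    for pattern, freq in patterns.items():
        words = pattern.split()
        if any(term in words for term in terms):
            hits[pattern] = freq
    return sorted(hits.items(), key=lambda pf: (-pf[1], pf[0]))

# Toy data, invented for illustration only.
dictionary = {"考察": ["consider", "study"]}  # hypothetical Japanese-English entry
patterns = {"we consider": 12, "this study": 9, "in this paper": 30}
results = search_with_expansion(patterns, "考察", dictionary)
```

The same function covers synonym expansion: a synonym dictionary is just another mapping from a keyword to related terms.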

This embodiment has described a device specialized for example sentence search. Embodiments of the present invention are not limited to this; the same functions may be provided as application software that runs on a personal computer or the like in the same way as other application software. They may also be provided as a plug-in that adds the same functions to an application used for text entry, such as a document creation application or an e-mail application, so that when the user selects a pattern or example sentence, it is automatically written into the document being created.

The functions of the pattern extraction unit 40 and the search unit 60 need not be provided in the same device. For example, the pattern information file may be generated in advance by the pattern extraction unit 40 on a separate device, and a recording medium on which the file is recorded may be read and searched by a device having the functions of the search unit 60. Alternatively, the pattern information file may be downloaded over a network and used for searching.
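Separating extraction from search amounts to writing the pattern information to a file on one machine and reading it back on another. The sketch below assumes a JSON file layout (pattern mapped to its frequency and containing sentences); the source does not specify the file format, so the layout and function names are hypothetical.

```python
import json
import os
import tempfile

def write_pattern_info(path, pattern_info):
    """Save pattern information (extraction side) as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pattern_info, f, ensure_ascii=False)

def read_pattern_info(path):
    """Load pattern information (search side) from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Toy pattern information, invented for illustration only.
info = {
    "is considered": {
        "frequency": 2,
        "sentences": [
            "This problem is considered difficult.",
            "The method is considered robust.",
        ],
    }
}

# A temporary directory stands in for the recording medium or download location.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "patterns.json")
    write_pattern_info(path, info)
    loaded = read_pattern_info(path)
```

Because the search side needs only the file, it does not require access to the original document set or to the extraction code.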

A diagram showing the overall configuration of the example sentence search device in this embodiment.
A diagram showing in more detail the configuration of the processor in the example sentence search device of this embodiment.
A flowchart showing the procedure for extracting and storing pattern information in this embodiment.
A flowchart showing the processing procedure for performing an example sentence search in this embodiment.
A diagram showing an example of the example sentence search screen displayed on the example sentence search device in this embodiment.

Explanation of symbols

10 example sentence search device, 15 processor, 20 input device, 30 output device, 40 pattern extraction unit, 42 text conversion unit, 44 conversion unit, 46 extraction execution unit, 48 pattern information writing unit, 60 search unit, 62 search execution unit, 64 pattern output unit, 66 example sentence output unit, 70 document storage unit, 80 pattern information storage unit.

Claims

An example sentence search device comprising:
a pattern extraction unit that extracts a word string pattern, according to a predetermined rule, from a document set designated by a user;
a pattern information storage unit that stores pattern information in which the word string pattern extracted by the pattern extraction unit is associated with a sentence that is contained in a document belonging to the document set and that includes the word string pattern;
a search execution unit that receives input of a search keyword and detects, from the pattern information stored in the pattern information storage unit, a word string pattern including the search keyword or a phrase related to the search keyword; and
a pattern information output unit that outputs at least part of the word string pattern detected by the search execution unit and the sentence associated with it.

The example sentence search device according to claim 1, wherein the pattern extraction unit extracts the word string pattern using a sequential pattern mining technique.

An example sentence search device comprising:
a pattern information storage unit that stores pattern information in which a word string pattern extracted from a document set according to a predetermined rule is associated with a document that belongs to the document set and includes the word string pattern;
a search execution unit that receives input of a search keyword and detects, from the pattern information stored in the pattern information storage unit, a word string pattern including the search keyword or a phrase related to the search keyword; and
a pattern information output unit that outputs at least part of the word string pattern detected by the search execution unit and the sentence associated with it.

The example sentence search device according to claim 3, wherein the pattern information storage unit stores a plurality of pieces of the pattern information corresponding to a plurality of the document sets, and
the search execution unit receives a selection input from among the plurality of pieces of pattern information and detects a word string pattern including the keyword from the selected pattern information.

The example sentence search device according to claim 1, wherein the pattern information output unit comprises:
a pattern output unit that outputs the word string pattern detected by the search execution unit; and
a sentence output unit that accepts a selection input from among the word string patterns output by the pattern output unit and outputs the sentence associated with the selected word string pattern.

The example sentence search device according to claim 5, wherein, in the pattern information stored in the pattern information storage unit, the word string pattern is further associated with the appearance frequency of the word string pattern in the document set, and
the pattern output unit further outputs the appearance frequency of each word string pattern detected by the search execution unit.

An example sentence search method comprising:
receiving input of a search keyword;
referring to pre-stored pattern information that associates a word string pattern, extracted according to a predetermined rule from a document set designated by a user, with a sentence that is contained in a document belonging to the document set and that includes the word string pattern, and detecting a word string pattern including the search keyword or a phrase related to the search keyword; and
outputting the sentence associated with at least some of the detected word string patterns.

A computer program causing a computer to realize:
a function of storing pattern information in which a word string pattern extracted according to a predetermined rule from a document set designated by a user is associated with a sentence that is contained in the document set and that includes the word string pattern;
a function of accepting input of a search keyword and detecting, from the pattern information, a word string pattern including the search keyword or a phrase related to the search keyword; and
a function of outputting at least part of the detected word string pattern and the sentence associated with it.

A recording medium that records, in association with one another, a word string pattern extracted from a document set according to a predetermined rule, the appearance frequency of the word string pattern in the document set, and a sentence that is contained in a document belonging to the document set and that includes the word string pattern.