JP6782644B2

JP6782644B2 - Information processing systems, information processing methods, and computer programs

Info

Publication number: JP6782644B2
Application number: JP2017008054A
Authority: JP
Inventors: 羽翔毛
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2017-01-20
Filing date: 2017-01-20
Publication date: 2020-11-11
Anticipated expiration: 2037-01-20
Also published as: CN110235127B; JP2018116591A; CN110235127A; WO2018135023A1

Description

The present invention relates to data processing technology, and in particular to an information processing system, an information processing method, and a computer program.

To efficiently search a large number of documents for those that match a keyword, morphological analysis may be performed in advance on the documents in the search population to create index data for searching.

Japanese Unexamined Patent Application Publication No. 2002-297589 (JP-A-2002-297589)

Morphological analysis requires a dictionary, prepared in advance, in which words and the like are recorded. The dictionary is required to omit few words, that is, to be comprehensive. In particular, when the search targets are technical documents such as patent documents, which often contain many technical terms, it has not been easy to ensure the comprehensiveness of the dictionary manually.

The present invention has been made in view of the above problem, and its main object is to provide a technique that supports the realization of a suitable dictionary for natural language processing such as morphological analysis.

To solve the above problem, an information processing system according to one aspect of the present invention includes: a dictionary storage unit that stores a dictionary used in natural language processing by a computer, the dictionary storing subject words, which are words whose meaning is established on their own; a phrase extraction unit that extracts a plurality of phrases from the character string of a first document, each phrase being a character string that remains after excluding at least the subject words already stored in the dictionary; a subject word extraction unit that, when the same character string is present at the beginning of a predetermined number or more of the phrases extracted by the phrase extraction unit, extracts that character string as a subject word; and a dictionary update unit that stores the subject word extracted by the subject word extraction unit in the dictionary.

Another aspect of the present invention is an information processing method. In this method, an information processing device that stores a dictionary used in natural language processing by a computer, the dictionary storing subject words, which are words whose meaning is established on their own, executes: a phrase extraction step of extracting a plurality of phrases from the character string of a first document, each phrase being a character string that remains after excluding at least the subject words already stored in the dictionary; a subject word extraction step of, when the same character string is present at the beginning of a predetermined number or more of the phrases extracted in the phrase extraction step, extracting that character string as a subject word; and a step of storing the subject word extracted in the subject word extraction step in the dictionary.

Any combination of the above components, and any conversion of the expression of the present invention among a program, a recording medium storing the program, and the like, are also effective as aspects of the present invention.

According to the present invention, it is possible to support the realization of a suitable dictionary for natural language processing such as morphological analysis.

A diagram showing the configuration of the information processing system of the embodiment.
A block diagram showing the functional configuration of the dictionary creation device of FIG. 1.
A diagram schematically showing the character string of a phrase.
FIGS. 4(a), 4(b), and 4(c) are diagrams showing an example of phrase division.
A diagram showing an example of a plurality of second phrases output from the phrase division unit.
A diagram showing an example of the subject word dictionary.
A diagram showing an example of the incidental word dictionary.
A block diagram showing the functional configuration of the natural language processing device of FIG. 1.
A diagram showing an example of candidate entries.
A diagram showing an example of a character string subject to word extraction processing.
A flowchart showing the operation of the dictionary creation device.
A flowchart showing the operation of the natural language processing device.

First, an overview of the embodiment is given. The information processing system of the embodiment supports natural language processing, for example morphological analysis, of patent documents written in various languages. A dictionary used in natural language processing by a computer is required to have no omissions in the words it records, that is, to be comprehensive. However, patent documents contain many technical terms, and it is difficult for a manually created dictionary to provide sufficient comprehensiveness. The embodiment therefore proposes a first technique that removes human involvement from the process of creating a dictionary for natural language processing from a large number of patent documents, and automatically creates a highly comprehensive dictionary. The embodiment also proposes a second technique that, as natural language processing using the dictionary automatically created by the first technique, extracts words from patent documents with high accuracy.

FIG. 1 shows the configuration of the information processing system 10 of the embodiment. The information processing system 10 includes a patent document DB 12, a dictionary creation device 14, and a natural language processing device 16. The devices in FIG. 1 are connected via a communication network 18 that includes a LAN, a WAN, and the Internet.

The patent document DB 12 is a database server that holds data of a plurality of patent documents, for example data of a large number of published patent applications. The dictionary creation device 14 is an information processing device that creates the dictionary used in morphological analysis processing by the natural language processing device 16 (hereinafter also referred to as the "morpheme dictionary"), and may be, for example, a PC or a server.

The natural language processing device 16 uses the morpheme dictionary created by the dictionary creation device 14 to execute morphological analysis processing on the patent document data held in the patent document DB 12. Specifically, the natural language processing device 16 extracts words from each piece of patent document data and transmits extracted word data, which includes a list of the words extracted from each patent document, to a predetermined external device (not shown).

The device to which the extracted word data is transmitted may be a device that creates index data for patent search based on the extracted word data of each patent document. It may also be a device that uses the index data to provide a patent search service to a user terminal (not shown). The natural language processing device 16 may further include a function for creating index data and/or a function for providing a patent search service, and may be implemented, for example, as a patent search device.

There is no limit on the physical number of each device shown in FIG. 1. For example, the functions of the dictionary creation device 14 of the embodiment may be executed in a distributed manner by a plurality of devices, and the functions of the natural language processing device 16 may likewise be executed in a distributed manner by a plurality of devices. A single information processing device may also provide both the functions of the dictionary creation device 14 and the functions of the natural language processing device 16 of the embodiment.

FIG. 2 is a block diagram showing the functional configuration of the dictionary creation device 14 of FIG. 1. The dictionary creation device 14 includes a control unit 20, a storage unit 22, and a communication unit 24. The control unit 20 executes various kinds of data processing for creating the morpheme dictionary. The storage unit 22 is a storage area for data that is updated or referenced by the control unit 20. The communication unit 24 communicates with external devices according to a predetermined communication protocol. The control unit 20 exchanges data with the patent document DB 12 and the natural language processing device 16 via the communication unit 24.

Each block shown in the block diagrams of this specification can be realized in hardware by elements and mechanical devices such as the CPU and memory of a computer, and in software by a computer program or the like; what is drawn here are the functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software. For example, a computer program including modules corresponding to the blocks in the control unit 20 may be installed in the storage of the dictionary creation device 14 via a predetermined recording medium or a network. The CPU of the dictionary creation device 14 may then read the computer program from the storage into main memory as appropriate and execute it, thereby providing the functions corresponding to the blocks in the control unit 20.

The storage unit 22 includes a dictionary storage unit 26. The dictionary storage unit 26 stores two types of morpheme dictionaries: a subject word dictionary and an incidental word dictionary. The subject word dictionary stores words whose meaning is established on their own (hereinafter also referred to as "subject words"). A subject word is typically an independent word (a noun, verb, adjective, or the like); examples include character strings such as 「食堂」 (cafeteria), 「特許」 (patent), and 「国家」 (nation). In the embodiment, subject words are restricted to two or more characters.

The incidental word dictionary stores words that are used attached to subject words (hereinafter also referred to as "incidental words"). An incidental word is typically a dependent word (a particle or an auxiliary verb). Examples of incidental words include 「を」, 「は」, 「です」, the 「部」 in 「開発部」 (development department), the 「会」 in 「懇談会」 (round-table meeting), and the 「の」 in 「貴方の」 (your). An incidental word may be a word that forms a clause only when combined with a subject word. An incidental word may also be a word whose meaning is established (in other words, determined) only when combined with a subject word.

The control unit 20 includes a phrase reading unit 28, a phrase division unit 30, a subject word extraction unit 32, a dictionary update unit 34, an incidental word extraction unit 36, and a dictionary output unit 38. The phrase reading unit 28 and the phrase division unit 30 cooperate to function as a phrase extraction unit: from the character strings described in patent documents, they identify as phrases the character strings that remain after excluding at least the subject words stored in the subject word dictionary, and extract a plurality of phrases.

The phrase reading unit 28 reads the character strings described in each of the plurality of patent documents stored in the patent document DB 12. The phrase reading unit 28 repeats two operations: extracting, as a first phrase, the character string from the current processing position in the character string (in the initial state, for example, the first character) up to a predetermined delimiter (a Japanese full stop 「。」, a Japanese comma 「、」, a period, a comma, or the like), and advancing the processing position by the length of the extracted phrase.
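The reading loop described above can be illustrated with a minimal Python sketch. The function name `read_first_phrases` and the exact delimiter set `DELIMITERS` are assumptions made for illustration; the description only says "a predetermined delimiter", and it does not state whether the delimiter itself is kept in the phrase (here it is dropped, and empty phrases are skipped).

```python
# Assumed delimiter set: Japanese full stop and comma, full-width and ASCII period and comma.
DELIMITERS = "。、．，.,"


def read_first_phrases(text: str) -> list[str]:
    """Cut text into first phrases at each delimiter (hypothetical helper)."""
    phrases = []
    pos = 0  # current processing position, initially the first character
    while pos < len(text):
        end = pos
        # Scan forward until a delimiter or the end of the text is reached.
        while end < len(text) and text[end] not in DELIMITERS:
            end += 1
        if end > pos:  # assumption: empty phrases are not emitted
            phrases.append(text[pos:end])
        pos = end + 1  # advance past the extracted phrase and its delimiter
    return phrases
```

For example, `read_first_phrases("日本語振興会は、特許を検索する。")` yields the two first phrases 「日本語振興会は」 and 「特許を検索する」.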

The phrase division unit 30 refers to the subject word dictionary and the incidental word dictionary and, from each first phrase extracted by the phrase reading unit 28, extracts a plurality of second phrases obtained by dividing that first phrase. The phrase division unit 30 identifies as second phrases the character strings that remain after excluding at least the subject words already stored in the subject word dictionary, and extracts a plurality of second phrases. The phrase division unit 30 of the embodiment outputs as second phrases both the phrases that are subject words already stored in the subject word dictionary and the character strings that remain after excluding those subject words.

The processing of the phrase division unit 30 is described with reference to FIG. 3. One box in FIG. 3 represents one character. X is an arbitrary character, b is a character of a subject word stored in the subject word dictionary, and C is a character of an incidental word stored in the incidental word dictionary. (1) The phrase division unit 30 holds a set Ps of positions to search next (initially containing only the head position 0). Ps is a set of information items, each indicating a position at which to determine whether a subject word is present. An element of Ps (that is, information indicating a position) may be information indicating a byte offset within the entire document, or information indicating the distance from the beginning of the first phrase (in characters, bytes, or the like). In FIG. 3, Ps = {P0, P1}.

(2) The phrase division unit 30 searches for the next word starting from the smallest position in Ps (P0 in FIG. 3). In FIG. 3, the subject word "bbbb" is found. (3) The phrase division unit 30 outputs the character string from P0 up to the found subject word ("XX" in FIG. 3) as a phrase, and also outputs the found subject word "bbbb" as a phrase. In the example of FIG. 3, "XX" corresponds to a character string obtained by excluding the subject word from the first phrase.

If one or more subject words are found in (2), steps (4) and (5) below are executed for every subject word found, either sequentially or in parallel. (4) The phrase division unit 30 adds to Ps the position immediately after the end of the subject word (in FIG. 3, the position immediately after "bbbb", which is the position of the character "C", that is, P2). (5) If the subject word in (4) is immediately followed by an incidental word, the phrase division unit 30 also adds to Ps the position immediately after the end of that incidental word (in FIG. 3, the position of the character "X", that is, P3). (6) If no subject word is found in (2), the phrase division unit 30 outputs the character string from the current processing position (for example, P0) to the end of the character string as a phrase. (7) Steps (2) to (6) are repeated until Ps becomes empty. For example, when the processing of P0 is finished, Ps = {P1, P2, P3}, and the next execution of (2) searches for the next word from P1.

FIGS. 4(a), 4(b), and 4(c) show an example of phrase division. Here, it is assumed that 「日本」 and 「振興」 are recorded in the subject word dictionary and that 「語」 and 「会」 are recorded in the incidental word dictionary. FIG. 4(a) shows the first phrase, which is the original data before division. Because 「日本」 and 「振興」 exist in the subject word dictionary, the phrase division unit 30 recognizes them as subject words and outputs them as second phrases. The phrase division unit 30 also searches from the position of 「語」; because the next subject word starts at 「振」, it outputs the character string from 「語」 up to just before 「振」, that is, 「語」, as a second phrase. Because 「語」 is an incidental word, the next subject word is searched for from both the position of 「語」 and the position of 「振」. Similarly, because 「会」 is an incidental word, the next subject word is searched for from both the position of 「会」 and the position of 「日」 (FIG. 4(b)).

Because the character string from 「会」 onward contains no subject word registered in the dictionary, the character string from the search position to the end of the string is output as a second phrase. FIG. 4(c) shows the five second phrases output from the phrase division unit 30. When a plurality of subject words are detected at the same position, the phrase division unit 30 executes the above processing for every detected word. In that case, the extracted second phrases include second phrases that partially overlap one another.
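Steps (1) to (7) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `find_next_subject_words` and `split_phrase` are hypothetical, positions are character offsets from the start of the first phrase, empty gaps are not output, and each (start, end) span is output only once even if it is reached from two search positions. The input string in the usage note is also a made-up example, not the string of FIG. 4.

```python
def find_next_subject_words(text, start, subject_words):
    """Return (position, words) for the earliest subject word at or after start."""
    best, found = None, []
    for word in sorted(subject_words):  # sorted only to make the order deterministic
        i = text.find(word, start)
        if i == -1:
            continue
        if best is None or i < best:
            best, found = i, [word]
        elif i == best:
            found.append(word)  # several subject words begin at the same position
    return best, found


def split_phrase(first_phrase, subject_words, incidental_words):
    """Divide a first phrase into second phrases using the position set Ps."""
    pending = {0}      # Ps: positions still to be searched, step (1)
    emitted = set()    # assumption: each (start, end) span is output once
    second_phrases = []

    def emit(start, end):
        if start < end and (start, end) not in emitted:
            emitted.add((start, end))
            second_phrases.append(first_phrase[start:end])

    while pending:                                   # step (7)
        pos = min(pending)                           # step (2): smallest position in Ps
        pending.remove(pos)
        hit, words = find_next_subject_words(first_phrase, pos, subject_words)
        if hit is None:
            emit(pos, len(first_phrase))             # step (6): rest of the string
            continue
        emit(pos, hit)                               # step (3): text before the subject word
        for word in words:
            end = hit + len(word)
            emit(hit, end)                           # step (3): the subject word itself
            pending.add(end)                         # step (4)
            for inc in incidental_words:
                if first_phrase.startswith(inc, end):
                    pending.add(end + len(inc))      # step (5)
    return second_phrases
```

With the dictionaries of the FIG. 4 example and the hypothetical input 「日本語振興会議」, `split_phrase("日本語振興会議", {"日本", "振興"}, {"語", "会"})` returns 「日本」, 「語」, 「振興」, 「会議」, 「議」. The last two overlap because the search resumes both at 「会」 and just after it.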

Returning to FIG. 2, when the same character string is present at the beginning of a predetermined number or more of the second phrases output from the phrase division unit 30, the subject word extraction unit 32 extracts that character string as a subject word candidate. Specifically, the subject word extraction unit 32 sorts the second phrases output from the phrase division unit 30 in character code order. When the leading character string is common to a plurality of second phrases, the subject word extraction unit 32 outputs that leading character string as a subject word candidate. In other words, when the same character code values continue for two or more characters from the beginning of each of a plurality of second phrases, the character string represented by those character code values is treated as a subject word candidate.

FIG. 5 shows an example of a plurality of second phrases output from the phrase division unit 30. In the example of FIG. 5, each of 「日本」 (Japan), 「日本大学」 (Nihon University), and 「日本料理」 (Japanese cuisine) appears repeatedly at the beginning of at least a predetermined threshold number of phrases (five in the embodiment). The subject word extraction unit 32 therefore extracts 「日本」, 「日本大学」, and 「日本料理」 as subject word candidates. The threshold may be set to an appropriate value based on the knowledge of the system developer, experiments using the information processing system 10, and the like. For example, the threshold may be lowered when comprehensiveness of subject words is the priority, and raised when a certain degree of accuracy is required along with comprehensiveness.
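One way to realize this candidate extraction is sketched below in Python. The description sorts the second phrases by character code and compares leading characters; this sketch instead counts every prefix of two or more characters, which identifies the same shared prefixes. The function name `extract_subject_candidates` is hypothetical. The filter that drops a prefix when all of its phrases continue with the same next character (for example 「日本大」 when every such phrase continues as 「日本大学」) is an assumption added so that the output matches the FIG. 5 example; the source does not spell out this rule.

```python
from collections import Counter


def extract_subject_candidates(phrases, min_count=5, min_len=2):
    """Return prefixes shared by at least min_count phrases (hypothetical helper)."""
    # Count how many phrases start with each prefix of min_len or more characters.
    prefix_counts = Counter()
    for phrase in phrases:
        for n in range(min_len, len(phrase) + 1):
            prefix_counts[phrase[:n]] += 1

    # For each prefix, record the largest count among its one-character extensions.
    best_extension = Counter()
    for prefix, count in prefix_counts.items():
        parent = prefix[:-1]
        best_extension[parent] = max(best_extension[parent], count)

    # Keep prefixes that are frequent enough and that not every phrase
    # extends with the same next character (assumed filter).
    return sorted(
        prefix
        for prefix, count in prefix_counts.items()
        if count >= min_count and best_extension[prefix] < count
    )
```

Given five phrases that begin with 「日本大学」 and five that begin with 「日本料理」, each followed by a different particle, the function returns 「日本」, 「日本大学」, and 「日本料理」, mirroring the FIG. 5 example. Lowering `min_count` trades accuracy for coverage, as the paragraph above notes.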

The phrase division unit 30 or the subject word extraction unit 32 may persistently store information indicating the plurality of second phrases (referred to here as the "extracted phrase list") in the storage unit 22. The extracted phrase list may accumulate second phrases extracted across a plurality of dictionary creation runs and a plurality of patent documents. The subject word extraction unit 32 may add the second phrases output from the phrase division unit 30 to the extracted phrase list and then, if the same character string is present at the beginning of a predetermined number or more of the second phrases recorded in the extracted phrase list, extract that character string as a subject word candidate. In this mode, subject words can be extracted with higher accuracy, because the extraction is based on second phrases accumulated across a plurality of dictionary creation runs and a plurality of patent documents.

The phrase division unit 30 may also refrain from extracting, as second phrases, subject words from the first phrase that are already stored in the subject word dictionary. Alternatively, the subject word extraction unit 32 may exclude from the subject word extraction processing those second phrases, among the second phrases output from the phrase division unit 30, that are subject words already stored in the subject word dictionary.

Returning to FIG. 2, the dictionary update unit 34 stores the subject words extracted by the subject word extraction unit 32 in the subject word dictionary of the dictionary storage unit 26. If no part of a subject word candidate extracted by the subject word extraction unit 32 that includes its beginning stands as another subject word, the dictionary update unit 34 confirms the candidate as a subject word and registers it in the subject word dictionary. When both a first character string and a second character string that contains the first character string are extracted as subject words, the dictionary update unit 34 stores the first character string in the subject word dictionary but does not store the second character string. In other words, among the subject word candidates extracted by the subject word extraction unit 32, the dictionary update unit 34 does not store in the subject word dictionary any candidate that is composed of a combination of a plurality of subject words (hereinafter also referred to as a "compound word"). As described above, to improve the comprehensiveness of searches, the subject word dictionary of the embodiment preferentially stores relatively short subject words.

For example, when both 「日本」 and 「日本料理」 are extracted as subject word candidates, the dictionary update unit 34 stores 「日本」 in the subject word dictionary, and discards (in other words, ignores) 「日本料理」 without storing it. It can be expected that 「料理」 (cuisine) will be registered in the subject word dictionary separately. If 「日本料理」 is already stored in the subject word dictionary and 「日本」 is newly extracted as a subject word candidate, the dictionary update unit 34 may store 「日本」 in the subject word dictionary and delete 「日本料理」 from it. In this case, the occurrence count of 「日本料理」 may be carried over as the occurrence count of 「日本」 (occurrence counts are described below).

FIG. 6 shows an example of the subject word dictionary. The subject word dictionary of the embodiment records a plurality of subject words, each associated with the number of times it has appeared across a plurality of documents. For a subject word candidate that is not yet stored in the subject word dictionary (that is, a new subject word candidate), the dictionary update unit 34 stores it in the subject word dictionary and sets its occurrence count to 1. For a subject word candidate that is already stored in the subject word dictionary (that is, a subject word detected in the past), the dictionary update unit 34 does not store a duplicate entry and instead increments the occurrence count of that subject word. As a modification, an index value of occurrence frequency based on the occurrence count may be recorded instead of the occurrence count itself.
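The update rules for the subject word dictionary can be sketched as follows. The dictionary is modeled as a plain Python `dict` that maps each subject word to its occurrence count; this representation and the function name `update_subject_dictionary` are assumptions for illustration. The sketch treats a candidate as a compound word when one of its proper prefixes of two or more characters is itself a known subject word, and it applies the optional deletion and count carry-over described above. How exactly a carried-over count combines with the initial count of 1 is not specified in the source; here the new entry simply takes over the carried count.

```python
def update_subject_dictionary(dictionary: dict[str, int], candidates: list[str]) -> None:
    """Apply subject word candidates to the dictionary in place (hypothetical helper)."""
    known = set(dictionary) | set(candidates)
    for cand in candidates:
        # Skip compound words: a proper prefix (2+ characters) is already a subject word.
        if any(cand[:n] in known for n in range(2, len(cand))):
            continue
        if cand in dictionary:
            dictionary[cand] += 1  # already stored: increment instead of duplicating
            continue
        # New, shorter subject word: remove longer stored entries that start with it
        # and carry over their occurrence counts (optional behavior in the description).
        inherited = 0
        for word in [w for w in dictionary if w.startswith(cand)]:
            inherited += dictionary.pop(word)
        dictionary[cand] = inherited or 1  # assumption: take over the count, else start at 1
```

For example, starting from `{"日本料理": 4}`, applying the candidates 「日本」 and 「日本大学」 leaves `{"日本": 4}`: 「日本大学」 is skipped as a compound word, and 「日本」 replaces 「日本料理」 and inherits its count.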

Returning to FIG. 2, the incidental word extraction unit 36 takes the character strings read from the plurality of patent documents, excludes the subject words stored in the subject word dictionary, and extracts as incidental word candidates those remaining character strings whose length is equal to or less than a predetermined threshold. In other words, in the character strings read from the plurality of patent documents, the incidental word extraction unit 36 extracts as incidental word candidates the character strings that lie between two subject words and whose length is equal to or less than the predetermined threshold. In addition, if the length of the substring that follows the last subject word in a character string is equal to or less than the predetermined threshold, the incidental word extraction unit 36 extracts that substring as an incidental word candidate. The threshold in the embodiment is 2. The character strings read from the patent documents may be the first phrases read by the phrase reading unit 28.

Among the extracted incidental word candidates, the incidental word extraction unit 36 determines the character strings whose extraction count ranks at or above a predetermined rank to be incidental words. Specifically, the incidental word extraction unit 36 holds a table that associates each incidental word candidate with the number of times it has been extracted, and each time it extracts a candidate it increments the extraction count associated with that candidate. The extraction count can be regarded as a value indicating the number or frequency of occurrences of the incidental word candidate across a plurality of documents. The incidental word extraction unit 36 determines candidates whose extraction counts are in the top 5% to be incidental words. The determination of incidental words may be performed when the subject word extraction processing or the subject word dictionary update processing for one or more patent documents is complete.

このように、付随語の長さの閾値（２文字等）と付随語判定閾値（上位５％等）を設けることにより、付随語抽出の網羅性を高めることができる。これらの閾値は、システム開発者の知見や、情報処理システム１０を用いた実験等に基づいて適切な値に設定されればよい。例えば、付随語の網羅性を強く求める場合、付随語判定閾値を大きく（例えば上位１０％）してもよく、網羅性とともにある程度の正確性を求める場合、付随語判定閾値を小さく（例えば上位３％）してもよい。 In this way, providing a threshold on incidental-word length (for example, two characters) and an incidental-word determination threshold (for example, the top 5%) improves the coverage of incidental-word extraction. These thresholds may be set to appropriate values based on the system developer's knowledge, experiments using the information processing system 10, and the like. For example, when coverage of incidental words is the priority, the determination threshold may be made larger (for example, the top 10%); when a certain degree of accuracy is required in addition to coverage, the determination threshold may be made smaller (for example, the top 3%).
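The top-percentage determination and its adjustable threshold might look like the sketch below. The function name, the rounding-up rule, the guarantee of at least one result, and the tie-breaking by string order are all assumptions of this illustration; the text only states that candidates in the top 5% (or 10%, or 3%) by extraction count are treated as incidental words.

```python
def select_incidental_words(counts, top_percent=5):
    """Return the candidates whose extraction count ranks in the top N percent.

    counts: mapping from candidate string to extraction count.
    top_percent: integer percentage (5 in the embodiment).
    """
    if not counts:
        return []
    # Highest count first; ties broken by the string itself (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Integer ceiling of len * top_percent / 100, keeping at least one entry.
    keep = max(1, (len(ranked) * top_percent + 99) // 100)
    return [word for word, _ in ranked[:keep]]


# Twenty hypothetical candidates with extraction counts 1..20.
sample = {f"w{i}": i for i in range(1, 21)}
strict = select_incidental_words(sample, top_percent=3)
broad = select_incidental_words(sample, top_percent=10)
```

Raising `top_percent` admits more candidates (favoring coverage), while lowering it keeps only the most frequent ones (favoring accuracy), which is the trade-off the paragraph describes.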

辞書更新部３４は、付随語抽出部３６により付随語と判定された候補文字列を付随語辞書へ格納する。図７は付随語辞書の例を示す。実施例の付随語辞書では、複数の付随語と、各付随語が複数の文書に亘って出現した回数とが対応付けて記録される。辞書更新部３４は、付随語の候補のうち付随語辞書に未格納の候補（すなわち新たな付随語の候補）を付随語辞書へ新たに格納し、その出現回数に１を設定する。その一方、辞書更新部３４は、付随語の候補のうち付随語辞書に格納済の候補（すなわち過去に検出済の付随語）については付随語辞書へ重複して格納することを抑制しつつ、当該付随語の出現回数をインクリメントする。 The dictionary update unit 34 stores the candidate character strings determined to be incidental words by the incidental word extraction unit 36 in the incidental word dictionary. FIG. 7 shows an example of the incidental word dictionary. In the incidental word dictionary of the embodiment, each incidental word is recorded in association with the number of times it has appeared across a plurality of documents. For a candidate not yet stored in the incidental word dictionary (that is, a new incidental word), the dictionary update unit 34 adds it to the dictionary and sets its occurrence count to 1. For a candidate already stored in the dictionary (that is, an incidental word detected previously), the dictionary update unit 34 does not store a duplicate and instead increments that word's occurrence count.
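The update rule for the incidental word dictionary is small enough to show directly. The sketch below assumes the dictionary is held as a plain mapping from word to occurrence count; the function name is hypothetical.

```python
def update_incidental_dictionary(dictionary, incidental_words):
    """Store new incidental words with count 1; increment existing ones.

    dictionary: dict mapping incidental word -> occurrence count (mutated).
    incidental_words: iterable of strings judged to be incidental words.
    """
    for word in incidental_words:
        if word in dictionary:
            # Already stored: avoid a duplicate entry, bump the count.
            dictionary[word] += 1
        else:
            # Newly detected: store it with an occurrence count of 1.
            dictionary[word] = 1
    return dictionary


incidental_dict = {"の": 4}
update_incidental_dictionary(incidental_dict, ["の", "を"])
```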

辞書出力部３８は、辞書更新部３４により主体語辞書と付随語辞書の少なくとも一方が更新された場合に、少なくとも更新された辞書データを所定の外部装置へ出力する。実施例では、更新された最新の辞書データを自然言語処理装置１６へ送信するが、変形例として、他の種類のサーバやストレージへ最新の主体語辞書と付随語辞書を出力してもよい。 The dictionary output unit 38 outputs at least the updated dictionary data to a predetermined external device when at least one of the subject word dictionary and the incidental word dictionary is updated by the dictionary update unit 34. In the embodiment, the latest updated dictionary data is transmitted to the natural language processing device 16, but as a modification, the latest subject word dictionary and incidental word dictionary may be output to another type of server or storage.

図８は、図１の自然言語処理装置１６の機能構成を示すブロック図である。自然言語処理装置１６は、辞書作成装置１４の制御部２０、記憶部２２、通信部２４に対応する制御部４０、記憶部４２、通信部４４を備える。制御部４０は、通信部４４を介して、特許文献ＤＢ１２および辞書作成装置１４とデータを交換する。 FIG. 8 is a block diagram showing the functional configuration of the natural language processing device 16 of FIG. 1. The natural language processing device 16 includes a control unit 40, a storage unit 42, and a communication unit 44, which correspond respectively to the control unit 20, the storage unit 22, and the communication unit 24 of the dictionary creation device 14. The control unit 40 exchanges data with the patent document DB 12 and the dictionary creation device 14 via the communication unit 44.

記憶部４２は、辞書記憶部４６と解析結果記憶部４８を含む。辞書記憶部４６は、辞書作成装置１４の辞書記憶部２６に対応し、辞書作成装置１４により作成された主体語辞書および付随語辞書を保持する。解析結果記憶部４８は、制御部４０による自然言語処理の結果を示すデータを保持する。具体的には、複数の特許文献のそれぞれから抽出された単語のリストを含む特許文献ごとの抽出単語データを保持する。 The storage unit 42 includes a dictionary storage unit 46 and an analysis result storage unit 48. The dictionary storage unit 46 corresponds to the dictionary storage unit 26 of the dictionary creation device 14 and holds the subject word dictionary and the incidental word dictionary created by the dictionary creation device 14. The analysis result storage unit 48 holds data indicating the result of natural language processing by the control unit 40. Specifically, it holds extracted word data for each patent document, including a list of the words extracted from that document.

制御部４０は、文書読込部５０、スコア設定部５２、単語抽出部５４、解析結果提供部５６を含む。文書読込部５０は、自然言語処理の対象となる文書（実施例では特許文献ＤＢ１２に格納された特許文献）から、当該文書に記載された文字列を読み込む。例えば、句点・ピリオド等をデリミタとして文単位で文字列を読み込んでもよく、句読込部２８と同様に句単位で文字列を読み込んでもよい。 The control unit 40 includes a document reading unit 50, a score setting unit 52, a word extraction unit 54, and an analysis result providing unit 56. The document reading unit 50 reads the character string described in the document from the document (patent document stored in the patent document DB 12 in the embodiment) to be processed in natural language. For example, a character string may be read in sentence units using a punctuation mark, a period, or the like as a delimiter, or a character string may be read in phrase units as in the phrase reading unit 28.

スコア設定部５２は、文書読込部５０により読み込まれた文字列中の各文字から始まる主体語を辞書記憶部４６の主体語辞書を参照して特定する。スコア設定部５２は、主体語辞書、付随語辞書、および予め定められたスコア付与基準にしたがって、各文字から始まる各主体語に対して、各主体語の属性に応じたスコアを付与する。実施例のスコア付与基準は、人間から見て自然な単語の抽出であることを目的としており、以下に示す３つの基準の組み合わせである。変形例として、いずれか１つの基準または２つの基準を用いてもよい。 The score setting unit 52 identifies the subject word starting from each character in the character string read by the document reading unit 50 with reference to the subject word dictionary of the dictionary storage unit 46. The score setting unit 52 assigns a score according to the attribute of each subject word to each subject word starting from each character according to the subject word dictionary, the accompanying word dictionary, and a predetermined score assigning standard. The scoring criteria of the examples are aimed at extracting words that are natural to humans, and are a combination of the following three criteria. As a modification, any one or two criteria may be used.

（基準１）文字列中の各文字から始まる主体語のうち、内包する付随語の個数が相対的に少ない主体語のスコアを、内包する付随語の個数が相対的に多い主体語のスコアより大きくする。内包する付随語の個数が少ない主体語ほど、スコアを大きくするよう調整してもよい。多くの付随語を内包する主体語は、本来複合語である可能性が高く、単語として抽出することが好ましくないと考えられるからである。 (Criterion 1) Among the subject words starting at each character in the character string, a subject word that contains relatively few incidental words is given a higher score than one that contains relatively many. The score may be adjusted so that it increases as the number of contained incidental words decreases. This is because a subject word containing many incidental words is likely to be a compound word, and extracting it as a single word is considered undesirable.

（基準２）文字列中の各文字から始まる主体語のうち、文字数が相対的に多い主体語のスコアを、文字数が相対的に少ない主体語のスコアより大きくする。文字数が多い主体語ほど大きなスコアになるよう調整してもよい。文字数が多い主体語は、文書内で特徴的な単語と考えられ、抽出対象として適切と考えられるからである。 (Criterion 2) Among the subject words starting at each character in the character string, a subject word with relatively many characters is given a higher score than one with relatively few. The score may be adjusted so that it increases with the number of characters. This is because a long subject word is considered characteristic of the document and therefore an appropriate extraction target.

（基準３）文字列中の各文字から始まる主体語のうち、主体語辞書で対応付けられた出現回数が相対的に多い主体語のスコアを、出現回数が相対的に少ない主体語のスコアより大きくする。出現回数が多い主体語ほど大きなスコアになるよう調整してもよい。出現回数が多い主体語は、多くの文献で使用されたものであり、抽出対象として適切と考えられるからである。 (Criterion 3) Among the subject words starting at each character in the character string, a subject word whose occurrence count in the subject word dictionary is relatively high is given a higher score than one whose occurrence count is relatively low. The score may be adjusted so that it increases with the occurrence count. This is because a frequently occurring subject word is used in many documents and is therefore considered an appropriate extraction target.

実施例のスコア設定部５２は、文書読込部５０により読み込まれた文字列中の文字ごとに１つの候補エントリを作成する。図９は候補エントリの例を示す。同図の位置「ｅ０」は、読み込まれた文字列中の処理対象位置（言い換えれば現在処理位置）を示す。同図の「ｅ０＋ｎ」（ｎは正の整数）は、処理対象位置からの距離（例えば文字数）を示している。スコア設定部５２は、各文字から始まる主体語（ここでは「単語候補」と呼ぶ。）を特定し、各単語候補の文字数を候補エントリへ設定する。またスコア設定部５２は、単語候補ごとに算出したスコアを候補エントリへ設定する。 The score setting unit 52 of the embodiment creates one candidate entry for each character in the character string read by the document reading unit 50. FIG. 9 shows an example of candidate entries. The position "e0" in the figure indicates the processing target position (in other words, the current processing position) in the read character string. The notation "e0+n" (where n is a positive integer) in the figure indicates the distance (for example, in characters) from the processing target position. The score setting unit 52 identifies the subject word starting at each character (referred to here as a "word candidate") and sets the number of characters of each word candidate in the candidate entry. The score setting unit 52 also sets the score calculated for each word candidate in the candidate entry.

スコアの算出方法を説明する。スコア設定部５２は、上記の基準１にしたがって、各単語候補の中に含まれる付随語の個数が多いほど単語候補のスコア要素（Ｐ）を小さくする。さらにスコア設定部５２は、単語候補の１文字目が付随語に該当する場合、スコア要素（Ｐ）をさらに小さくする。スコア要素（Ｐ）の算出式を以下に示す。 The method of calculating the score will now be described. In accordance with Criterion 1 above, the score setting unit 52 makes the score element (P) of a word candidate smaller as the number of incidental words contained in the word candidate increases. The score setting unit 52 further reduces the score element (P) when the first character of the word candidate is an incidental word. The formula for calculating the score element (P) is shown below.

式１の「Ｗ」は単語候補の文字列である。式１および式２の「Ｌ」は全付随語の集合である。式１の「Ｃ」は、文字ａ、ｂの付随語としての出現回数を付随語辞書から取得する関数である。したがって、「Ｃ（ａ）」、「Ｃ（ｂ）」はそれぞれ、文字ａ、ｂの付随語としての出現回数である。式２の「Ｐ０」は、単語候補の１文字目（Ｗ０）が付随語（Ｌ）である場合の特別ペナルティ値であり、０＜Ｐ０＜１の範囲で設定される。式１のβは定数である。式１内の分数は、単語候補内の付随語ごとに評価され、内包する付随語の個数が多いほどＰは小さくなり、１文字目が付随語であればＰはさらに小さくなる。また、単語候補内の付随語に対応付けられた出現回数（Ｃ（ａ））が大きいほど、言い換えれば、使用頻度が高い付随語が内包されているほどＰは小さくなる。スコア要素（Ｐ）は、付随語を内包したことに基づくペナルティを反映した値になる。 "W" in Equation 1 is the character string of the word candidate. "L" in Equations 1 and 2 is the set of all incidental words. "C" in Equation 1 is a function that obtains from the incidental word dictionary the number of times the characters a and b have occurred as incidental words. Accordingly, "C(a)" and "C(b)" are the numbers of times the characters a and b, respectively, have occurred as incidental words. "P0" in Equation 2 is a special penalty value applied when the first character (W0) of the word candidate is an incidental word (an element of L), and is set in the range 0 < P0 < 1. β in Equation 1 is a constant. The fraction in Equation 1 is evaluated for each incidental word in the word candidate: the more incidental words the candidate contains, the smaller P becomes, and if the first character is an incidental word, P becomes smaller still. In addition, the larger the occurrence count (C(a)) associated with an incidental word in the word candidate, in other words the more frequently used the contained incidental word is, the smaller P becomes. The score element (P) is thus a value that reflects a penalty for containing incidental words.

次に、スコア設定部５２は、式１で求めた単語候補のＰ（Ｗ）、単語候補の文字数（ｒ）、単語候補の出現回数（ｔ）に基づいて単語候補のスコア（Ｑ）を決定する。スコア（Ｑ）の算出式を以下に示す。 Next, the score setting unit 52 determines the score (Q) of the word candidate based on the word candidate's P(W) obtained from Equation 1, its number of characters (r), and its number of occurrences (t). The formula for calculating the score (Q) is shown below.

式３の関数ｆは、３つのパラメータ、すなわち単語候補のＰ（Ｗ）、単語候補の文字数（ｒ）、単語候補の出現回数（ｔ）と、単語候補のスコア（Ｑ）とが正相関するように単語候補のスコア（Ｑ）を決定する関数である。スコア設定部５２は、式３にしたがって、単語候補のＰ（Ｗ）が大きいほど、かつ、単語候補の文字数（ｒ）が多いほど、かつ、単語候補の出現回数（ｔ）が多いほど、単語候補のスコア（Ｑ）を大きく算出する。例えば、式３の関数ｆは、単語候補のＰ（Ｗ）、単語候補の文字数（ｒ）、単語候補の出現回数（ｔ）を乗じる関数であってもよい。すなわち「Ｑ＝Ｐ（Ｗ）×ｒ×ｔ」であってもよい。 The function f in Equation 3 determines the score (Q) of a word candidate so that Q is positively correlated with each of three parameters: the word candidate's P(W), its number of characters (r), and its number of occurrences (t). In accordance with Equation 3, the score setting unit 52 computes a larger score (Q) the larger P(W) is, the more characters (r) the word candidate has, and the more often (t) it occurs. For example, the function f may simply multiply the three parameters, that is, Q = P(W) × r × t.
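As an illustration of how candidate entries could be filled using the product form Q = P(W) × r × t, consider the sketch below. It is not the embodiment's code: the `CandidateEntry` type and `build_entries` function are hypothetical, the penalty P(W) is supplied by the caller as a function (the formula for P is not reproduced here), and choosing the longest subject word at each position is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CandidateEntry:
    position: int   # index of the character this entry belongs to
    length: int     # characters in the word candidate (0 if none starts here)
    score: float    # Q for the word candidate (0.0 if none starts here)


def build_entries(text, subject_counts, penalty):
    """Create one candidate entry per character of text.

    subject_counts: dict mapping subject word -> occurrence count (t).
    penalty: callable returning P(W) for a word; supplied by the caller.
    """
    entries = []
    for i in range(len(text)):
        words = [w for w in subject_counts if w and text.startswith(w, i)]
        if not words:
            entries.append(CandidateEntry(i, 0, 0.0))
            continue
        word = max(words, key=len)  # assumption: take the longest match
        q = penalty(word) * len(word) * subject_counts[word]  # Q = P * r * t
        entries.append(CandidateEntry(i, len(word), q))
    return entries


# Hypothetical dictionary contents; a constant penalty of 1.0 for simplicity.
entries = build_entries(
    "画像処理",
    {"画像": 3, "画像処理": 2, "処理": 5},
    penalty=lambda word: 1.0,
)
```

With these made-up counts, the entry at position 0 gets score 1.0 × 4 × 2 = 8.0 and the entry at position 2 gets 1.0 × 2 × 5 = 10.0, so a later comparison step can weigh the long candidate against the one starting inside it.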

単語抽出部５４は、スコア設定部５２により設定された各単語候補のスコアを参照して、文字列中の処理対象位置から始まる第１の単語候補に付与されたスコアと、第１の単語候補の中の２文字目以降から始まる第２の単語候補に付与されたスコアとの大小関係に応じて、第１の単語候補を単語として抽出する。実施例の単語抽出部５４は、大小関係の比較対象として、第１の単語候補の直後の文字から始まる第３の単語候補に付与されたスコアも使用する。 The word extraction unit 54 refers to the score of each word candidate set by the score setting unit 52 and extracts the first word candidate, which starts at the processing target position in the character string, as a word depending on how its score compares with the score of a second word candidate that starts at the second or a later character within the first word candidate. The word extraction unit 54 of the embodiment also includes in this comparison the score of a third word candidate that starts at the character immediately following the first word candidate.

具体的には、単語抽出部５４は、文字列中の処理対象位置から始まる単語候補（ここでは「現在候補」と呼ぶ。）に付与されたスコアを「現在候補スコア」として識別する。また、単語抽出部５４は、文字列中において現在候補の直後の文字から始まる単語候補（ここでは「次回候補」と呼ぶ。）に付与されたスコアを「次回候補スコア」として識別する。さらに、現在候補内の２文字目以降から始まる単語候補（ここでは「中間候補」と呼ぶ。）に付与されたスコアを「中間候補スコア」（ν）として識別する。 Specifically, the word extraction unit 54 identifies the score given to the word candidate (referred to as “current candidate” here) starting from the processing target position in the character string as the “current candidate score”. Further, the word extraction unit 54 identifies the score given to the word candidate (referred to as "next candidate" here) starting from the character immediately after the current candidate in the character string as the "next candidate score". Further, the score given to the word candidate (referred to as "intermediate candidate" here) starting from the second character or later in the current candidate is identified as the "intermediate candidate score" (ν).

現在候補スコアと次回候補スコアは、スコア設定部５２により算出されているため、単語抽出部５４は、中間候補スコアνを新たに算出する。中間候補スコアνの算出式を以下に示す。 Since the current candidate score and the next candidate score have already been calculated by the score setting unit 52, the word extraction unit 54 newly calculates the intermediate candidate score ν. The formula for calculating the intermediate candidate score ν is shown below.

式４のｅ０は処理対象位置の候補エントリ（すなわち現在候補）であり、｜ｅ０｜は現在候補の文字数である。Ｔは現在候補以降の全候補エントリである。Ｓは、（処理対象位置＋１）から、（処理対象位置＋｜ｅ０｜−１）までを単語の開始位置（ｉ）として、Ｔを複数の単語に分解し、各単語のスコアを合算する。式４の関数Ｓは、処理対象位置（すなわちｅ０）の単語を抽出しない場合の中間候補スコアを提供する関数と言える。 In Equation 4, e0 is the candidate entry at the processing target position (that is, the current candidate), and |e0| is the number of characters of the current candidate. T is the set of all candidate entries from the current candidate onward. S takes each position from (processing target position + 1) to (processing target position + |e0| − 1) as a word start position (i), decomposes T into a plurality of words, and sums the scores of those words. The function S in Equation 4 can thus be regarded as providing the intermediate candidate score obtained when the word at the processing target position (that is, e0) is not extracted.
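Based only on the description above, Equation 4 can be read as a maximum of S over the start positions that fall inside the current candidate. The expression below is a reconstruction from that description, not a copy of the published formula, and the symbol p for the processing target position is introduced here purely for illustration.

```latex
\nu \;=\; \max_{\,p+1 \;\le\; i \;\le\; p+|e_0|-1} S(T, i)
```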

単語抽出部５４は、現在候補スコアが０の場合、処理対象位置の文字が主体語を構成しないと認識し、処理対象位置の１文字を捨てて、処理対象位置を１つ進める。なお、現在候補スコアが０の場合、処理対象位置の文字を単語（例えば付随語）と認識して、処理対象位置の１文字を単語として抽出してもよい。 When the current candidate score is 0, the word extraction unit 54 recognizes that the characters at the processing target position do not constitute the subject word, discards one character at the processing target position, and advances the processing target position by one. When the current candidate score is 0, the character at the processing target position may be recognized as a word (for example, an incidental word), and one character at the processing target position may be extracted as a word.

単語抽出部５４は、現在候補スコアが０より大きい場合、式４にしたがってνを算出する。ν≦現在候補スコアまたはν≦次回候補スコアが成立する場合、単語抽出部５４は、処理対象位置から始まる単語（すなわち現在候補）を抽出し、現在候補の文字数分、処理対象位置を進める。一方、ν＞現在候補スコアかつν＞次回候補スコアが成立する場合、単語抽出部５４は、処理対象位置の１文字を捨てて、処理対象位置を１つ進める。現在候補の途中の文字から始まる単語を抽出した方がスコアが大きいからである。 If the current candidate score is greater than 0, the word extraction unit 54 calculates ν according to Equation 4. When ν ≦ current candidate score or ν ≦ next candidate score is established, the word extraction unit 54 extracts a word starting from the processing target position (that is, the current candidate) and advances the processing target position by the number of characters of the current candidate. On the other hand, when ν> the current candidate score and ν> the next candidate score are satisfied, the word extraction unit 54 discards one character at the processing target position and advances the processing target position by one. This is because the score is higher when the word starting from the character in the middle of the candidate is extracted.

図１０は、単語抽出処理対象の文字列の例を示す。同図は、図９に対応し、処理対象位置をｅ０で示している。単語抽出部５４は、図９のｅ０の候補エントリのスコア「１５」を現在候補スコアとして識別し、図９のｅ０＋５（ｅ０の単語候補の文字数が５であるため）の候補エントリのスコア「１１」を次回候補スコアとして識別する。そして、図１０では、図９のｅ０＋１の候補エントリのスコア「８」と、図９のｅ０＋３の候補エントリのスコア「１０」との合算値「１８」を中間候補スコア（ν）として識別する。図１０の例では、ν＞現在候補スコアかつν＞次回候補スコアが成立するため、単語抽出部５４は、処理対象位置の文字「新」を捨てて、処理対象位置を１つ進める。 FIG. 10 shows an example of a character string subject to word extraction. The figure corresponds to FIG. 9, and the processing target position is indicated by e0. The word extraction unit 54 identifies the score "15" of the candidate entry at e0 in FIG. 9 as the current candidate score, and identifies the score "11" of the candidate entry at e0+5 in FIG. 9 (because the word candidate at e0 is five characters long) as the next candidate score. In FIG. 10, the sum "18" of the score "8" of the candidate entry at e0+1 in FIG. 9 and the score "10" of the candidate entry at e0+3 in FIG. 9 is identified as the intermediate candidate score (ν). In the example of FIG. 10, ν > current candidate score and ν > next candidate score both hold, so the word extraction unit 54 discards the character 「新」 at the processing target position and advances the processing target position by one.
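The decision rule of the word extraction unit 54 can be condensed into a few lines. The function name and the string return values below are hypothetical; the conditions follow the text, and the first check uses the scores 15, 11, and 18 from the FIG. 10 example.

```python
def decide(current_score, next_score, nu):
    """Return "extract" to take the current candidate, "skip" to drop one char.

    current_score: score of the candidate starting at the processing position.
    next_score: score of the candidate starting right after the current one.
    nu: intermediate candidate score.
    """
    if current_score == 0:
        # No subject word starts here: discard one character.
        return "skip"
    if nu > current_score and nu > next_score:
        # A word starting inside the current candidate scores better.
        return "skip"
    # nu <= current_score or nu <= next_score.
    return "extract"


fig10_decision = decide(current_score=15, next_score=11, nu=18)
```

In the FIG. 10 case the result is "skip", matching the description that one character is discarded and the processing target position advances by one.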

実際には、単語抽出部５４は、現在候補の３文字目から始まる中間候補と、その中間候補の直後から始まる単語のスコアの合算値、および、現在候補の４文字目から始まる中間候補と、その中間候補の直後から始まる単語のスコアの合算値も算出する。そして式４で示すように、複数種類の合算値の中で最大の値をνの値として採用する。 In practice, the word extraction unit 54 also calculates the sum of the scores of the intermediate candidate starting at the third character of the current candidate and the word starting immediately after that intermediate candidate, as well as the sum of the scores of the intermediate candidate starting at the fourth character of the current candidate and the word starting immediately after that intermediate candidate. Then, as shown in Equation 4, the largest of these sums is adopted as the value of ν.
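A sketch of the ν calculation follows, under the reading that each sum consists of an intermediate candidate plus the word starting immediately after it. Entries are represented as hypothetical `(length, score)` tuples, one per character. In the sample data only the scores 15, 8, 10, 11 and the length 5 come from the text; the remaining lengths and the empty entries are assumed for illustration.

```python
def score_at(entries, pos):
    """Score of the entry at pos, or 0 when pos is outside the string."""
    if 0 <= pos < len(entries):
        return entries[pos][1]
    return 0


def intermediate_score(entries, pos):
    """Largest (intermediate candidate + following word) sum inside entries[pos].

    entries: list of (length, score) tuples, one per character position;
             length 0 means no word candidate starts at that position.
    """
    current_len = entries[pos][0]
    best = 0
    # Start positions from the 2nd character to the last character
    # of the current candidate.
    for start in range(pos + 1, min(pos + current_len, len(entries))):
        mid_len, mid_score = entries[start]
        if mid_len == 0:
            continue
        following = score_at(entries, start + mid_len)
        best = max(best, mid_score + following)
    return best


# Positions e0 .. e0+6; lengths other than the first are assumed values.
sample_entries = [(5, 15), (2, 8), (0, 0), (3, 10), (0, 0), (2, 11), (0, 0)]
nu = intermediate_score(sample_entries, 0)
```

With these assumed lengths, the candidate at e0+1 (score 8) is followed by the one at e0+3 (score 10), giving 18, while the candidate at e0+3 on its own contributes only 10, so ν is 18.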

単語抽出部５４は、或る特許文献から読み込まれた文字列から抽出した複数の単語をその特許文献に対応付けて、解析結果記憶部４８の抽出単語データの中へ記録する。解析結果提供部５６は、解析結果記憶部４８に記憶された抽出単語データであり、すなわち、各特許文献の文字列を単語に分解したデータを、定期的または要求に応じて、予め定められた外部装置（例えば特許文献の検索装置）へ送信する。 The word extraction unit 54 associates the words extracted from the character string read from a given patent document with that patent document and records them in the extracted word data of the analysis result storage unit 48. The analysis result providing unit 56 transmits the extracted word data stored in the analysis result storage unit 48, that is, the data obtained by decomposing the character string of each patent document into words, to a predetermined external device (for example, a patent document search device) periodically or on request.

以上の構成による情報処理システム１０の動作を説明する。 The operation of the information processing system 10 configured as described above will now be described.

図１１は、辞書作成装置１４の動作を示すフローチャートである。辞書作成装置１４は、ユーザからの指示を受け付けた場合、および／または、予め定められたタイミングで定期的に、図１１に示す辞書作成処理を実行する。辞書作成処理の開始時点では、辞書記憶部２６の主体語辞書と付随語辞書はいずれも空である。すなわち、辞書作成処理の初期状態において、主体語と付随語を予め定めた辞書は不要である。 FIG. 11 is a flowchart showing the operation of the dictionary creation device 14. The dictionary creation device 14 executes the dictionary creation process shown in FIG. 11 when it receives an instruction from the user and/or periodically at predetermined times. At the start of the dictionary creation process, both the subject word dictionary and the incidental word dictionary in the dictionary storage unit 26 are empty. In other words, the initial state of the dictionary creation process does not require a dictionary in which subject words and incidental words have been defined in advance.

辞書作成装置１４の句読込部２８は、特許文献ＤＢ１２から複数の特許文献を取得し、それぞれの特許文献から第１の句を読み込む（Ｓ１０）。句分割部３０は、第１の句が示す文字列を、辞書記憶部２６の主体語辞書に格納済の主体語を示す文字列と、その主体語を除外した文字列とに分割することにより複数の第２の句を抽出する（Ｓ１２）。なお、主体語辞書が空の場合等、第１の句において主体語が未検出であれば、句分割部３０は、第１の句をそのまま第２の句として出力する。 The phrase reading unit 28 of the dictionary creating device 14 acquires a plurality of patent documents from the patent document DB 12 and reads the first phrase from each patent document (S10). The phrase dividing unit 30 divides the character string indicated by the first phrase into a character string indicating the subject word stored in the subject word dictionary of the dictionary storage unit 26 and a character string excluding the subject word. A plurality of second phrases are extracted (S12). If the subject word is not detected in the first phrase, such as when the subject word dictionary is empty, the phrase division unit 30 outputs the first phrase as it is as the second phrase.
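Step S12 can be sketched as below, assuming that the second phrases are the strings left over once stored subject words are removed, and that the longest stored subject word is matched at each position. The function name is hypothetical.

```python
def split_by_subject_words(first_phrase, subject_words):
    """Return the second phrases: what remains after removing subject words.

    If no stored subject word occurs in first_phrase (for example when the
    dictionary is still empty), the first phrase is returned unchanged.
    """
    pieces = []
    buffer = []
    i = 0
    while i < len(first_phrase):
        match = max(
            (w for w in subject_words if w and first_phrase.startswith(w, i)),
            key=len,
            default=None,
        )
        if match:
            # A subject word ends the piece collected so far.
            if buffer:
                pieces.append("".join(buffer))
                buffer = []
            i += len(match)
        else:
            buffer.append(first_phrase[i])
            i += 1
    if buffer:
        pieces.append("".join(buffer))
    return pieces


second_phrases = split_by_subject_words("画像の処理方法", {"画像", "処理"})
```

Because the dictionaries start out empty, early documents pass through unchanged, and splitting only becomes effective as subject words accumulate.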

所定個数以上の第２の句において先頭から始まる文字列（すなわち第２の句の少なくとも一部の文字列）が共通する場合（Ｓ１４のＹ）、主体語抽出部３２は、その先頭文字列を主体語の候補として抽出する（Ｓ１６）。辞書更新部３４は、主体語抽出部３２により抽出された主体語の候補のうち、複合語を除外した候補であり、かつ、主体語辞書に未格納の新たな主体語を示す候補を主体語辞書へ格納する（Ｓ１８）。その際に辞書更新部３４は、各主体語に対応付けられた出現回数も更新する。所定個数以上の第２の句において先頭から始まる共通文字列が存在しない場合（Ｓ１４のＮ）、Ｓ１６とＳ１８の処理をスキップする。 When a predetermined number or more of the second phrases share a character string starting at their beginning (that is, a character string forming at least part of each second phrase) (Y in S14), the subject word extraction unit 32 extracts that leading character string as a subject word candidate (S16). Of the subject word candidates extracted by the subject word extraction unit 32, the dictionary update unit 34 stores in the subject word dictionary those that are not compound words and that represent new subject words not yet stored in the dictionary (S18). At that time, the dictionary update unit 34 also updates the occurrence count associated with each subject word. If no common leading character string exists across the predetermined number or more of second phrases (N in S14), steps S16 and S18 are skipped.
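One way to find leading strings shared by enough second phrases is to count every prefix and keep those that reach the threshold. In the sketch below the threshold of 3, the minimum prefix length of 2, and the rule that drops a prefix when a longer prefix is shared by exactly the same number of phrases are all assumptions of this illustration; the text only requires that a predetermined number of phrases share the leading string.

```python
from collections import Counter


def extract_subject_candidates(second_phrases, min_count=3, min_len=2):
    """Return leading strings shared by at least min_count second phrases."""
    prefix_counts = Counter()
    for phrase in second_phrases:
        for end in range(min_len, len(phrase) + 1):
            prefix_counts[phrase[:end]] += 1
    shared = {p: c for p, c in prefix_counts.items() if c >= min_count}
    # Drop a prefix when a longer shared prefix has the same count, since
    # the shorter one is then just a truncation of the longer one.
    maximal = [
        p for p, c in shared.items()
        if not any(q != p and q.startswith(p) and shared[q] == c for q in shared)
    ]
    return sorted(maximal)


candidates = extract_subject_candidates(
    ["半導体装置", "半導体基板", "半導体素子", "基板"]
)
```

Here the three phrases beginning with 「半導体」 yield that string as the only candidate; the truncated prefix 「半導」 is dropped because it never occurs without the following character.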

付随語抽出部３６は、主体語辞書を参照して、第１の句において２つの主体語間に２文字以下の文字列（すなわち主体語を構成しない文字列）が存在する場合（Ｓ２０のＹ）、その文字列を付随語の候補として抽出し、各候補の抽出回数を加算する（Ｓ２２）。付随語抽出部３６は、複数の付随語の候補のうち抽出回数が相対的に上位の候補を付随語と判定する。辞書更新部３４は、付随語抽出部３６により付随語と判定された候補文字列を付随語辞書へ格納する（Ｓ２４）。第１の句において２つの主体語間に２文字以下の文字列が存在しなければ（Ｓ２０のＮ）、Ｓ２２とＳ２４の処理をスキップする。 Referring to the subject word dictionary, when a character string of two or fewer characters (that is, a character string that does not form a subject word) exists between two subject words in the first phrase (Y in S20), the incidental word extraction unit 36 extracts that character string as an incidental-word candidate and increments the extraction count of that candidate (S22). The incidental word extraction unit 36 determines that the candidates whose extraction counts rank relatively high among the incidental-word candidates are incidental words. The dictionary update unit 34 stores the candidate character strings determined to be incidental words by the incidental word extraction unit 36 in the incidental word dictionary (S24). If no character string of two or fewer characters exists between two subject words in the first phrase (N in S20), steps S22 and S24 are skipped.

文字列読み込みが未完了の特許文献が残存すれば（Ｓ２６のＮ）、Ｓ１０に戻り、全ての特許文献からの文字列読み込みが完了すれば（Ｓ２６のＹ）、辞書出力部３８は、辞書記憶部２６に記憶された主体語辞書と付随語辞書を自然言語処理装置１６へ送信する（Ｓ２８）。なお、付随語抽出処理は、複数の特許文献それぞれの主体語抽出処理が終了する都度実行してもよく、全ての特許文献の主体語抽出処理が終了した後に実行してもよい。また、辞書出力部３８は、主体語辞書と付随語辞書を所定のサーバへアップロードしてもよく、自然言語処理装置１６は、主体語辞書と付随語辞書をそのサーバからダウンロードしてもよい。 If any patent document remains whose character strings have not yet been read (N in S26), the process returns to S10. Once character strings have been read from all the patent documents (Y in S26), the dictionary output unit 38 transmits the subject word dictionary and the incidental word dictionary stored in the dictionary storage unit 26 to the natural language processing device 16 (S28). The incidental word extraction process may be executed each time the subject word extraction process for an individual patent document finishes, or after the subject word extraction process has finished for all the patent documents. Alternatively, the dictionary output unit 38 may upload the subject word dictionary and the incidental word dictionary to a predetermined server, and the natural language processing device 16 may download them from that server.

図１２は、自然言語処理装置１６の動作を示すフローチャートである。自然言語処理装置１６は、ユーザからの指示を受け付けた場合、および／または、予め定められたタイミングで定期的に、図１２に示す自然言語処理を実行する。実施例での自然言語処理は、特許文献に記載された文字列から単語（特に主体語）を抽出する処理である。 FIG. 12 is a flowchart showing the operation of the natural language processing device 16. The natural language processing device 16 executes the natural language processing shown in FIG. 12 when it receives an instruction from the user and / or periodically at a predetermined timing. The natural language processing in the embodiment is a process of extracting a word (particularly the subject word) from the character string described in the patent document.

不図示の外部装置（ここでは特許検索装置とする。）における検索インデックスデータの更新タイミングに達したことを検出すると（Ｓ３０のＹ）、自然言語処理装置１６の文書読込部５０は、特許文献ＤＢ１２から複数の特許文献を取得し、それぞれの特許文献の文字列を読み込む（Ｓ３２）。スコア設定部５２は、読み込まれた文字列中の各文字から始まる主体語を特定する。スコア設定部５２は、所定のスコア付与基準にしたがって、各文字から始まる主体語それぞれのスコアを導出し、複数の文字に対応する複数の候補エントリを作成する（Ｓ３４）。 Upon detecting that the time to update the search index data in an external device (not shown; here, a patent search device) has arrived (Y in S30), the document reading unit 50 of the natural language processing device 16 acquires a plurality of patent documents from the patent document DB 12 and reads the character strings of each patent document (S32). The score setting unit 52 identifies the subject word starting at each character in the read character string. In accordance with the predetermined scoring criteria, the score setting unit 52 derives a score for each subject word starting at each character and creates a plurality of candidate entries corresponding to the plurality of characters (S34).

単語抽出部５４は、各候補エントリのスコアを大小比較し、その結果に基づいて特許文献の文字列から単語を抽出して、解析結果記憶部４８の抽出単語データを更新する（Ｓ３６）。単語抽出処理が未完了の特許文献が残存すれば（Ｓ３８のＮ）、Ｓ３２に戻る。全ての特許文献からの単語抽出処理が終了すれば（Ｓ３８のＹ）、解析結果提供部５６は、解析結果記憶部４８の抽出単語データを特許検索装置へ送信する（Ｓ４０）。特許検索装置は、例えば、自然言語処理装置１６から提供された抽出単語データを使用して、特許文献検索用のインデックスデータを更新する。特許検索装置における検索インデックスデータの更新タイミングでなければ（Ｓ３０のＮ）、以降の処理をスキップして、本図のフローを終了する。 The word extraction unit 54 compares the scores of each candidate entry by magnitude, extracts words from the character string of the patent document based on the result, and updates the extracted word data of the analysis result storage unit 48 (S36). If the patent document in which the word extraction process is not completed remains (N in S38), the process returns to S32. When the word extraction process from all the patent documents is completed (Y in S38), the analysis result providing unit 56 transmits the extracted word data of the analysis result storage unit 48 to the patent search device (S40). The patent search device updates the index data for patent document search by using, for example, the extracted word data provided by the natural language processing device 16. If it is not the update timing of the search index data in the patent search device (N in S30), the subsequent processing is skipped and the flow of this figure ends.

実施例の情報処理システム１０によると、主体語および付随語を予め定めたデータがなくても、主体語および付随語を自動的に抽出し、主体語辞書および付随語辞書を自動作成することができる。また、複数の句に亘り共通する先頭文字列を主体語として自動抽出するとともに、主体語間の短い文字列を付随語として自動抽出することにより、網羅性の高い主体語辞書および付随語辞書を作成できる。例えば、誤った綴りの文字列は通常辞書から除外されるが、実施例の情報処理システム１０では、実際に文献に記載されていれば主体語または付随語として抽出でき、網羅性の高い形態素辞書を提供できる。また、新たな文献が追加された場合（例えば新たに特許公開公報が発行された場合）に、迅速かつ効率的に主体語辞書および付随語辞書を更新することができる。 According to the information processing system 10 of the embodiment, subject words and incidental words can be extracted automatically, and the subject word dictionary and the incidental word dictionary can be created automatically, even without data that defines subject words and incidental words in advance. In addition, by automatically extracting leading character strings shared across a plurality of phrases as subject words, and automatically extracting short character strings between subject words as incidental words, a subject word dictionary and an incidental word dictionary with high coverage can be created. For example, misspelled character strings are normally excluded from dictionaries, but the information processing system 10 of the embodiment can extract them as subject words or incidental words if they actually appear in the documents, and can therefore provide a morpheme dictionary with high coverage. Furthermore, when new documents are added (for example, when new patent publications are issued), the subject word dictionary and the incidental word dictionary can be updated quickly and efficiently.

また、実施例の情報処理システム１０によると、文献の文字列から単語を抽出する処理において、文字列の各文字から始まる各主体語にスコアを付与する。そして、処理対象位置から始まる第１の主体語のスコアと、第１の主体語の中の２文字目以降から始まる第２の主体語のスコアとの大小関係に応じて、第１の主体語を抽出するか否かを決定する。これにより、適切な位置からの単語の切り出しを実現しやすくなる。さらに実施例では、第１の主体語の直後の文字から始まる第３の主体語のスコアも比較対象に含めることで、より一層適切な位置で単語を抽出することができる。 Also, according to the information processing system 10 of the embodiment, in the process of extracting words from the character string of a document, a score is assigned to each subject word starting at each character of the character string. Whether to extract the first subject word, which starts at the processing target position, is then decided according to how its score compares with the score of a second subject word that starts at the second or a later character within the first subject word. This makes it easier to cut out words at appropriate positions. Furthermore, in the embodiment, the score of a third subject word starting at the character immediately following the first subject word is also included in the comparison, so that words can be extracted at even more appropriate positions.

以上、本発明を実施例をもとに説明した。この実施例は例示であり、それらの各構成要素や各処理プロセスの組合せにいろいろな変形例が可能なこと、またそうした変形例も本発明の範囲にあることは当業者に理解されるところである。 The present invention has been described above based on an embodiment. Those skilled in the art will understand that this embodiment is illustrative, that various modifications are possible in the combination of its components and processing steps, and that such modifications also fall within the scope of the present invention.

A first modification will be described. The first modification permits a compound word composed of a combination of multiple subject words to be registered in the dictionary. Specifically, when a first character string (for example, "Japan" (日本)) and a second character string containing the first character string (for example, "Japanese cuisine" (日本料理)) are extracted as subject word candidates, the dictionary update unit 34 of the dictionary creation device 14 may store both the first character string and the second character string in the subject word dictionary. As a result, the subject word dictionary also contains compound words.

In a character string processed by the natural language processing device 16, multiple subject words (that is, current candidates e0) may start at a given processing target position, as with "Japan" (日本) and "Japanese cuisine" (日本料理). The score setting unit 52 of the natural language processing device 16 calculates a score for each of the one or more subject words (for example, "Japan" and "Japanese cuisine") that start at each character in the character string.

The intermediate candidate score ν used by the word extraction unit 54 of the natural language processing device 16 is calculated by Equation 5.

D0 in Equation 5 is the set of one or more current candidates e0 located at the head of T. For each of the one or more current candidates, the word extraction unit 54 performs the same calculation as Equation 4 of the embodiment to obtain the maximum value of the function S, and then takes the maximum of the function S across all of the current candidates as the intermediate candidate score ν. For example, suppose there are two current candidates (current candidate 1 and current candidate 2) and two next candidates (next candidate 1 and next candidate 2). When ν > score of current candidate 1, ν > score of next candidate 1, ν > score of current candidate 2, and ν > score of next candidate 2 all hold, the word extraction unit 54 may discard the one character at the processing target position and advance the processing target position by one. When these conditions do not all hold, the word extraction unit 54 may extract current candidate 1 as a word if the score of current candidate 1 ≥ the score of current candidate 2, and may extract current candidate 2 as a word if the score of current candidate 1 < the score of current candidate 2.
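The decision rule in this example can be sketched in Python as follows. The sketch assumes that ν has already been computed and simply takes it as an argument; the function name `choose_word` and the representation of candidates as word-to-score dictionaries are hypothetical.

```python
def choose_word(nu, current_scores, next_scores):
    """Decide what to do at the processing target position.

    nu             -- intermediate candidate score (assumed already computed)
    current_scores -- {word: score} for the current candidates (non-empty)
    next_scores    -- {word: score} for the next candidates (may be empty)

    Returns None when one character should be discarded, otherwise the
    current candidate to extract as a word.
    """
    beats_all_current = all(nu > s for s in current_scores.values())
    beats_all_next = all(nu > s for s in next_scores.values())
    if beats_all_current and beats_all_next:
        return None
    # On a tie, max() keeps the candidate listed first, which mirrors
    # "score of current candidate 1 >= score of current candidate 2".
    return max(current_scores, key=current_scores.get)

current = {"日本": 3.0, "日本料理": 4.0}
following = {"料理": 2.0, "店": 1.0}
print(choose_word(5.0, current, following))
print(choose_word(3.5, current, following))
```

The first call returns `None` because ν exceeds every candidate score; the second returns "日本料理", the higher-scoring current candidate.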

The first modification makes it possible to extract compound words as words. Note that in the first modification, as in the embodiment, subject words that are not compound words are still registered in the subject word dictionary, so the comprehensiveness of the dictionary is not reduced.

A second modification will be described. The phrase reading unit 28 of the dictionary creation device 14 may identify, as the target language, the character code category (for example, Latin characters, CJK characters, Arabic characters, or Russian characters) to which the character code values of a character string read from a patent document belong. For example, when the character encoding of the patent document is Unicode, the phrase reading unit 28 may identify the character code category of the character string read from the patent document by referring to the Unicode code chart. As one aspect of the second modification, the phrase reading unit 28 may read consecutive character strings of the target language as phrases while ignoring characters (character code values) that do not belong to the target language, or may read such characters after replacing them with a character that carries no meaning, such as a blank.

As another aspect of the second modification, the phrase reading unit 28 of the dictionary creation device 14 may permit multiple character code categories. For example, the phrase reading unit 28 may read a character string spanning multiple character code categories as a first phrase. In this case, when the character code category changes from a first category to a second category partway through the character string of the patent document, the phrase reading unit 28 may extract the character string belonging to the first category and the character string belonging to the second category as separate first phrases. Alternatively, the phrase dividing unit 30, while dividing the first phrase extracted by the phrase reading unit 28 on the basis of known subject words, may additionally, when the character code category changes from a first category to a second category partway through the first phrase, extract the character string belonging to the first category and the character string belonging to the second category as separate second phrases.
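Both aspects of the second modification can be sketched in Python using the standard `unicodedata` module. Deriving the category from the first word of each character's Unicode name is a crude shortcut assumed here for illustration (it is not the Unicode script property, and it classifies marks such as "ー" as OTHER); the function names `category`, `read_phrases`, and `split_on_category_change` are hypothetical.

```python
import unicodedata
from itertools import groupby

def category(ch):
    """Rough character code category from the first word of the Unicode name."""
    name = unicodedata.name(ch, "")
    first = name.split(" ", 1)[0] if name else "OTHER"
    if first in {"CJK", "HIRAGANA", "KATAKANA"}:
        return "CJK"
    return first if first in {"LATIN", "ARABIC", "CYRILLIC"} else "OTHER"

def read_phrases(text, target):
    """Read runs of consecutive target-category characters as phrases;
    every other character acts as a separator (it is ignored)."""
    return [
        "".join(group)
        for is_target, group in groupby(text, key=lambda c: category(c) == target)
        if is_target
    ]

def split_on_category_change(text):
    """Split a string wherever the character code category changes."""
    return [(cat, "".join(group)) for cat, group in groupby(text, key=category)]

print(read_phrases("日本のLED照明", "CJK"))
print(split_on_category_change("LED照明"))
```

`read_phrases` corresponds to reading only the target language, while `split_on_category_change` corresponds to permitting several categories and separating phrases at each category boundary.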

A third modification will be described. Although not mentioned in the embodiment above, the history of searches against patent documents may be fed back into the dictionary. Specifically, in the subject word dictionary of the dictionary creation device 14, an index value indicating how often each subject word has been used in searches (hereinafter the "search frequency value") is recorded in association with that subject word. The dictionary creation device 14 further includes a search history acquisition unit that periodically acquires, from a patent search device (not shown), data indicating the keywords or key phrases used in patent searches (hereinafter the "search history") and stores it in a search history storage unit of the storage unit 22. The dictionary update unit 34 of the dictionary creation device 14 refers to the search history stored in the search history storage unit and to the subject word dictionary, identifies the subject words contained in the search history, and determines the number of times each subject word was searched for in the search history. The dictionary update unit 34 derives a search frequency value for each subject word from its search count (the search count itself may be used) and stores the value in the subject word dictionary.

The subject word dictionary may store the search frequency value together with the number of occurrences of each subject word in the patent documents of the patent document DB 12, or may store the search frequency value in place of the number of occurrences. When determining the score of a word starting at each character in the character string that has been read, the score setting unit 52 of the natural language processing device 16 may use the search frequency value in place of the number of occurrences used in the embodiment, or may use the search frequency value together with the number of occurrences. For example, the score setting unit 52 may adjust the scores so that a word with a relatively large search frequency value receives a larger score than a word with a relatively small search frequency value.
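A minimal Python sketch of this feedback loop follows. It assumes the search history is a list of query strings, counts a subject word once per query that contains it, and uses the raw count as the search frequency value; the multiplicative adjustment with `weight` is one arbitrary way to make higher-frequency words score higher, and all names are hypothetical.

```python
from collections import Counter

def search_frequency(search_history, subject_words):
    """Count, for each subject word, the queries in which it appears."""
    counts = Counter()
    for query in search_history:
        for word in subject_words:
            if word in query:
                counts[word] += 1
    return counts

def adjusted_score(base_score, word, frequency, weight=0.1):
    """Raise the score of frequently searched words (assumed formula)."""
    return base_score * (1.0 + weight * frequency[word])

history = ["日本料理 レシピ", "日本 観光", "半導体"]
freq = search_frequency(history, {"日本", "日本料理", "半導体"})
print(freq["日本"], freq["日本料理"], freq["半導体"])
print(adjusted_score(1.0, "日本", freq) > adjusted_score(1.0, "半導体", freq))
```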

A fourth modification will be described. Although not mentioned in the embodiment above, the frequency with which each word is used when patent documents are segmented into words may be fed back into the dictionary. Specifically, the dictionary creation device 14 further includes a segmentation history acquisition unit that acquires, from the natural language processing device 16, a segmentation history indicating the number of times or the frequency with which each word was used in segmenting patent documents into words, and stores it in a segmentation history storage unit of the storage unit 22. The dictionary update unit 34 of the dictionary creation device 14 refers to the segmentation history stored in the segmentation history storage unit and determines the usage count or usage frequency of each subject word in the segmentation history. The dictionary update unit 34 stores the usage count or usage frequency of each subject word in the subject word dictionary. When determining the score of a word starting at each character in the character string that has been read, the score setting unit 52 of the natural language processing device 16 may use the usage count or usage frequency in segmentation in place of the number of occurrences at dictionary creation time described in the embodiment. For example, the score setting unit 52 may adjust the scores so that a word with a relatively high usage frequency receives a larger score than a word with a relatively low usage frequency.

A fifth modification will be described. In the embodiment above, the subject word dictionary and the incidental word dictionary are empty in the initial state of dictionary creation; that is, no subject words or incidental words are registered in advance. As a modification, existing dictionaries (for example, a subject word dictionary and an incidental word dictionary created manually in advance) may be stored in the dictionary storage unit 26 of the dictionary creation device 14 in the initial state of dictionary creation. A manually created dictionary contains entries of high correctness and precision but has low comprehensiveness. The dictionary update unit 34 of the dictionary creation device 14 adds the subject words automatically extracted by the subject word extraction unit 32 to the existing subject word dictionary, and adds the incidental words automatically extracted by the incidental word extraction unit 36 to the existing incidental word dictionary. This yields a dictionary that excels in both accuracy and comprehensiveness, and further improves the accuracy and comprehensiveness of natural language processing in the natural language processing device 16.

A sixth modification will be described. The embodiment above proposed a technique for supporting morphological analysis; other uses are given here. The technique described in the embodiment can be applied to (1) tag clouds. For example, by referring to a dictionary that the dictionary creation device 14 has created from a large number of documents and that associates words (subject words and/or incidental words) with their numbers of occurrences, an image can be generated in which multiple words are rendered in a form (shape, pattern, color, and so on) that reflects their numbers of occurrences in the documents.

The technique described in the embodiment can also be applied to (2) the creation of reports on changes in word usage. For example, by referring to a dictionary that the dictionary creation device 14 has created from a large number of documents in a specific field and that associates words (subject words and/or incidental words) with their numbers of occurrences, and by graphing changes in the number of occurrences of words over a given period, information indicating trends and changes in that field can be generated. The technique described in the embodiment can further be applied to (3) the creation of dictionaries for various languages. For example, the dictionary creation device 14 can automatically generate a subject word dictionary and an incidental word dictionary for each language from documents written in that language.

A seventh modification will be described. In the embodiment above, the subject word dictionary and the incidental word dictionary are created from patent documents, but they can of course be created from documents and document data of various types and in various fields. When the natural language processing device 16 performs natural language processing on documents in a specific field, the dictionary creation device 14 may extract subject words and incidental words from documents in that field and thereby create a subject word dictionary and an incidental word dictionary suited to natural language processing of documents in that field.

Any combination of the embodiment and modifications described above is also useful as an embodiment of the present invention. A new embodiment produced by such a combination has the effects of each of the combined embodiment and modifications. Those skilled in the art will also understand that the function to be fulfilled by each constituent element recited in the claims is realized by one of the components shown in the embodiment and modifications, or by their cooperation.

10 information processing system, 14 dictionary creation device, 16 natural language processing device, 26 dictionary storage unit, 32 subject word extraction unit, 34 dictionary update unit, 36 incidental word extraction unit, 46 dictionary storage unit, 52 score setting unit, 54 word extraction unit.

Claims

An information processing system comprising:
a dictionary storage unit that stores a dictionary used in natural language processing by a computer, the dictionary storing subject words, each of which is a word whose meaning is established on its own;
a phrase extraction unit that extracts a plurality of phrases from a character string of a first document, each phrase being a character string remaining after at least the subject words already stored in the dictionary are excluded;
a subject word extraction unit that, when an identical character string exists at the beginning of a predetermined number or more of the plurality of phrases extracted by the phrase extraction unit, extracts the identical character string as a subject word; and
a dictionary update unit that stores the subject word extracted by the subject word extraction unit in the dictionary.

The information processing system according to claim 1, wherein, when both a first character string and a second character string containing the first character string are extracted as subject words, the dictionary update unit stores the first character string in the dictionary and suppresses storage of the second character string in the dictionary.

The information processing system according to claim 1 or 2, further comprising an incidental word extraction unit, wherein
the dictionary storage unit stores a subject word dictionary, which is a dictionary in which the subject words are stored, and an incidental word dictionary, which is a dictionary in which incidental words, being words used in association with the subject words, are stored,
the incidental word extraction unit extracts, as an incidental word, a character string having a predetermined number of characters or fewer from among the character strings remaining after the subject words stored in the subject word dictionary are excluded from the character string of the first document, and
the dictionary update unit stores the incidental word extracted by the incidental word extraction unit in the dictionary.

The information processing system according to claim 3, wherein the incidental word extraction unit extracts, from the remaining character strings, character strings whose length is a predetermined value or less as incidental word candidates, and determines to be incidental words those character strings whose number of extractions as a candidate ranks at or above a predetermined rank.

The information processing system according to claim 1, further comprising a score setting unit and a word extraction unit, wherein
the dictionary storage unit stores a subject word dictionary, which is a dictionary in which the subject words are stored, and an incidental word dictionary, which is a dictionary in which incidental words, being words used in association with the subject words, are stored,
the score setting unit refers to the subject word dictionary to identify the subject words starting at each character in a character string read from a second document that is subject to natural language processing and, among the subject words starting at each character, assigns a larger score to a subject word with a relatively small number of incidental words than to a subject word with a relatively large number of incidental words, and
the word extraction unit extracts a first subject word as a word according to the magnitude relationship between the score assigned to the first subject word, which starts at a processing target position in the character string read from the second document, and the score assigned to a second subject word, which starts at the second or a later character within the first subject word.

The information processing system according to claim 1, further comprising a score setting unit and a word extraction unit, wherein
the score setting unit refers to the dictionary to identify the subject words starting at each character in a character string read from a second document that is subject to natural language processing and, among the subject words starting at each character, assigns a larger score to a subject word with a relatively large number of characters than to a subject word with a relatively small number of characters, and
the word extraction unit extracts a first subject word as a word according to the magnitude relationship between the score assigned to the first subject word, which starts at a processing target position in the character string read from the second document, and the score assigned to a second subject word, which starts at the second or a later character within the first subject word.

The information processing system according to claim 1, further comprising a score setting unit and a word extraction unit, wherein
the dictionary update unit records the number of occurrences of each subject word in the first document in the dictionary,
the score setting unit refers to the dictionary to identify the subject words starting at each character in a character string read from a second document that is subject to natural language processing and, among the subject words starting at each character, assigns a larger score to a subject word with a relatively large number of occurrences than to a subject word with a relatively small number of occurrences, and
the word extraction unit extracts a first subject word as a word according to the magnitude relationship between the score assigned to the first subject word, which starts at a processing target position in the character string read from the second document, and the score assigned to a second subject word, which starts at the second or a later character within the first subject word.

The information processing system according to any one of claims 5 to 7, wherein the word extraction unit extracts the first subject word as a word according to the magnitude relationship among the score assigned to the first subject word, which starts at the processing target position in the character string read from the second document, the score assigned to the second subject word, which starts at the second or a later character within the first subject word, and the score assigned to a third subject word, which starts at the character immediately after the first subject word.

An information processing method in which an information processing device storing a dictionary used in natural language processing by a computer, the dictionary storing subject words, each of which is a word whose meaning is established on its own, executes:
a phrase extraction step of extracting a plurality of phrases from a character string of a first document, each phrase being a character string remaining after at least the subject words already stored in the dictionary are excluded;
a subject word extraction step of, when an identical character string exists at the beginning of a predetermined number or more of the plurality of phrases extracted in the phrase extraction step, extracting the identical character string as a subject word; and
a step of storing the subject word extracted in the subject word extraction step in the dictionary.

A computer program that causes an information processing device storing a dictionary used in natural language processing by a computer, the dictionary storing subject words, each of which is a word whose meaning is established on its own, to realize:
a phrase extraction function of extracting a plurality of phrases from a character string of a first document, each phrase being a character string remaining after at least the subject words already stored in the dictionary are excluded;
a subject word extraction function of, when an identical character string exists at the beginning of a predetermined number or more of the plurality of phrases extracted by the phrase extraction function, extracting the identical character string as a subject word; and
a function of storing the subject word extracted by the subject word extraction function in the dictionary.