JP5239161B2 - Language analysis system, language analysis method, and computer program

Info

Publication number: JP5239161B2
Application number: JP2007000070A
Authority: JP
Inventors: 康秀三浦; 博増市; 大悟杉原; 智子大熊
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2007-01-04
Filing date: 2007-01-04
Publication date: 2013-07-17
Anticipated expiration: 2027-01-04
Also published as: JP2008165675A

Description

The present invention relates to a language analysis system, a language analysis method, and a computer program. More specifically, it relates to a language analysis system, a language analysis method, and a computer program that extract terms through text analysis.

Extracting, from natural-language documents, the terms used in data processing (for example, search keys for database searches or index entries for a term dictionary) is a technique required in many data processing fields.

A collection of text data is called a corpus. Term extraction from the documents in a corpus, that is, extraction of terms as meaningful linguistic units, has long been studied. For an ordinary sentence such as [車が道路を走る] ("a car runs on the road"), a general morphological analysis system can extract morphemes such as [車] (car), [道路] (road), and [走る] (run). A morphological analysis system is known as a system that applies a predefined morphological analysis dictionary, segments text into morphemes (the smallest semantic units) based on the words registered in the dictionary, and assigns parts of speech.

However, it is difficult to segment technical terms from highly specialized fields such as medicine into appropriate morphemes. Few dictionaries comprehensively register the terms used in such fields, so language analysis that relies only on a dictionary fails to extract some technical terms.

Prior art that discloses term extraction from text includes the following. Non-Patent Document 1 (Shimohata, S., Sugio, T., Nagata, J. Retrieving collocations by co-occurrences and word order constraints. Proc. of ACL/EACL-97) discloses a method for extracting collocations from a large text corpus. To extract collocations, the method computes the entropy (amount of information) of each candidate collocation within the corpus and extracts the candidates for which the entropy of the words on both sides exceeds a set threshold.

Patent Document 1 (Japanese Patent Laid-Open No. 9-138801) discloses a technique for extracting arbitrary contiguous character strings from natural-language text while taking the surrounding characters into account. Characters that appear around a contiguous character string in the text are extracted, and when the frequency with which they co-occur with that string exceeds a preset threshold, they are merged with it into one contiguous character string and extracted as a word or idiom.

Non-Patent Document 2 (Nakagawa, Mori, Yumoto. Terminology extraction based on occurrence frequency and connection frequency. Natural Language Processing, Vol. 10, No. 1, January 2003) discloses a method for automatically extracting technical terms from a specialized-domain corpus. Morphological analysis is applied to the corpus to extract nouns, a score is computed from the frequency of each simple or compound noun in the corpus and the frequency of the nouns that adjoin its constituent nouns, and the simple and compound nouns are ranked by that score.

Patent Document 2 (Japanese Patent Laid-Open No. 2006-139686) discloses a technique for high-accuracy extraction of unknown words. Given a text, two processes are performed: one obtains the word sequence of the text using character-string statistics from a previously prepared text set and extracts a set of character strings, and the other performs morphological analysis to obtain a set of morphemes. Of the character strings obtained, those contained in morphemes of specific parts of speech are excluded, and those whose statistic is at or above a certain level are extracted as unknown words.

Patent Document 3 (Japanese Patent Laid-Open No. 2006-31295) discloses a technique that computes word n-gram probabilities from a first, word-segmented text set and a second, unsegmented text set in order to improve the accuracy of natural language processing. The probability that adjacent characters or character types form a word boundary is obtained from the first text set, and a segmentation probability is assigned between each pair of adjacent characters in the second corpus, which improves the accuracy of natural language processing on the second corpus.

Patent Document 1: Japanese Patent Laid-Open No. 9-138801
Patent Document 2: Japanese Patent Laid-Open No. 2006-139686
Patent Document 3: Japanese Patent Laid-Open No. 2006-31295
Non-Patent Document 1: Shimohata, S., Sugio, T., Nagata, J. Retrieving collocations by co-occurrences and word order constraints. Proc. of ACL/EACL-97.
Non-Patent Document 2: Nakagawa, Mori, Yumoto. Terminology extraction based on occurrence frequency and connection frequency. Natural Language Processing, Vol. 10, No. 1, January 2003.

An object of the present invention is to provide a language analysis system, a language analysis method, and a computer program that, in a configuration that extracts terms by text analysis, can also extract terms that are not registered in a dictionary such as a morphological analysis dictionary.

A first aspect of the present invention is a language analysis system comprising:
a character string extraction unit that extracts, from text data, a set of character strings of at most a predetermined number of characters;
a branch state evaluation value calculation unit that, by analyzing the texts in a text database, takes each character string extracted by the character string extraction unit as an analysis target character string and calculates evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character string;
a score setting unit that sets a score for the analysis target character string based on the evaluation values calculated by the branch state evaluation value calculation unit; and
a word determination unit that determines, based on the score set by the score setting unit, whether the analysis target character string is a word.
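The first component listed above, the character string extraction unit, can be illustrated with a minimal sketch. The function name `extract_candidates` and the parameter `max_len` are hypothetical names chosen for this illustration, and the sketch assumes that "a set of character strings of at most a predetermined number of characters" means every contiguous substring up to that length; the source does not spell out the enumeration.

```python
from typing import Set


def extract_candidates(text: str, max_len: int) -> Set[str]:
    """Return every distinct substring of `text` that has 1..max_len characters."""
    candidates: Set[str] = set()
    for start in range(len(text)):
        # Never read past the end of the text.
        longest = min(max_len, len(text) - start)
        for length in range(1, longest + 1):
            candidates.add(text[start:start + length])
    return candidates


# Candidate strings of at most two characters taken from one sample surface form.
candidates = extract_candidates("リンパ節", max_len=2)
print(sorted(candidates, key=lambda s: (len(s), s)))
```

Each returned string would then be treated as an analysis target character string by the later units.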

Further, in one embodiment of the language analysis system of the present invention, the evaluation values calculated by the branch number calculation unit comprise evaluation values indicating the branching state of the character strings that appear outward from the analysis target character string, taken at one or more characters at each end of the analysis target character string, and evaluation values indicating the branching state of the character strings that appear toward the analysis target character string, taken at one or more characters adjoining each end of the analysis target character string; and the score setting unit calculates the score based on the evaluation values calculated by the branch number calculation unit.

Further, in one embodiment of the language analysis system of the present invention, the score setting unit calculates the score based on the minimum among the evaluation values indicating the outward branching state at one or more characters at both ends of the analysis target character string and the evaluation values indicating the branching state toward the analysis target character string at one or more characters adjoining both ends of the analysis target character string.

Further, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises a word division unit that divides the words registered in a word database into units of m characters (where m ≥ 1 is a predetermined number); the branch state evaluation value calculation unit calculates a plurality of evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character strings corresponding to the m-character strings produced by the word division unit; the system further comprises a threshold setting unit that calculates, as a threshold, the average of the plurality of evaluation values calculated by the branch state evaluation value calculation unit, excluding the evaluation values that indicate the branching state of the character strings appearing outward from the ends of the m-character strings; and the word determination unit compares the score set by the score setting unit with the threshold set by the threshold setting unit and determines, according to the comparison result, that the analysis target character string is a word.
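The scoring and threshold embodiments above reduce to two small aggregations, sketched below under stated assumptions. All function names are hypothetical. The source says only that the score is based on the minimum evaluation value and that the threshold is an average; the direction of the final comparison (accepting a candidate when its score is at least the threshold) is an assumption of this sketch, not something the text states.

```python
from statistics import mean
from typing import Sequence


def score_from_boundary_values(values: Sequence[float]) -> float:
    """Score of a candidate: the smallest of its boundary evaluation values."""
    return min(values)


def threshold_from_reference_values(values: Sequence[float]) -> float:
    """Threshold: the average of the evaluation values kept for the m-character units."""
    return mean(values)


def is_word(score: float, threshold: float) -> bool:
    # ASSUMPTION: a candidate is accepted when its score reaches the threshold.
    return score >= threshold


# Illustrative numbers only; they are not taken from the patent.
score = score_from_boundary_values([4.0, 2.5, 3.0, 6.0])
threshold = threshold_from_reference_values([1.0, 2.0, 3.0])
print(score, threshold, is_word(score, threshold))
```

Taking the minimum means that a single boundary with little branching diversity is enough to pull the whole candidate's score down.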

Further, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises a character string filter unit that executes filtering to delete, from the set of character strings extracted by the character string extraction unit, character strings that cannot constitute a word; and the branch state evaluation value calculation unit takes the character strings remaining after filtering by the character string filter unit as analysis target character strings and calculates the evaluation values representing the branching state of the character strings that appear at the boundaries of each analysis target character string.

Further, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises a head/tail character string extraction unit that extracts the leading and trailing character strings of each character string extracted by the character string extraction unit; and the branch state evaluation value calculation unit calculates, based on the leading and trailing character strings extracted by the head/tail character string extraction unit, the evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character string.

Further, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises a partial character string extraction unit that extracts partial character strings from each character string extracted by the character string extraction unit; the branch state evaluation value calculation unit calculates, based on the partial character strings extracted by the partial character string extraction unit, evaluation values representing the branching state of the character strings that appear inside the analysis target character string; and the word determination unit determines whether the analysis target character string is a word based on the score set by the score setting unit and on an internal score set from the evaluation values that the branch state evaluation value calculation unit calculated for the partial character strings.

Further, in one embodiment of the language analysis system of the present invention, the language analysis system is configured to execute word extraction processing on each text stored in the text database.

Further, in one embodiment of the language analysis system of the present invention, the evaluation value calculated by the branch state evaluation value calculation unit is a perplexity representing the number of branches of the character strings that appear at the boundaries of the analysis target character string.

Further, a second aspect of the present invention is a dictionary in which word data extracted through text analysis by a language analysis system is registered, wherein the language analysis system comprises:
a character string extraction unit that extracts, from text data, a set of character strings of at most a predetermined number of characters;
a branch state evaluation value calculation unit that, by analyzing the texts in a text database, takes each character string extracted by the character string extraction unit as an analysis target character string and calculates evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character string;
a score setting unit that sets a score for the analysis target character string based on the evaluation values calculated by the branch state evaluation value calculation unit; and
a word determination unit that determines, based on the score set by the score setting unit, whether the analysis target character string is a word,
and wherein the dictionary holds, as registered data, the word data determined to be words by the word determination unit.

Further, a third aspect of the present invention is a language analysis method for executing language analysis processing in a language analysis system, the method comprising:
a character string extraction step in which a character string extraction unit extracts, from text data, a set of character strings of at most a predetermined number of characters;
a branch state evaluation value calculation step in which a branch number calculation unit, by analyzing the texts in a text database, takes each character string extracted in the character string extraction step as an analysis target character string and calculates evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character string;
a score setting step in which a score setting unit sets a score for the analysis target character string based on the evaluation values calculated in the branch state evaluation value calculation step; and
a word determination step in which a word determination unit determines, based on the score set in the score setting step, whether the analysis target character string is a word.

Further, a fourth aspect of the present invention is a computer program that causes a language analysis system to execute language analysis processing, the program causing execution of:
a character string extraction step that causes a character string extraction unit to extract, from text data, a set of character strings of at most a predetermined number of characters;
a branch state evaluation value calculation step that causes a branch number calculation unit, by analyzing the texts in a text database, to take each character string extracted in the character string extraction step as an analysis target character string and to calculate evaluation values representing the branching state of the character strings that appear at the boundaries of the analysis target character string;
a score setting step that causes a score setting unit to set a score for the analysis target character string based on the evaluation values calculated in the branch state evaluation value calculation step; and
a word determination step that causes a word determination unit to determine, based on the score set in the score setting step, whether the analysis target character string is a word.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments described later and the accompanying drawings. In this specification, a system is a logical collection of a plurality of devices, and the constituent devices are not necessarily housed in the same casing.

According to the invention of claim 1, compared with methods that use words registered in a dictionary to extract unregistered words, character strings that are more valid as words can be extracted even when they occur infrequently.
According to the invention of claim 2, the boundaries of the extracted character strings can be determined more accurately, so character strings that are more valid as words can be extracted.
According to the invention of claim 3, character strings that are more valid as words can be extracted even when a boundary at which the branching state of the appearing character strings has low diversity lies at an end of the character string or adjacent to that end.
According to the invention of claim 4, the threshold for character strings divided into a predetermined number of characters is set more appropriately.
According to the invention of claim 5, unnecessary branch state evaluation processing is reduced.
According to the invention of claim 6, the extraction of a character string can be evaluated based on the branching state at the boundaries of the character string, which is highly appropriate for judging word break positions.
According to the invention of claim 7, the extraction of a character string can be evaluated based on whether partial character strings of the extracted character string exist inside it.
According to the invention of claim 8, text that is easier to obtain and richer in information than a dictionary holding a fixed number of registered words can be used, so the invention is applicable to a wide range of fields and can extract character strings that are more valid as words even when they occur infrequently.
According to the invention of claim 9, a system can be provided in which the conditions for character string extraction are easy to set and adjust.
According to the invention of claim 10, character strings that had not previously been registered can be used as registered words of a dictionary.
According to the invention of claim 11, compared with methods that use words registered in a dictionary to extract unregistered words, character strings that are more valid as words can be extracted even when they occur infrequently.
According to the invention of claim 12, a computer program can be provided that, compared with methods that use words registered in a dictionary to extract unregistered words, extracts character strings that are more valid as words even when they occur infrequently.

The language analysis system, language analysis method, and computer program according to embodiments of the present invention are described in detail below with reference to the drawings.

[Example 1]
The configuration and processing of a language analysis system according to an embodiment of the present invention are described with reference to FIG. 1. As shown in FIG. 1, the language analysis system 100 according to an embodiment of the present invention comprises a text input unit 101, a character string extraction unit 102, a character string filter unit 103, a head/tail character string extraction unit 104, a peripheral character string extraction unit 105, a branch number calculation unit 106, a word division unit 107, a score setting unit 108, a threshold setting unit 109, and a word determination unit 110, as well as a text database 121 and a word database 122.

The language analysis system of this embodiment extracts words from the document text to be analyzed, and is configured so that it can extract words that are not registered in a dictionary and words that do not occur frequently in the text. That is, given a set of texts from which words are to be extracted, the apparatus can extract a character string as an unknown word regardless of the result of morphological analysis or of its occurrence frequency within the text set.

Typical word extraction processing includes, for example, selecting from the text the words registered in a word dictionary (standard dictionary) by performing morphological analysis with reference to that dictionary. When no effective domain dictionary is available, as in the medical field, unknown words are sometimes extracted by treating words that occur frequently in a text set of that domain as equivalent to dictionary entries. With such configurations, however, it is difficult to extract as words those terms that are not registered in the dictionary or character strings that occur infrequently in the text set.

The language analysis system 100 of this embodiment shown in FIG. 1 extracts character strings as words without depending on the registered data of a dictionary, even when the strings occur infrequently in a particular text set. The processing of the language analysis system 100 shown in FIG. 1 is described in detail below.

First, the structure of the text database 121 and the word database 122 used in the language analysis system 100 shown in FIG. 1 is described with reference to FIG. 2. As shown in FIG. 2(a), the text database 121 stores various texts, and each text is assigned an ID as its identifier. This embodiment describes an example that uses a text database of texts from the medical field.

The word database 122 registers data that associates a [surface] form, that is, a morpheme obtained as a word by morphological analysis, with the [type] of the dictionary or text in which that word is recorded. Each registered entry is assigned an [ID] as its identifier.
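The two databases can be pictured as simple record collections. The following is a minimal sketch, assuming plain in-memory records; the class names, field names, ID values, and the `kind` label are hypothetical and serve only as illustration. The two surface forms are the sample entries that this description uses for FIG. 2(b).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """One entry of the text database: an identifier and the text itself."""
    id: int
    text: str


@dataclass(frozen=True)
class WordRecord:
    """One entry of the word database."""
    id: int
    surface: str  # [surface]: the morpheme as written
    kind: str     # [type]: the dictionary or text in which the word is recorded


# Hypothetical IDs and type labels; only the surface forms come from the description.
word_db = [
    WordRecord(id=1, surface="リンパ節", kind="example dictionary"),
    WordRecord(id=2, surface="手関節骨折", kind="example dictionary"),
]

surfaces = [record.surface for record in word_db]
print(surfaces)
```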

The language analysis system of this embodiment shown in FIG. 1 executes a threshold setting process before the word extraction process. The "threshold setting process" and the "word extraction process" executed in the language analysis system of this embodiment are described in turn below.

[Threshold setting process]
When words are extracted from text with the language analysis system 100 of this embodiment shown in FIG. 1, a threshold setting process based on the words registered in the word database 122 is executed first. This process is carried out by the word division unit 107, the branch number calculation unit 106, and the threshold setting unit 109 of the language analysis system 100 shown in FIG. 1. This embodiment uses perplexity, an indicator of the number of branches, as the evaluation value of the branching state. The calculation of the perplexity of character strings is described in detail in, for example, Kenji Kita, "Probabilistic Language Models" (確率的言語モデル), University of Tokyo Press, 1999. The threshold setting process is described below with reference to the flowchart shown in FIG. 3.

First, in step S101, the word division unit 107 divides the surface form of each word registered in the word database 122 into units of m characters and extracts the set of distinct character strings. m is set in advance to a small value of about 1 to 3. The word division process is described here for the following [surface] entries registered in the word database shown in FIG. 2(b):
[リンパ節] (lymph node)
[手関節骨折] (wrist fracture)

Extracting the sets of distinct character strings with m = 1 gives:
[リンパ節] = "リ", "ン", "パ", "節"
[手関節骨折] = "手", "関", "節", "骨", "折"
These sets of "distinct character strings" are extracted.
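Step S101 can be sketched as follows. This is only an illustration: the function name `distinct_m_strings` is hypothetical, and the sketch assumes that "dividing by m characters" means taking every m-character window of the surface form (for m = 1 this is simply each character).

```python
def distinct_m_strings(words, m=1):
    """Collect the distinct m-character strings found in the surface forms.

    Assumption: every m-character window of each word is taken; the patent
    text does not say whether windows overlap when m > 1.
    """
    seen = {}  # a dict keeps first-seen order and drops duplicates
    for word in words:
        for i in range(len(word) - m + 1):
            seen.setdefault(word[i:i + m], None)
    return list(seen)


if __name__ == "__main__":
    # The two surface forms used in the example above.
    print(distinct_m_strings(["リンパ節", "手関節骨折"], m=1))
```

Because "節" occurs in both words, it appears only once in the combined set.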

In the next step S102, the branch number calculation unit (perplexity calculation unit) 106 calculates the perplexity of each "distinct character string" extracted in step S101.

The perplexity of a character string is calculated from the entropy of the character strings that appear to the left and right of a given character string [W]. First, the entropy of the character strings appearing to the left and right of the character string [W] is obtained with the following entropy formula.

In the above formula,
W_L: the set of character strings appearing on the left side
W_R: the set of character strings appearing on the right side
n: the number of character strings in the set
The formula yields the following entropy values:
H(W_L): the entropy of the character strings appearing to the left of the character string [W]
H(W_R): the entropy of the character strings appearing to the right of the character string [W]

Next, the perplexity of the character string [W] is calculated with the following formula.

In the above formula,
PP(W_L): the perplexity of the character strings to the left of the character string [W]
PP(W_R): the perplexity of the character strings to the right of the character string [W]
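The two formulas referred to above are not reproduced in this text. A form consistent with the symbol definitions would be the usual Shannon entropy and the perplexity derived from it, as sketched below. This is an assumption, not the patent's own notation: in particular, P(w_i), the relative frequency with which the i-th string of the set appears next to [W], and the base-2 logarithm are introduced here for illustration.

```latex
% Assumed form: entropy of the strings adjacent to W on each side
H(W_L) = -\sum_{i=1}^{n} P(w_i)\,\log_2 P(w_i), \quad w_i \in W_L
\qquad
H(W_R) = -\sum_{i=1}^{n} P(w_i)\,\log_2 P(w_i), \quad w_i \in W_R

% Assumed form: perplexity as the exponentiated entropy
PP(W_L) = 2^{H(W_L)}, \qquad PP(W_R) = 2^{H(W_R)}
```

Under this form, a string that is always followed by the same neighbor has entropy 0 and perplexity 1, while a string followed by k equally likely neighbors has perplexity k, which matches the reading of perplexity as a number of branches.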

Perplexity is a value that indicates the diversity of the character strings appearing to the left and right of a given character string [W]. It is widely used as an evaluation index for language models, and in this embodiment the perplexity defined by Equation 2 is used as the index of the number of branches. A high perplexity indicates that the character strings appearing to the left and right of the character string [W] are diverse; a low perplexity indicates that they are not diverse, that is, that they are restricted.
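The one-sided perplexity described above can be sketched in a few lines. The function name `side_perplexity`, the representation of the text database as a list of strings, and the use of 2 raised to the base-2 entropy are assumptions made for this sketch; the patent's own formula images are not available here.

```python
import math
from collections import Counter


def side_perplexity(corpus, target, side, m=1):
    """Perplexity of the m-character strings adjacent to `target` on one side.

    `corpus` is a list of text strings standing in for the text database;
    `side` is "left" or "right". Occurrences at a text edge, where no full
    m-character neighbor exists, are skipped (an assumption of this sketch).
    """
    counts = Counter()
    for text in corpus:
        start = text.find(target)
        while start != -1:
            if side == "left":
                neighbor = text[max(0, start - m):start]
            else:
                end = start + len(target)
                neighbor = text[end:end + m]
            if len(neighbor) == m:
                counts[neighbor] += 1
            start = text.find(target, start + 1)
    total = sum(counts.values())
    # Shannon entropy of the neighbor distribution (0 if nothing was seen).
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


if __name__ == "__main__":
    toy_corpus = ["ab", "ac", "ad", "ab"]  # toy data, not from the patent
    print(side_perplexity(toy_corpus, "a", "right"))
```

In the toy corpus, "a" is followed by "b" twice and by "c" and "d" once each, so the entropy is 1.5 bits and the right-side perplexity is 2^1.5, a little under three branches.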

Next, in step S103, the threshold setting unit 109 executes the threshold setting process. Among the perplexities calculated in step S102, those that belong only to the left side of the beginning of a word or to the right side of its end are excluded, the average of the remaining values is taken, and this average is set as the threshold [t]. For example, when m = 1 and the word is "リンパ節", only the perplexities of
the right side of "リ", "ン", "パ", and
the left side of "ン", "パ", "節"
are used to calculate the average perplexity, and this average is taken as the threshold [t]. The perplexities on the left side of "リ" and on the right side of "節" are not used in the average.
The threshold calculation described above is only an example, and other threshold calculation methods may be applied. For example, the midpoint may be used instead of the average, or two separate thresholds may be calculated for the left and right sides.
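The averaging in step S103 can be sketched as follows. The function name `interior_threshold` and the two lookup tables `left_pp` and `right_pp` (perplexities assumed to have been computed in step S102) are hypothetical, and the numbers in the usage example are made up. The sketch also assumes that each distinct (string, side) pair is counted once in the average, which the text does not state explicitly.

```python
def interior_threshold(words, left_pp, right_pp, m=1):
    """Average the perplexities at word-internal boundaries only.

    For each word, the right side of every m-character piece except the last
    and the left side of every piece except the first are used; the left of
    the word head and the right of the word tail are excluded.
    """
    used = set()
    for word in words:
        pieces = [word[i:i + m] for i in range(len(word) - m + 1)]
        for piece in pieces[:-1]:
            used.add((piece, "right"))
        for piece in pieces[1:]:
            used.add((piece, "left"))
    if not used:
        raise ValueError("no word-internal boundary found")
    values = [
        right_pp[piece] if side == "right" else left_pp[piece]
        for piece, side in sorted(used)
    ]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up perplexities, for illustration only.
    right_pp = {"リ": 2.0, "ン": 4.0, "パ": 6.0, "節": 100.0}
    left_pp = {"リ": 100.0, "ン": 8.0, "パ": 10.0, "節": 12.0}
    print(interior_threshold(["リンパ節"], left_pp, right_pp))
```

The two values of 100.0 (left of "リ", right of "節") never enter the average, which is the point of the exclusion rule.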

The threshold [t] is calculated by the process described above, and the word extraction process is performed using this threshold [t].

[Word extraction process]
Next, the word extraction process executed by the language analysis system 100 shown in FIG. 1 is described in detail with reference to the flowchart shown in FIG. 4.

First, in step S201, the text from which words are to be extracted is input to the text input unit 101 of the system. In this embodiment, a processing example is described for the text contained in one medical text, the "Latest Anatomy Glossary" (最新解剖学用語集).
For example, the "Latest Anatomy Glossary" contains text such as the following, in which the entries are separated by line breaks.
(Text example)
翼口蓋神経節の副交感神経根 (parasympathetic root of the pterygopalatine ganglion)
右肺の内側肺底枝（Ｂ７） (medial basal branch of the right lung (B7))
後側頭板間静脈 (posterior temporal diploic vein)
強膜静脈洞 (scleral venous sinus)
...

Next, in step S202, the character string extraction unit 102 splits the input text at delimiters such as line breaks and full stops, and extracts every possible substring from each of the resulting segments. Duplicates are removed. A maximum substring length may also be set to limit the amount extracted. For example, extracting substrings of maximum length 3 from
"強膜静脈洞"
yields the following extracted character strings:
{強, 膜, 静, 脈, 洞, 強膜, 膜静, 静脈, 脈洞, 強膜静, 膜静脈, 静脈洞}
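The substring extraction of step S202 can be sketched as below. The function name `extract_substrings` is hypothetical, and the delimiter set (line break and the Japanese full stop "。") is an assumption standing in for "delimiters such as line breaks and full stops".

```python
import re


def extract_substrings(text, max_len=3):
    """Return the distinct substrings of each segment, up to max_len characters."""
    # Assumed delimiters: newline and the Japanese full stop.
    segments = [seg for seg in re.split(r"[\n。]", text) if seg]
    found = {}  # a dict keeps first-seen order and drops duplicates
    for seg in segments:
        for length in range(1, max_len + 1):
            for i in range(len(seg) - length + 1):
                found.setdefault(seg[i:i + length], None)
    return list(found)


if __name__ == "__main__":
    for s in extract_substrings("強膜静脈洞", max_len=3):
        print(s)
```

For "強膜静脈洞" this yields the same twelve strings as listed above: five of length 1, four of length 2, and three of length 3.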

Next, in step S203, the character string filtering unit 103 executes character string filtering. From the character strings extracted in step S202, it deletes those that contain a specific character pattern, those that begin with a specific character, those that end with a specific character, those that are contained in the word database 122, and so on. For example,
(a) strings that contain any of "が、を、する、と、に、または、による、の、される、のための、および、からの、における、との、への" as a run of hiragana,
(b) strings that begin with any of the characters "ぁぃぅぇぉっゃゅょァィゥェォヵヶッャュョんンー・＋−／％〜：；",
(c) strings that end with any of the characters "・，、〜",
are deleted because they basically cannot be Japanese words. The character string filtering unit 103 holds the character string information to be deleted as registered information in advance and applies it when filtering.

In addition, because the purpose of this embodiment is to extract words that are not registered in a dictionary, the character string filtering unit 103 also deletes, from the character strings extracted in step S202,
(d) words that are already registered in a dictionary.
In this processing example, the entries of the "Latest Anatomy Glossary" and the entries of the IPA dictionary (http://chasen.naist.jp/hiki/ChaSen/), which is known as a standard dictionary for morphological analysis, are registered in the word database 122 in advance, and character strings that match them are deleted.
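Filters (a) to (d) can be sketched as a single predicate. The names `keep_candidate`, `HIRAGANA_UNITS`, `BAD_START`, and `BAD_END` are hypothetical; the character lists are copied from the text above. The sketch reads condition (a) as "some maximal run of hiragana in the string equals one of the listed items", which is an interpretation rather than something the text spells out, and it uses a plain Python set in place of the word database 122.

```python
import re

# (a) hiragana units, (b) forbidden first characters, (c) forbidden last characters
HIRAGANA_UNITS = {
    "が", "を", "する", "と", "に", "または", "による", "の", "される",
    "のための", "および", "からの", "における", "との", "への",
}
BAD_START = "ぁぃぅぇぉっゃゅょァィゥェォヵヶッャュョんンー・＋−／％〜：；"
BAD_END = "・，、〜"


def keep_candidate(s, dictionary):
    """Return True if the non-empty string s survives filters (a) to (d)."""
    # (a) assumed reading: a maximal hiragana run equal to a listed unit
    runs = re.findall(r"[\u3041-\u3096]+", s)
    if any(run in HIRAGANA_UNITS for run in runs):
        return False
    # (b) and (c): forbidden first or last character
    if s[0] in BAD_START or s[-1] in BAD_END:
        return False
    # (d) already registered in the dictionary
    return s not in dictionary


if __name__ == "__main__":
    registered = {"静脈"}  # stand-in for the word database
    for cand in ["下腿骨", "静脈", "右肺の内側", "ン節", "大腿・"]:
        print(cand, keep_candidate(cand, registered))
```

Only "下腿骨" passes in this example: "静脈" is already registered, "右肺の内側" contains the particle "の", "ン節" starts with a forbidden character, and "大腿・" ends with one.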

Furthermore, when the only purpose is to extract words that conventional methods do not extract, the configuration may also delete
(e) character strings judged to be words that conventional methods can extract.
For example, character strings that occur frequently in the text stored in the text database 121 are words that can also be extracted by conventional methods, and the character strings corresponding to such words may be deleted.
Alternatively, the configuration may leave the character strings judged to be extractable by conventional methods in place. In that case, these character strings can also be extracted as words by the processing of this embodiment.

In step S203, the character string filtering unit 103 executes the filtering described above. In this embodiment, the following are deleted from the character strings extracted in step S202:
(1) character strings that basically cannot be Japanese words ((a) to (c) above),
(2) words already registered in a dictionary ((d) above),
(3) character strings that occur frequently in the text stored in the text database 121 ((e) above).
As a result of this filtering, the character strings that fall under (1) to (3) are removed from those extracted in step S202, and the remaining character strings are selected as the analysis target character strings. Here, the "substrings extracted from the Latest Anatomy Glossary whose frequency of occurrence is 5 or less" form the analysis target character string set. For example, the following character strings are extracted.
(Analysis target character string set)
BIO
下腿骨
助骨部
中葉枝

Next, in step S204, the first character string in the analysis target character string set selected in step S203 is taken as the analysis target character string, and in step S205 the head/tail character string extraction unit 104 extracts the first m characters and the last m characters of the target character string. Here m is a preset value of 1 or more.
For example, with m = 1 and the selected analysis target character string "下腿骨" (lower leg bone), the leading "下" and the trailing "骨" are extracted.

The next step S206 is performed by the surrounding character string acquisition unit 105. By analyzing the text data stored in the text database 121, it extracts the set of m-character surrounding character strings that appear adjacent to the target character string.
For example, the surrounding character strings of the character string "下腿骨" are extracted from the medical text collection stored in the text database 121. With m = 1, the one character to the left and the one character to the right of "下腿骨" are extracted. As a result, two surrounding character strings are obtained on each side:
left surrounding character strings: "・", "、"
right surrounding character strings: "に", "折"

Specifically, this is the result obtained when the following passages are found in the medical text collection stored in the text database 121:
(a) 「・・大腿・下腿骨に広範・・・」
(b) 「・・象で、下腿骨折の影・・・」
That is,
from (a), the one character "・" to the left of "下腿骨" and the one character "に" to its right, and
from (b), the one character "、" to the left of "下腿骨" and the one character "折" to its right
are extracted as surrounding character strings.
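Steps S205 and S206 for this example can be sketched as follows. The function name `surrounding_strings` is hypothetical, and the two-element list stands in for the text database 121, holding only the visible parts of passages (a) and (b).

```python
def surrounding_strings(corpus, target, m=1):
    """Return the distinct m-character strings to the left and right of target."""
    left, right = [], []
    for text in corpus:
        start = text.find(target)
        while start != -1:
            end = start + len(target)
            before = text[max(0, start - m):start]
            after = text[end:end + m]
            # Keep only full m-character neighbors, each one once.
            if len(before) == m and before not in left:
                left.append(before)
            if len(after) == m and after not in right:
                right.append(after)
            start = text.find(target, start + 1)
    return left, right


if __name__ == "__main__":
    passages = ["大腿・下腿骨に広範", "象で、下腿骨折の影"]
    target = "下腿骨"
    print(target[:1], target[-1:])                 # head and tail strings (S205)
    print(surrounding_strings(passages, target))   # neighbors (S206)
```

With these two passages the sketch returns "・" and "、" on the left and "に" and "折" on the right, as in the description.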

The next step S207 is performed by the branch number calculation unit (perplexity calculation unit) 106. For the passages containing the analysis target character string "下腿骨" that were found in the surrounding character string extraction of step S206, it calculates the perplexity, within the text data stored in the text database 121, of the following sets of character strings:
the head and tail character strings of the analysis target character string "下腿骨", and
the left and right surrounding character strings of the analysis target character string "下腿骨".
In this processing example, the following two passages containing the analysis target character string "下腿骨" were found in the surrounding character string extraction of step S206:
(a) 「・・大腿・下腿骨に広範・・・」
(b) 「・・象で、下腿骨折の影・・・」

From these two passages, the branch number calculation unit (perplexity calculation unit) 106 selects
the head and tail character strings of the analysis target character string "下腿骨", and
the left and right surrounding character strings of the analysis target character string "下腿骨",
and calculates the perplexity of each. A specific processing example is described with reference to FIGS. 5 and 6.

As shown in FIG. 5, from the extracted passage
(a) 「・・大腿・下腿骨に広範・・・」
the head character string [下] and the tail character string [骨] of the analysis target character string "下腿骨" are selected, and
the left-side perplexity of the head character string [下] of "下腿骨" and
the right-side perplexity of the tail character string [骨] of "下腿骨"
are calculated. In addition,
the right-side perplexity of the left surrounding character string [・] of "下腿骨" and
the left-side perplexity of the right surrounding character string [に] of "下腿骨"
are calculated by processing the text stored in the text database 121.

As explained in the [Threshold setting process] section, perplexity is calculated from the entropy of the character strings that appear to the left and right of a given character string [W], and indicates the diversity of the character strings appearing to the left and right of [W]:
PP(W_L): the perplexity of the character strings to the left of the character string [W]
PP(W_R): the perplexity of the character strings to the right of the character string [W]

Further, as shown in FIG. 6, character strings for perplexity calculation are selected in the same way from the other extracted passage,
(b) 「・・象で、下腿骨折の影・・・」
Since the head character string [下] and the tail character string [骨] of the analysis target character string "下腿骨" have already been selected,
the right-side perplexity of the left surrounding character string [、] of "下腿骨" and
the left-side perplexity of the right surrounding character string [折] of "下腿骨"
are calculated by processing the text stored in the text database 121.

The branch number calculation unit (perplexity calculation unit) 106 calculates the following values:
(1) the left-side perplexity of the head character string [下] of the analysis target character string "下腿骨",
(2) the right-side perplexity of the tail character string [骨] of the analysis target character string "下腿骨",
(3) the average of the right-side perplexities of the left surrounding character strings [・] and [、] of the analysis target character string "下腿骨",
(4) the average of the left-side perplexities of the right surrounding character strings [に] and [折] of the analysis target character string "下腿骨".

In the next step S208, the score of the target character string is set from the calculated perplexities. For example, the minimum of the following four values is set as the score of the target character string:
(1) the perplexity a on the left side of the first m characters of the analysis target character string,
(2) the perplexity b on the right side of the last m characters of the analysis target character string,
(3) the average c of the right-side perplexities of the set of m-character surrounding character strings to the left of the analysis target character string,
(4) the average d of the left-side perplexities of the set of m-character surrounding character strings to the right of the analysis target character string.
With this setting, even a character string having a boundary with a small number of branches can be extracted.
This way of setting the score is only an example, and other methods may be used. For example, the average of the four values a to d may be used as the score; in that case character strings whose boundaries are of a similar degree are extracted as words, which makes it easier to extract accurate words.
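The score of step S208 can be sketched as below. The function name `boundary_score` and its `mode` parameter are hypothetical, and the perplexity values in the usage example are made up for illustration; they are not taken from the patent.

```python
def boundary_score(a, b, left_neighbor_pps, right_neighbor_pps, mode="min"):
    """Combine the four boundary perplexities a, b, c, d into one score.

    a: left-side perplexity of the head m characters
    b: right-side perplexity of the tail m characters
    left_neighbor_pps:  right-side perplexities of the left neighbors (averaged into c)
    right_neighbor_pps: left-side perplexities of the right neighbors (averaged into d)
    mode: "min" for the minimum of the four values, "mean" for their average
    """
    c = sum(left_neighbor_pps) / len(left_neighbor_pps)
    d = sum(right_neighbor_pps) / len(right_neighbor_pps)
    values = [a, b, c, d]
    if mode == "min":
        return min(values)
    if mode == "mean":
        return sum(values) / len(values)
    raise ValueError("mode must be 'min' or 'mean'")


if __name__ == "__main__":
    # Made-up perplexities for a candidate with two neighbors on each side.
    print(boundary_score(40.0, 50.0, [30.0, 50.0], [20.0, 60.0], mode="min"))
    print(boundary_score(40.0, 50.0, [30.0, 50.0], [20.0, 60.0], mode="mean"))
```

Here c and d both average to 40.0, so the minimum-based score is 40.0 and the average-based score is 42.5.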

The four perplexities a to d express the perplexity at the boundaries of a character string within the text database 121. As long as the text database 121 contains a reasonable amount of text, the frequencies of m-character strings with small m are, in most cases, high enough for the calculation to be sufficiently reliable. Therefore, reliable values of the four perplexities a to d are obtained in most cases even when the target character string itself occurs infrequently.

The word determination in the next step S209 is performed by the word determination unit 110. The word determination unit 110 receives the score of the analysis target character string from the score setting unit 109 and the threshold [t] from the threshold setting unit 109. The threshold [t] received from the threshold setting unit 109 is the one set in the threshold setting process described earlier with reference to the flowchart of FIG. 3.

The word determination unit 110 recognizes the analysis target character string as a word if the score received from the score setting unit 109 is greater than or equal to the threshold t received from the threshold setting unit 109. That is, it recognizes the string as an unknown word that has no dictionary entry. For example, suppose that
threshold t = 10.0
and that the score set for the analysis target character string "下腿骨" from the perplexity calculation over the text database 121 is
score = 34.95.
Then
score: 34.95 ≥ 10.0 (threshold)
holds, so "下腿骨" is recognized as a word.
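The decision in step S209 is a simple comparison, sketched below together with a small loop over candidates. The function names `is_word` and `select_words` are hypothetical; the score 34.95 and threshold 10.0 are the values from the example above, while the second candidate and its score are made up.

```python
def is_word(score, threshold):
    """A candidate is accepted as a word when its score is at least the threshold."""
    return score >= threshold


def select_words(scores, threshold):
    """Return the candidates whose score reaches the threshold, in input order."""
    return [cand for cand, score in scores.items() if is_word(score, threshold)]


if __name__ == "__main__":
    t = 10.0
    scores = {
        "下腿骨": 34.95,  # score from the example in the text
        "腿骨に": 3.2,    # hypothetical low-scoring candidate
    }
    print(select_words(scores, t))
```

Only "下腿骨" is selected, since 34.95 is at least 10.0 while the hypothetical second score is not.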

In step S210, it is determined whether any unprocessed analysis target character string remains. If one does, the next analysis target character string is selected, steps S205 to S209 are repeated, and it is decided whether that character string is recognized as a word. This is done for every analysis target character string, and the process ends when no unprocessed character string remains.

The words extracted in this way can be used, for example, by registering them in a dictionary database applied to morphological analysis. Since the processing example above extracts words that are technical terms in the medical field, they can be set as registered words in, for example, a medical word dictionary or a specialized dictionary for morphological analysis. Performing morphological analysis with a dictionary in which the extracted words are registered makes it possible to extract words that could not be extracted with a conventional dictionary. The extracted words can also be used, for example, as words for data retrieval. This allows even a user without medical knowledge to perform effective data retrieval using the medical terms obtained by the processing described above. The resulting dictionary may be a database managed separately from other dictionaries, with only the character strings extracted by this method stored in a storage device, or the strings may be added to another dictionary database as described above. Information indicating that a character string was extracted by this method, and also the score of the character string, can be stored in association with the character string. In that case, when the extracted character string is used in language processing, its use (for example, whether to adopt the character string, or how to display it) can be controlled based on this information.

[Evaluation of the word extraction process of this embodiment]
The results of the word extraction process according to the embodiment of the present invention described above were evaluated, and the evaluation is described below. The following were compared:
(a) the result of performing the processing according to the embodiment described above, and
(b) the result of performing the word extraction process disclosed in Non-Patent Document 1 described earlier (Shimohata, S., Sugio, T., Nagata, J., "Retrieving collocations by co-occurrences and word order constraints", Proc. of ACL/EACL-97), that is, a process that, in extracting collocations, calculates the entropy (amount of information) of each collocation within the corpus and extracts the collocations for which the entropy of the words on both sides exceeds a set threshold.

The processing targets were the text data contained in
(1) the "Latest Anatomy Glossary" (最新解剖学用語集), and
(2) the "MEDIS Standard Disease Name Master 2.4.2" (http://www.medis.or.jp/).
For the character strings in these texts whose frequency of occurrence in about 65,000 actual medical texts is 5 or less,
(a) result A was obtained by performing the processing of the above embodiment, extracting the top 200 character strings in score order, and removing those contained in the IPA dictionary, and
(b) result B was obtained by extracting the top 200 character strings, in score order, from those produced by the method described in the prior art (Non-Patent Document 1), and removing those contained in the IPA dictionary.
A physician, as a domain expert, checked which of the strings in results A and B are valid words, giving the evaluation results shown in FIG. 7.

FIG. 7(1) compares the word extraction results of this embodiment and of the conventional method on the text data contained in the "Latest Anatomy Glossary", and FIG. 7(2) compares the word extraction results of this embodiment and of the conventional method on the text data contained in the "MEDIS Standard Disease Name Master 2.4.2".

The "accuracy" (正答率) is the proportion of the target character strings that are valid words. RR_SUM is the sum of the reciprocal ranks (the reciprocals of the ranks at which correct answers appear) and is an evaluation index that takes the ranking into account. As is clear from FIG. 7, for the "Latest Anatomy Glossary" the processing of this embodiment improved on the conventional method by
16.8% in accuracy and
0.29 in RR_SUM,
and for the "Standard Disease Name Master" it improved on the conventional method by
0.6% in accuracy and
0.52 in RR_SUM,
confirming the effectiveness of the present invention.
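The two evaluation indices can be sketched as follows. The function names `accuracy` and `rr_sum` are hypothetical, the input is assumed to be a list of expert judgments ordered by descending score, and the judgments in the usage example are made up; they are not the FIG. 7 data.

```python
def accuracy(judgments):
    """Proportion of ranked candidates judged to be valid words."""
    return sum(1 for ok in judgments if ok) / len(judgments)


def rr_sum(judgments):
    """Sum of reciprocal ranks (1/rank) over the candidates judged correct.

    Ranks start at 1 for the highest-scoring candidate, so correct answers
    near the top of the list contribute more than those further down.
    """
    return sum(1.0 / rank for rank, ok in enumerate(judgments, start=1) if ok)


if __name__ == "__main__":
    # Made-up judgments for four ranked candidates (True = valid word).
    ranked = [True, False, True, True]
    print(accuracy(ranked))
    print(rr_sum(ranked))
```

For these four judgments the accuracy is 0.75 and RR_SUM is 1 + 1/3 + 1/4; moving the incorrect candidate from rank 2 to rank 4 would leave the accuracy unchanged but raise RR_SUM, which is why the latter is described as taking the ranking into account.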

[Embodiment 2]
Next, the processing of Embodiment 2 of the language analysis system of the present invention is described. FIG. 8 shows the processing flow executed by the language analysis system of Embodiment 2. The system configuration described with reference to FIG. 1 also applies to Embodiment 2. This embodiment is a processing example in which processing is executed for each text contained in the text database 121. For example, as described with reference to FIG. 2(a), the text database 121 stores various texts in association with identifiers (IDs). In this embodiment, the word extraction process is executed for each text to which such an identifier is assigned. Obtaining words (for example, unknown words) for each text makes it possible to obtain word information corresponding to each text. It also makes it possible, for example, to rank the words extracted from each text by score.

The processing sequence of this embodiment is described with reference to the flowchart shown in FIG. 8. In the processing flow shown in FIG. 8, steps S302 to S311 are the same as steps S201 to S210 in the processing flow of Embodiment 1 described earlier with reference to FIG. 4.

In this embodiment, first, in step S301, the first text to be analyzed is selected from the text database 121. In the example shown in FIG. 2(a), the first registered text, namely
ID1: 両側肺野に優位な・・・
is selected as the analysis target text.

Next, in step S302, the selected text is input to the text input unit 101 of the system.
Next, in step S303, the character string extraction unit 102 splits the input text at delimiters such as line breaks and full stops, and extracts every possible substring from each of the resulting segments.
Next, in step S304, the character string filtering unit 103 executes character string filtering. As described for the first embodiment, this process deletes the following character strings:
(1) character strings that basically cannot be Japanese words,
(2) words already registered in the dictionary, and
(3) character strings that appear with high frequency in the texts stored in the text database 121.
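Steps S303 and S304 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the delimiter set, the function names, the maximum substring length `max_len`, and the frequency cutoff `max_count` are all hypothetical, and filter condition (1) is omitted because its concrete rules are not given in this section.

```python
import re

# Hypothetical delimiter set: line breaks plus common Japanese/Western punctuation.
DELIMITERS = re.compile(r"[\n。、．，.,!?！？]+")


def extract_substrings(text, max_len):
    """Split text at delimiters and return every substring of 1..max_len characters."""
    found = set()
    for segment in DELIMITERS.split(text):
        for start in range(len(segment)):
            last_end = min(start + max_len, len(segment))
            for end in range(start + 1, last_end + 1):
                found.add(segment[start:end])
    return found


def count_occurrences(candidate, corpus):
    """Count occurrences of candidate (overlaps included) across all corpus texts."""
    total = 0
    for text in corpus:
        pos = text.find(candidate)
        while pos != -1:
            total += 1
            pos = text.find(candidate, pos + 1)
    return total


def filter_candidates(candidates, dictionary, corpus, max_count):
    """Apply filter conditions (2) and (3): drop dictionary words and
    strings that occur more than max_count times in the corpus."""
    kept = set()
    for cand in candidates:
        if cand in dictionary:
            continue
        if count_occurrences(cand, corpus) > max_count:
            continue
        kept.add(cand)
    return kept
```

Because substrings are taken per segment, no candidate spans a delimiter, which keeps the candidate set from growing across sentence boundaries.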

Next, in step S305, the first character string is taken from the set of analysis target character strings selected in step S304 as the analysis target character string. In step S306, the head/end character string extraction unit 104 extracts the first m characters and the last m characters of the target character string, where m is a preset value of 1 or more.

The next step, S307, is performed by the peripheral character string acquisition unit 105. By analyzing the text data stored in the text database 121, it extracts the set of m-character peripheral character strings that appear adjacent to the target character string.
The next step, S308, is performed by the branch number calculation unit (perplexity calculation unit) 106. For the following character strings, which make up the documents containing the analysis target character string and were extracted in the peripheral character string extraction of step S307,
the head and end character strings of the analysis target character string, and
the peripheral character strings to the left and right of the analysis target character string,
it calculates the perplexity within the text data stored in the text database 121. Specifically, it calculates the following values:
(1) the left perplexity of the head character string of the analysis target character string,
(2) the right perplexity of the end character string of the analysis target character string,
(3) the average right perplexity of the peripheral character strings to the left of the analysis target character string, and
(4) the average left perplexity of the peripheral character strings to the right of the analysis target character string.
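The left and right perplexities used in step S308 can be illustrated with a short sketch. It assumes the usual definition of perplexity as 2 raised to the entropy of the distribution of neighboring strings; the function names are hypothetical, occurrences at the edge of a text are simply skipped, a string with no neighbors gets 0.0, and the averages in (3) and (4) are taken unweighted over distinct neighbors. None of these choices is confirmed by the text above.

```python
import math
from collections import Counter


def neighbor_counts(target, corpus, side, m=1):
    """Count the m-character strings found immediately left or right of target."""
    counts = Counter()
    for text in corpus:
        pos = text.find(target)
        while pos != -1:
            if side == "left":
                neighbor = text[max(0, pos - m):pos]
            else:
                end = pos + len(target)
                neighbor = text[end:end + m]
            if len(neighbor) == m:  # assumption: skip occurrences at a text edge
                counts[neighbor] += 1
            pos = text.find(target, pos + 1)
    return counts


def perplexity(counts):
    """Return 2 ** entropy of the neighbor distribution (0.0 if there are no neighbors)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def mean_neighbor_perplexity(target, corpus, side, m=1):
    """Average, over the distinct neighbors on `side`, of each neighbor's
    perplexity in the direction facing the target (values (3) and (4))."""
    neighbors = neighbor_counts(target, corpus, side, m)
    facing = "right" if side == "left" else "left"
    values = [perplexity(neighbor_counts(n, corpus, facing, m)) for n in neighbors]
    return sum(values) / len(values) if values else 0.0
```

With this definition, a string that is always followed by the same character has a right perplexity of 1, and one followed by k equally likely characters has a right perplexity of k, which matches the "average branch number" reading of the term.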

The next step, S309, is performed by the score setting unit 108, which sets the score of the target character string from the calculated perplexities. For example, given
(1) the perplexity a on the left side of the first m characters of the analysis target character string,
(2) the perplexity b on the right side of the last m characters of the analysis target character string,
(3) the average c of the right-side perplexities of the set of m-character peripheral character strings to the left of the analysis target character string, and
(4) the average d of the left-side perplexities of the set of m-character peripheral character strings to the right of the analysis target character string,
the minimum of these four values a to d is set as the score of the target character string. The processing example described above with reference to FIGS. 5 and 6 is the case where m = 1.

The word determination in the next step, S310, is performed by the word determination unit 110. The word determination unit 110 receives the score of the analysis target character string from the score setting unit 108 and the threshold [t] from the threshold setting unit 109. The threshold [t] received from the threshold setting unit 109 is the threshold [t] set in the threshold setting process described above with reference to the flowchart of FIG. 3.

If the score of the analysis target character string received from the score setting unit 108 is equal to or greater than the threshold t received from the threshold setting unit 109, the word determination unit 110 recognizes the character string as a word, that is, as an unknown word that is not registered in the dictionary.
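The score setting of step S309 and the threshold comparison of step S310 reduce to a minimum and a comparison. The sketch below uses hypothetical names (`BoundaryPerplexities`, `score`, `is_word`) and made-up perplexity values purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class BoundaryPerplexities:
    a: float  # left perplexity of the first m characters
    b: float  # right perplexity of the last m characters
    c: float  # mean right perplexity of the left peripheral strings
    d: float  # mean left perplexity of the right peripheral strings


def score(p):
    """Score of the target string: the minimum of the four values a to d."""
    return min(p.a, p.b, p.c, p.d)


def is_word(p, threshold):
    """Recognize the string as a word when its score is at least the threshold t."""
    return score(p) >= threshold


example = BoundaryPerplexities(a=5.0, b=3.2, c=4.1, d=6.0)  # illustrative values only
print(score(example), is_word(example, threshold=3.0))
```

Taking the minimum means a candidate passes only if all four boundary measures are high, so a single weak boundary is enough to reject it.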

Next, in step S311, it is determined whether any unprocessed analysis target character string remains. If one remains, the next analysis target character string is selected and steps S306 to S310 are repeated to decide whether that character string is recognized as a word. This is done for all analysis target character strings, and when no unprocessed character string remains, the process proceeds to step S312.

In step S312, it is determined whether any unprocessed analysis target text remains in the text database 121. If one remains, the next analysis target text is selected and steps S302 to S311 are repeated. This is done for all analysis target texts, and when no unprocessed text remains, the process ends.

In this processing example, the word extraction process is executed for each identifier-assigned text stored in the text database 121. Obtaining words (for example, unknown words) text by text in this way yields word information corresponding to each text, and also makes it possible, for example, to rank the words extracted from each text by score.
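The per-text loop of steps S301 to S312 and the score-ordered ranking it enables might look like the following sketch. The candidate generator and the scoring function are passed in as parameters so that the example stays self-contained; their names, and the tie-breaking by string order, are assumptions made here.

```python
def extract_words_per_text(texts, candidates_for, score_for, threshold):
    """Run word extraction text by text.

    texts: dict mapping identifier -> text.
    candidates_for: function returning the filtered candidate strings of a text.
    score_for: function returning the score of a candidate string.
    Returns a dict mapping identifier -> list of (word, score), highest score first.
    """
    results = {}
    for text_id, text in texts.items():
        accepted = []
        for cand in candidates_for(text):
            s = score_for(cand)
            if s >= threshold:
                accepted.append((cand, s))
        # Rank by descending score; ties are broken by the string itself.
        accepted.sort(key=lambda pair: (-pair[1], pair[0]))
        results[text_id] = accepted
    return results
```

Keeping the results keyed by identifier is what allows the extracted words to be associated with the individual text they came from.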

In this processing example as well, the extracted words can be registered, for example, in a word dictionary for a specialized field or in a specialized dictionary for morphological analysis. Performing morphological analysis with a dictionary in which the extracted words are registered makes it possible to extract words that conventional dictionaries could not, and the extracted words can also be used as search terms for data retrieval. As a result, even a user without knowledge of the specialized field can perform effective data retrieval using the technical terms obtained by the processing described above.

[Embodiment 3]
Next, a third embodiment of the language analysis system of the present invention will be described with reference to FIG. 9. In the first and second embodiments described above, using for example the head/end character string extraction unit 104 in the system configuration of FIG. 1, the perplexity within the text data stored in the text database 121 was calculated for the following sets of m-character strings making up the documents that contain the analysis target character string:
the m-character strings at the head and end of the analysis target character string, and
the m-character strings adjacent to the left and right of the analysis target character string.

That is, in the first and second embodiments described above, the following values were calculated:
(1) the left perplexity of the m-character string at the head of the analysis target character string,
(2) the right perplexity of the m-character string at the end of the analysis target character string,
(3) the average right perplexity of the m-character peripheral strings to the left of the analysis target character string, and
(4) the average left perplexity of the m-character peripheral strings to the right of the analysis target character string.

In the third embodiment, the head/end character string extraction unit 104 shown in FIG. 1 is replaced with a partial character string extraction unit 301, as shown in FIG. 9. The partial character string extraction unit 301 extracts not only the head and end character strings of the analysis target character string but all m-character substrings that make up the analysis target character string, and the perplexity corresponding to each extracted m-character substring is calculated.

In the first and second embodiments, with m = 1 for example, the head/end character string extraction unit 104 extracted the first and last m = 1 characters of the target character string; if the analysis target character string was "下腿骨" (lower leg bone), it extracted the head character "下" and the end character "骨".
In this embodiment, by contrast, with m = 1 the partial character string extraction unit 301 extracts every substring of m = 1 character from the target character string. If the analysis target character string is "下腿骨", it extracts the head character "下", the end character "骨", and also the middle character "腿".
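The difference between the two extraction units can be shown in a few lines. The function names below are hypothetical; the sketch only contrasts taking the head and end m characters with taking every m-character substring.

```python
def head_and_end(target, m=1):
    """Embodiments 1 and 2: only the first m and the last m characters."""
    return target[:m], target[-m:]


def m_char_substrings(target, m=1):
    """Embodiment 3: every contiguous m-character substring, in order."""
    return [target[i:i + m] for i in range(len(target) - m + 1)]


print(head_and_end("下腿骨"))       # ('下', '骨')
print(m_char_substrings("下腿骨"))  # ['下', '腿', '骨']
```

For m = 2 the same string yields "下腿" and "腿骨", so the number of internal pieces shrinks as m grows.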

The branch number calculation unit (perplexity calculation unit) 106 calculates the perplexity corresponding to each character string extracted by the partial character string extraction unit 301. For example, when m = 1 and the analysis target character string is "下腿骨", it calculates the perplexities at the boundaries of the analysis target character string described for the first embodiment with reference to FIGS. 5 and 6 and, as shown in FIG. 10, also calculates the following perplexities:
(a) the perplexity on the right side of "下",
(b) the perplexities on the right side and the left side of "腿", and
(c) the perplexity on the left side of "骨".

Furthermore, the internal score setting unit 302 calculates each of the following averages:
(1) the average of the right-side perplexities of "下" and "腿", and
(2) the average of the left-side perplexities of "腿" and "骨",
and sets the maximum of these averages as the internal score.
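The internal score can be sketched as below. The text gives only the three-character example, so generalizing it to "right perplexities of every piece but the last, left perplexities of every piece but the first" is an assumption made here, as are the function name and the sample values.

```python
def internal_score(right_pp, left_pp):
    """Internal score of a candidate split into m-character pieces.

    right_pp[i] and left_pp[i] are the right and left perplexities of the
    i-th piece. Only the sides facing the inside of the candidate are used.
    """
    if len(right_pp) != len(left_pp) or len(right_pp) < 2:
        raise ValueError("need matching lists covering at least two pieces")
    inner_right = right_pp[:-1]  # e.g. right sides of "下" and "腿"
    inner_left = left_pp[1:]     # e.g. left sides of "腿" and "骨"
    mean_right = sum(inner_right) / len(inner_right)
    mean_left = sum(inner_left) / len(inner_left)
    return max(mean_right, mean_left)


# Illustrative values only, for the pieces "下", "腿", "骨".
print(internal_score(right_pp=[2.0, 4.0, 9.0], left_pp=[8.0, 1.0, 3.0]))  # 3.0
```

A low internal score indicates that the pieces of the candidate predict one another strongly, which is what one would expect inside a genuine word.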

As in the first embodiment, the score setting unit 108 takes the perplexities at the boundaries of the analysis target character string described with reference to FIGS. 5 and 6, namely
(1) the perplexity a on the left side of the first m characters of the analysis target character string,
(2) the perplexity b on the right side of the last m characters of the analysis target character string,
(3) the average c of the right-side perplexities of the set of m-character peripheral character strings to the left of the analysis target character string, and
(4) the average d of the left-side perplexities of the set of m-character peripheral character strings to the right of the analysis target character string,
and sets the minimum of these four values a to d as the boundary score [S] of the target character string. The processing example described above with reference to FIGS. 5 and 6 is the case where m = 1.

The word determination unit 110 compares the boundary score [S] set by the score setting unit 108 with the internal score [SIN] set by the internal score setting unit 302 and, if the boundary score [S] is greater than the internal score [SIN], recognizes the character string as a word, that is, as an unknown word that is not registered in the dictionary.
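The determination in the third embodiment replaces the fixed threshold with the internal score. A minimal sketch, with a hypothetical function name and illustrative values:

```python
def is_word_by_internal_score(boundary_values, internal_score):
    """Recognize the candidate as a word when the boundary score S, the
    minimum of the four boundary values a to d, exceeds the internal score SIN."""
    boundary_score = min(boundary_values)
    return boundary_score > internal_score


print(is_word_by_internal_score([5.0, 4.0, 6.0, 7.0], internal_score=3.0))  # True
print(is_word_by_internal_score([5.0, 3.0, 6.0, 7.0], internal_score=3.0))  # False
```

Note that the comparison here is strict, whereas the threshold test of the first and second embodiments accepts a score equal to the threshold.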

With this configuration, extracting the substrings of the analysis target character string, calculating the internal score [SIN], and comparing it with the boundary score [S] makes it possible to extract unknown words in a way that more strongly reflects the characteristics of the target character string.

In the embodiment described above, the partial character string extraction unit 301 used m = 1, extracting every one-character substring of the analysis target character string and calculating the perplexity corresponding to each, but m can be set to various values of 1 or more. For example, the value of m can be changed according to the character type (kanji, hiragana, katakana, alphanumeric characters, and so on). Changing m by character type accommodates differences in the amount of information carried by different character types. For example, setting m = 1 for kanji and m = 2 for hiragana, katakana, and alphanumeric characters treats a kanji as carrying twice the information of a hiragana, katakana, or alphanumeric character. The number of characters m extracted at the head, end, left side, and right side of the analysis target character string may also differ from one another. However, when these evaluation values are combined to set a score, using different character counts would combine evaluation values obtained under different conditions, so it is desirable to use the same count.
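One way to choose m by character type is sketched below. The Unicode ranges cover the basic Hiragana, Katakana, and CJK Unified Ideographs blocks only, full-width alphanumerics are not handled, and deciding m from the character at the end being examined is an assumption; the text does not say how m is selected for a mixed-script string.

```python
def char_type(ch):
    """Classify a single character by script (simplified ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


# m = 1 for kanji, m = 2 for kana and alphanumerics, as in the example in the text.
M_BY_TYPE = {"kanji": 1, "hiragana": 2, "katakana": 2, "alnum": 2}


def head_string(target):
    """Head string whose length m depends on the type of the first character."""
    m = M_BY_TYPE.get(char_type(target[0]), 1)
    return target[:m]


def end_string(target):
    """End string whose length m depends on the type of the last character."""
    m = M_BY_TYPE.get(char_type(target[-1]), 1)
    return target[-m:]


print(head_string("下腿骨"), head_string("レントゲン"), end_string("レントゲン"))
```

As the text cautions, values obtained with different m are not directly comparable, so a real system would likely fix m per score rather than mix lengths within one minimum.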

The embodiments described above use the perplexity (average branch number) given by Equation 2 as the evaluation value representing how the appearing character strings branch at the boundaries of the analysis target character string. However, other evaluation values of the diversity of the appearing character strings can be used, for example the entropy (Equation 1), which is used as an evaluation index in language analysis. These expressions can also be multiplied by a constant, normalized, or transformed with a predetermined conversion formula to suit the operation of the system, and the result used as the evaluation value. Alternatively, a diversity evaluation value based on the number of occurrences of the character strings before and after the target character string can be obtained with a predetermined arithmetic expression and used as the evaluation value. Because perplexity is customarily used in the field of language processing as a value indicating diversity, however, it is easier than with entropy to choose appropriate condition settings and adjustments for word extraction, which makes the system easier to operate.
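Equations 1 and 2 are not reproduced in this part of the specification. For orientation only, the standard definitions of entropy and perplexity are given below, on the assumption that the two equations take this usual form; here P(x) denotes the relative frequency with which string x appears adjacent to the target on the side in question.

```latex
H(X) = -\sum_{x} P(x)\,\log_2 P(x), \qquad PP(X) = 2^{H(X)}
```

Under these definitions perplexity is a monotonically increasing function of entropy, so both rank candidates in the same order; what changes is the scale on which thresholds are expressed.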

Finally, an example hardware configuration of an information processing apparatus constituting the language analysis system that executes the processing described above will be described with reference to FIG. 11. A CPU (Central Processing Unit) 501 executes processing corresponding to the OS (Operating System) as well as the perplexity calculation, score calculation, word extraction, generation of a dictionary in which extracted words are registered, morphological analysis using the generated dictionary, and other processing described in the above embodiments. This processing is executed in accordance with computer programs stored in a data storage unit, such as a ROM or hard disk, of each information processing apparatus.

A ROM (Read Only Memory) 502 stores programs, calculation parameters, and the like used by the CPU 501. A RAM (Random Access Memory) 503 stores programs used in execution by the CPU 501, parameters that change as appropriate during that execution, and the like. These are connected to one another by a host bus 504 made up of a CPU bus or the like.

The host bus 504 is connected via a bridge 505 to an external bus 506 such as a PCI (Peripheral Component Interconnect/Interface) bus.

A keyboard 508 and a pointing device 509 are input devices operated by the user. A display 510 consists of a liquid crystal display device, a CRT (Cathode Ray Tube), or the like, and displays various kinds of information as text and images.

An HDD (Hard Disk Drive) 511 contains a hard disk, drives the hard disk, and records or reproduces programs executed by the CPU 501 and information. The hard disk stores, for example, the words extracted by the word extraction described in the above embodiments, dictionaries in which those words are registered, the data processing programs applied in the above embodiments, parameters such as thresholds and scores, and various computer programs.

A drive 512 reads data or programs recorded on a mounted removable recording medium 521, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, and supplies the data or programs to the RAM 503, which is connected via an interface 507, the external bus 506, the bridge 505, and the host bus 504.

A connection port 514 is a port for connecting an externally connected device 522 and has a connection unit such as USB or IEEE 1394. The connection port 514 is connected to the CPU 501 and other components via the interface 507, the external bus 506, the bridge 505, the host bus 504, and so on. A communication unit 515 is connected to a network and communicates with various databases and other information processing apparatuses.

The hardware configuration of the information processing apparatus serving as the language analysis system shown in FIG. 11 is one example of an apparatus built on a PC. The language analysis system of the present invention is not limited to the configuration shown in FIG. 11 and may have any configuration capable of executing the processing described in the above embodiments.

The present invention has been described in detail above with reference to specific embodiments. However, it is obvious that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention. In other words, the present invention has been disclosed by way of example and should not be interpreted restrictively. To determine the gist of the present invention, the claims should be taken into consideration.

The series of processes described in this specification can be executed by hardware, by software, or by a combination of both. When the processes are executed by software, a program recording the processing sequence can be installed in the memory of a computer built into dedicated hardware and executed there, or the program can be installed and executed on a general-purpose computer capable of executing various kinds of processing.

For example, the program can be recorded in advance on a hard disk or a ROM (Read Only Memory) serving as a recording medium. Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a flexible disk, CD-ROM (Compact Disc Read Only Memory), MO (Magneto-Optical) disk, DVD (Digital Versatile Disc), magnetic disk, or semiconductor memory. Such a removable recording medium can be provided as so-called packaged software.

Besides being installed on a computer from a removable recording medium as described above, the program can be transferred to the computer wirelessly from a download site, or by wire over a network such as a LAN (Local Area Network) or the Internet. The computer can receive a program transferred in this way and install it on a recording medium such as a built-in hard disk.

The various processes described in this specification may be executed not only in time series as described but also in parallel or individually, depending on the processing capability of the apparatus executing them or as needed. In this specification, a system is a logical collection of a plurality of apparatuses, and the constituent apparatuses are not necessarily housed in the same casing.

FIG. 1 is a diagram showing a configuration example of a language analysis system according to an embodiment of the present invention.
FIG. 2 is a diagram explaining configuration examples of the text database and the word database used in the language analysis system according to an embodiment of the present invention.
FIG. 3 is a flowchart explaining the sequence of the threshold setting process executed in the language analysis system according to an embodiment of the present invention.
FIG. 4 is a flowchart explaining the sequence of the word extraction process executed in the language analysis system according to an embodiment of the present invention.
FIG. 5 is a diagram explaining an example of the perplexity calculation executed in the language analysis system according to an embodiment of the present invention.
FIG. 6 is a diagram explaining an example of the perplexity calculation executed in the language analysis system according to an embodiment of the present invention.
FIG. 7 is a diagram explaining evaluation data for word extraction results obtained with the language analysis system according to an embodiment of the present invention.
FIG. 8 is a flowchart explaining the sequence of the threshold setting process executed in the language analysis system according to an embodiment of the present invention.
FIG. 9 is a diagram showing a configuration example of a language analysis system according to an embodiment of the present invention.
FIG. 10 is a diagram explaining an example of the perplexity and internal score calculation executed in the language analysis system according to an embodiment of the present invention.
FIG. 11 is a diagram explaining a hardware configuration example of a language analysis system according to an embodiment of the present invention.

Explanation of symbols

100 Language analysis system
101 Text input unit
102 Character string extraction unit
103 Character string filter unit
104 Head/end character string extraction unit
105 Peripheral character string extraction unit
106 Branch number calculation unit (perplexity calculation unit)
107 Word dividing unit
108 Score setting unit
109 Threshold setting unit
110 Word determination unit
121 Text database
122 Word database
301 Partial character string extraction unit
302 Internal score setting unit
501 CPU (Central Processing Unit)
502 ROM (Read Only Memory)
503 RAM (Random Access Memory)
504 Host bus
505 Bridge
506 External bus
507 Interface
508 Keyboard
509 Pointing device
510 Display
511 HDD (Hard Disk Drive)
512 Drive
514 Connection port
515 Communication unit
521 Removable recording medium
522 Externally connected device

Claims

A language analysis system comprising:
a character string extraction unit that extracts, from text data, a set of character strings each having a predetermined number of characters or fewer;
a score setting unit that takes a character string extracted by the character string extraction unit as an analysis target character string and, from the text data in a text database, calculates and sets a score indicating the likelihood that the analysis target character string is a word, based on the perplexity of the character strings appearing at the boundaries of the analysis target character string; and
a word determination unit that determines whether or not the analysis target character string is a word based on the score set by the score setting unit.

The language analysis system according to claim 1, wherein the score setting unit calculates
the perplexity, in the direction outward from the analysis target character string, of the character strings appearing at each of the strings of one or more characters at both ends of the analysis target character string, and
the perplexity, in the direction toward the analysis target character string, of the character strings appearing at each of the strings of one or more characters adjoining both ends of the analysis target character string,
and calculates the score based on the two kinds of calculated perplexities.

The language analysis system according to claim 2, wherein the score setting unit calculates the score based on the minimum value of the two kinds of perplexities.

The language analysis system according to claim 1, further comprising:
a word dividing unit that divides words registered in a word database into units of m characters (where m is a predetermined number, m ≥ 1); and
a threshold setting unit that calculates, for each of the m-character strings divided by the word dividing unit, the perplexity of the character strings appearing at the boundaries of the analysis target character string corresponding to that m-character string, calculates an average value over the calculated perplexities excluding the perplexity of the character strings appearing in the outward direction at the ends of the m-character-unit character strings, and sets the average value as a threshold,
wherein the word determination unit compares the score set by the score setting unit with the threshold set by the threshold setting unit and determines that the analysis target character string is a word according to the comparison result.

The language analysis system according to claim 1, further comprising:
A character string filter unit that executes a filtering process to delete character strings that cannot be words from the set of character strings extracted by the character string extraction unit,
wherein the score setting unit calculates and sets the score using each character string that remains after filtering by the character string filter unit as an analysis target character string.
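
The claim does not state which character strings are deleted, so the following sketch uses a purely hypothetical rule: a candidate is dropped when it is empty or contains whitespace or punctuation. The names `can_be_word` and `filter_candidates` are also assumptions.

```python
import unicodedata


def can_be_word(candidate):
    """Hypothetical rule: reject empty strings and strings with spaces or punctuation."""
    if not candidate:
        return False
    return not any(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in candidate
    )


def filter_candidates(candidates):
    """Keep only the candidates that pass can_be_word, preserving their order."""
    return [c for c in candidates if can_be_word(c)]


print(filter_candidates(["車", "車が", "る。", "a b"]))  # ['車', '車が']
```

Filtering before scoring reduces the number of candidates for which perplexities have to be computed.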

The language analysis system according to claim 1, further comprising:
A head/tail character string extraction unit that extracts the head and tail character strings of each character string extracted by the character string extraction unit,
wherein the score setting unit calculates the perplexity of the character strings appearing at the boundaries of the analysis target character string based on the head and tail character strings extracted by the head/tail character string extraction unit.

The language analysis system according to claim 1, further comprising:
A partial character string extraction unit that extracts partial character strings from each character string extracted by the character string extraction unit; and
An internal score setting unit that calculates, based on the partial character strings extracted by the partial character string extraction unit, the perplexity of the character strings appearing inside the analysis target character string, and sets an internal score based on that perplexity,
wherein the word determination unit determines whether the analysis target character string is a word based on the score set by the score setting unit and the internal score set by the internal score setting unit.
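
A sketch of an internal score under stated assumptions: the partial character strings are taken to be the prefixes and suffixes at each split point inside the candidate, perplexity is 2 raised to the entropy of the adjacent characters, the internal score is the largest such perplexity, and a word needs a high boundary score together with a low internal score. None of these choices, nor the names used, come from the claim itself.

```python
import math
from collections import Counter


def neighbor_perplexity(text, context, side):
    """Perplexity (2**entropy) of the characters after (+1) or before (-1) context."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        pos = start + len(context) if side > 0 else start - 1
        if 0 <= pos < len(text):
            counts[text[pos]] += 1
        start = text.find(context, start + 1)
    total = sum(counts.values())
    if total == 0:
        return None  # no neighbor observed on this side
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def internal_score(text, target):
    """Largest perplexity seen at any split point inside target (assumed rule)."""
    values = []
    for i in range(1, len(target)):
        values.append(neighbor_perplexity(text, target[:i], +1))  # after the prefix
        values.append(neighbor_perplexity(text, target[i:], -1))  # before the suffix
    values = [v for v in values if v is not None]
    return max(values) if values else 0.0


def is_word(score, internal, score_threshold, internal_limit):
    """Assumed rule: high perplexity at the boundary, low perplexity inside."""
    return score > score_threshold and internal <= internal_limit


stable = internal_score("xyxyxy", "xy")       # "y" always follows "x"
unstable = internal_score("abcabdabc", "abc")  # "ab" is followed by "c" or "d"
```

Here `unstable` exceeds `stable` because the corpus contains both "abc" and "abd", so the position after "ab" behaves more like a boundary than like the inside of a word.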

The language analysis system according to claim 1, wherein the language analysis system executes the word extraction process in units of texts stored in a text database.

The language analysis system according to any one of claims 1 to 8, wherein the analysis target character string is registered in a dictionary database for morphological analysis based on a determination result of the word determination unit.

A language analysis method for executing language analysis processing in a language analysis system, the method comprising:
A character string extraction step in which a character string extraction unit extracts, from text data, a set of character strings having a predetermined number of characters or less;
A score setting step in which a score setting unit, through analysis of the text data in a text database and using a character string extracted in the character string extraction step as an analysis target character string, calculates the perplexity of the character strings appearing at the boundaries of the analysis target character string, and calculates and sets, based on the perplexity, a score indicating the likelihood that the analysis target character string is a word; and
A word determination step in which a word determination unit determines whether or not the analysis target character string is a word based on the score set in the score setting step.
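
The three steps of the method can be sketched end to end as follows. This is a simplified illustration under several assumptions: perplexity is 2 raised to the entropy of the adjacent characters, the whole candidate (rather than only its end characters) serves as the context, the score is the smaller of the left and right perplexities, and a score above a caller-supplied threshold marks a word. All function names are hypothetical.

```python
import math
from collections import Counter


def neighbor_perplexity(text, context, side):
    """Perplexity (2**entropy) of the characters after (+1) or before (-1) context."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        pos = start + len(context) if side > 0 else start - 1
        if 0 <= pos < len(text):
            counts[text[pos]] += 1
        start = text.find(context, start + 1)
    total = sum(counts.values())
    if total == 0:
        return None  # no neighbor observed on this side
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def extract_candidates(text, max_len):
    """Step 1: every substring of text that is at most max_len characters long."""
    return {
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def score(text, candidate):
    """Step 2: the smaller of the perplexities at the candidate's two boundaries."""
    sides = (neighbor_perplexity(text, candidate, -1),
             neighbor_perplexity(text, candidate, +1))
    observed = [v for v in sides if v is not None]
    return min(observed) if observed else 0.0


def extract_words(text, max_len, threshold):
    """Step 3: keep the candidates whose score exceeds the threshold."""
    candidates = extract_candidates(text, max_len)
    return sorted(c for c in candidates if score(text, c) > threshold)


print(extract_words("abxabyabz", max_len=2, threshold=1.5))  # ['ab']
```

In the toy corpus only "ab" is both preceded and followed by varying characters, so it is the only candidate whose score clears the threshold.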

A computer program that causes a language analysis system to execute language analysis processing, the computer program causing execution of:
A character string extraction step of causing a character string extraction unit to extract, from text data, a set of character strings having a predetermined number of characters or less;
A score setting step of causing a score setting unit, through analysis of the text data in a text database and using a character string extracted in the character string extraction step as an analysis target character string, to calculate the perplexity of the character strings appearing at the boundaries of the analysis target character string, and to calculate and set, based on the perplexity, a score indicating the likelihood that the analysis target character string is a word; and
A word determination step of causing a word determination unit to determine whether or not the analysis target character string is a word based on the score set in the score setting step.