
JP2008146218A - Language analysis system, language analysis method and computer program

Info

Publication number: JP2008146218A
Application number: JP2006330637A
Authority: JP
Inventors: Hiroshi Masuichi (増市 博)
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2006-12-07
Filing date: 2006-12-07
Publication date: 2008-06-26
Anticipated expiration: 2026-12-07
Also published as: JP5125083B2

Abstract

Problem to be solved: To generate a dictionary for morphological analysis that enables accurate morphological analysis by correctly segmenting technical terms, for which division points are difficult to set, into morphemes.

Solution: From the entries of a bilingual dictionary of Japanese and a foreign language, translation pairs registered as one word to multiple words (rather than one word to one word) are extracted. The Japanese side of each extracted pair is subjected to morphological analysis, the foreign-language word corresponding to each resulting segment word or segment word string is identified, and the segment word or segment word string corresponding to that foreign-language word is registered as a morpheme in the dictionary for morphological analysis. Morphological analysis is then performed based on the registered morpheme information.

Copyright: (C) 2008, JPO & INPIT

Description

The present invention relates to a language analysis system, a language analysis method, and a computer program. More specifically, it relates to a language analysis system, a language analysis method, and a computer program that accurately divide terms used in highly specialized fields, such as the medical field, into segment words (morphemes) that serve as language units for applications such as data retrieval.

Extracting terms for use in data processing from natural-language documents, for example search keys for database searches or index entries for a term dictionary, is a technique required in many data processing fields. A language unit applied to data processing, such as a search key in search processing, is called a morpheme.

The extraction of morphemes from natural-language documents has long been studied. For an ordinary sentence such as [車が道路を走る] ("a car runs on the road"), applying a general morphological analysis system makes it possible to extract morphemes such as [車] (car), [道路] (road), and [走る] (run). A morphological analysis system is known as a system that applies predetermined morphological analysis rules to segment a sentence into morphemes, the smallest semantic units, and to identify the part of speech of each.

However, it is difficult to divide technical terms from highly specialized fields such as medicine into appropriate morphemes. Take, for example, the specific disease name
「潜在自己免疫性甲状腺炎」 (latent autoimmune thyroiditis).
When a general morphological analysis system is applied to a document containing this disease name, the following morphemes are extracted. Each item shows the data extracted from 「潜在自己免疫性甲状腺炎」 in the form [morpheme, part of speech].
(1) [潜在 (latent), noun]
(2) [自己 (self), noun]
(3) [免疫 (immunity), noun]
(4) [性 (the suffix "-ity"), suffix]
(5) [甲状腺 (thyroid), noun]
(6) [炎 (the suffix "-itis"), suffix]

Once the minimal-unit morphemes (1) to (6) are obtained, the morphemes to be applied to data processing are set by choosing appropriate morpheme boundaries: besides using (1) to (6) as individual search keys, a run of consecutive morphemes such as (1)+(2), (1) to (4), or (2) to (5) can be treated as a single search key or index entry. Items (1) to (6) are the minimal-unit morphemes obtained by applying the morphological analysis system; data formed by combining several of these minimal units can also be used as search keys or index data, and such combined data are also called morphemes.
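To make the size of the candidate space concrete, the following minimal sketch enumerates every run of consecutive minimal-unit morphemes for the example above. The function name `contiguous_spans` and the tuple layout are illustrative assumptions, not part of the patent.

```python
# Minimal-unit morphemes (1) to (6) from the example in the text.
MORPHEMES = ["潜在", "自己", "免疫", "性", "甲状腺", "炎"]


def contiguous_spans(units):
    """Return (first, last, text) for every run of consecutive units.

    `first` and `last` are 1-based indices, matching the (1) to (6)
    numbering used in the text.
    """
    spans = []
    n = len(units)
    for start in range(n):
        for end in range(start + 1, n + 1):
            spans.append((start + 1, end, "".join(units[start:end])))
    return spans


SPANS = contiguous_spans(MORPHEMES)
for first, last, text in SPANS:
    print(f"({first})-({last}) {text}")
```

Six minimal units give 21 candidate spans, only a few of which are useful search keys, which is why the choice of boundaries matters.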

In practice, however,
(2) to (6) 「自己免疫性甲状腺炎」 (autoimmune thyroiditis) and
(5) to (6) 「甲状腺炎」 (thyroiditis)
are terms that are also used on their own in medical texts and are morphemes worth using as, for example, search keys, whereas
(3) to (6) 「免疫性甲状腺炎」 ("immune thyroiditis")
cannot stand on its own. Furthermore,
(2) to (4) 「自己免疫性」 (autoimmune)
almost never appears on its own in medical texts, but apart from
(2) to (6) 「自己免疫性甲状腺炎」 (autoimmune thyroiditis)
it is used widely and frequently as part of compound nouns such as
「自己免疫性肝炎」 (autoimmune hepatitis),
「自己免疫性溶血性貧血」 (autoimmune hemolytic anemia), and
「自己免疫性慢性甲状腺炎」 (autoimmune chronic thyroiditis),
and can therefore be recognized as an independent morpheme.

Accordingly, it is considered appropriate to divide 「潜在自己免疫性甲状腺炎」 into the following three morphemes:
(1) 「潜在」 (latent)
(2) to (4) 「自己免疫性」 (autoimmune)
(5) to (6) 「甲状腺炎」 (thyroiditis)
For example, a database search that uses these three morphemes as search keys can efficiently retrieve useful information with little noise, so these morphemes can be regarded as the most suitable terms for data processing.

However, without specialized medical knowledge, it is difficult to obtain such appropriate morpheme boundaries from the above morphological analysis result.

A commonly used way to address this problem is to add a medical technical term dictionary (technical term list) to the dictionary for morphological analysis (morpheme list). Frequently occurring disease and body-part names such as 「甲状腺炎」 (thyroiditis) are very likely to be included in a medical technical term dictionary, so this approach can correctly delimit 「甲状腺炎」 as a single morpheme. By contrast, morphemes that rarely appear on their own in medical texts, such as 「自己免疫性」 (autoimmune), are often absent from technical term dictionaries and end up split into 「自己」 and 「免疫性」. (「免疫」 and 「性」 are merged into one morpheme because 「性」 is a suffix.)
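The behavior described above can be reproduced with a small sketch. It assumes, purely for illustration, that the term list is applied by greedy longest match over the minimal units and that a suffix is attached to the preceding unit; the text does not specify how a real analyzer combines its dictionary with the term list, and the name `merge_with_term_list` is hypothetical.

```python
# Minimal-unit analysis result from the text: (surface form, part of speech).
TOKENS = [
    ("潜在", "noun"), ("自己", "noun"), ("免疫", "noun"),
    ("性", "suffix"), ("甲状腺", "noun"), ("炎", "suffix"),
]


def merge_with_term_list(tokens, terms):
    """Merge minimal units using a technical term list (illustrative rules).

    1. Prefer the longest run of two or more units that spells a listed term.
    2. Otherwise attach a suffix to the unit before it.
    3. Otherwise keep the unit as it is.
    """
    out = []
    i, n = 0, len(tokens)
    while i < n:
        match_end = None
        for j in range(n, i + 1, -1):  # longest candidate first
            if "".join(s for s, _ in tokens[i:j]) in terms:
                match_end = j
                break
        if match_end is not None:
            out.append("".join(s for s, _ in tokens[i:match_end]))
            i = match_end
        elif tokens[i][1] == "suffix" and out:
            out[-1] += tokens[i][0]
            i += 1
        else:
            out.append(tokens[i][0])
            i += 1
    return out


# A term list that contains "thyroiditis" but not "autoimmune".
RESULT = merge_with_term_list(TOKENS, {"甲状腺炎"})
print(RESULT)
```

With this term list, 「甲状腺炎」 is kept whole while 「自己」 and 「免疫性」 remain separate, which is the shortcoming the paragraph describes.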

If 「自己免疫性甲状腺炎」 (autoimmune thyroiditis) is included in the technical term dictionary, it can be set as a single morpheme. However, terms such as 「自己免疫性慢性甲状腺炎」 (autoimmune chronic thyroiditis), which are used as disease names but occur very rarely, are not included in the technical term dictionary, and, as above, 「自己」 and 「免疫性」 cannot be correctly merged into one morpheme.

As prior art for identifying morphemes that are not included in a technical term dictionary, such as 「自己免疫性」, Patent Document 1 (Japanese Patent Laid-Open No. 6-19968) discloses an automatic technical term extraction technique. This technique assigns each morpheme in an ordinary morphological analysis result a score calculated from its part of speech, character type, frequency of occurrence, and the like, and decides on the basis of that score whether to merge multiple morphemes into a technical term. However, whether something is a technical term cannot be determined from part of speech, character type, and frequency alone, and the extraction accuracy of this technique is not sufficient for practical use.
Patent Document 1: Japanese Patent Laid-Open No. 6-19968

The present invention has been made in view of the above problems. Its object is to provide a language analysis system, a language analysis method, and a computer program that analyze technical terms in highly specialized fields such as the medical field and divide them into appropriate language units to be applied as, for example, search keys in search processing.

A first aspect of the present invention is a language analysis system comprising:
bilingual dictionary storage means storing a bilingual dictionary of Japanese and a foreign language other than Japanese;
one-to-many word extraction means that extracts, from the entries of the bilingual dictionary, translation pairs in which Japanese and the foreign language correspond one word to multiple words rather than one word to one word, generates a plurality of segment words by division processing based on morphological analysis of the Japanese contained in each extracted translation pair, and analyzes, based on the entries of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; and
morphological analysis dictionary generation means that registers, as a morpheme in a morphological analysis dictionary, the Japanese corresponding to the segment word or segment word string determined by the one-to-many word extraction means to correspond to the foreign-language word.
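The following is a minimal sketch of how the extraction and registration steps of this aspect could fit together, not the patent's actual procedure. It assumes that the foreign-language side is space-separated English, that Japanese and English components appear in the same order, and that the correspondence is decided by counting how many groups match known one-word-to-one-word entries. All function names, the pre-segmented `SEGMENTS` table (standing in for a real morphological analyzer), and the spacing of the English terms are illustrative assumptions.

```python
from itertools import combinations


def split_into_groups(units, k):
    """Yield every way to cut `units` into k contiguous, non-empty groups."""
    n = len(units)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield ["".join(units[a:b]) for a, b in zip(bounds, bounds[1:])]


def align(units, foreign_words, lexicon):
    """Pick the grouping best supported by known 1:1 pairs, or None."""
    best, best_hits = None, 0
    for groups in split_into_groups(units, len(foreign_words)):
        hits = 0
        for group, word in zip(groups, foreign_words):
            known = lexicon.get(group)
            if known == word:
                hits += 1
            elif known is not None:  # contradicts a known translation
                hits = -1
                break
        if hits > best_hits:
            best, best_hits = groups, hits
    return best


def new_morphemes(pairs, segment):
    """Collect segment word strings found from one-to-many pairs."""
    lexicon = {ja: en for ja, en in pairs if len(en.split()) == 1}
    found = set()
    for ja, en in pairs:
        words = en.split()
        if len(words) > 1:  # one Japanese word to multiple foreign words
            groups = align(segment(ja), words, lexicon)
            if groups:
                found.update(groups)
    return found


PAIRS = [
    ("甲状腺炎", "thyroiditis"),
    ("脳脊髄炎", "encephalomyelitis"),
    ("実験的", "experimental"),
    ("自己免疫性甲状腺炎", "autoimmune thyroiditis"),
    ("実験的自己免疫性脳脊髄炎", "experimental autoimmune encephalomyelitis"),
]
# Hypothetical analyzer output for the two one-to-many entries.
SEGMENTS = {
    "自己免疫性甲状腺炎": ["自己", "免疫", "性", "甲状腺", "炎"],
    "実験的自己免疫性脳脊髄炎": ["実験", "的", "自己", "免疫", "性", "脳", "脊髄", "炎"],
}
FOUND = new_morphemes(PAIRS, SEGMENTS.__getitem__)
print(sorted(FOUND))
```

In this sketch 「自己免疫性」 is recovered as a morpheme even though it has no dictionary entry of its own, because the groups around it are pinned down by existing one-to-one entries.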

Furthermore, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises morphological analysis means that performs morphological analysis using the morphological analysis dictionary generated by the morphological analysis dictionary generation means.

Furthermore, in one embodiment of the language analysis system of the present invention, for translation pairs in the bilingual dictionary entries in which Japanese and the foreign language correspond one word to one word, the one-to-many word extraction means outputs the Japanese word contained in the translation pair to the morphological analysis dictionary generation means, and the morphological analysis dictionary generation means registers the Japanese word received from the one-to-many word extraction means in the morphological analysis dictionary.

Furthermore, in one embodiment of the language analysis system of the present invention, the bilingual dictionary storage means stores a general term bilingual dictionary, which holds Japanese and foreign-language translations of general terms, and a technical term bilingual dictionary, which holds Japanese and foreign-language translations of technical terms. The one-to-many word extraction means extracts one-word-to-multiple-word translation pairs from the data registered in the technical term bilingual dictionary, generates a plurality of segment words by dividing the Japanese contained in each extracted pair through morphological analysis, and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the entries of both the technical term bilingual dictionary and the general term bilingual dictionary.

Furthermore, in one embodiment of the language analysis system of the present invention, the technical term bilingual dictionary is a bilingual dictionary storing technical terms of the medical field, and the one-to-many word extraction means generates a plurality of segment words by morphological analysis of Japanese medical terms and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the entries of both the technical term bilingual dictionary and the general term bilingual dictionary.

Furthermore, in one embodiment of the language analysis system of the present invention, the bilingual dictionary storage means stores a bilingual dictionary of Japanese and English, and the one-to-many word extraction means extracts, from the data registered in the bilingual dictionary, translation pairs in which Japanese and English correspond one word to multiple words rather than one word to one word, generates a plurality of segment words by dividing the Japanese contained in each extracted pair through morphological analysis, and analyzes the English word corresponding to each generated segment word or segment word string based on the entries of the bilingual dictionary.

Furthermore, in one embodiment of the language analysis system of the present invention, the one-to-many word extraction means generates a plurality of segment words by dividing Japanese through morphological analysis, merges the generated segment words into segment word strings by applying the part-of-speech information obtained for each segment word in the morphological analysis, and analyzes the foreign-language word corresponding to each segment word string.

Furthermore, in one embodiment of the language analysis system of the present invention, the one-to-many word extraction means generates a plurality of segment words by dividing Japanese through morphological analysis and, when generating segment word strings from the generated segment words, analyzes how frequently each segment word string appears in the bilingual dictionary or in another database, and analyzes the corresponding foreign-language word for segment word strings with a high frequency of appearance.
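As a rough illustration of the frequency analysis in this embodiment, the sketch below counts how many dictionary headwords contain each multi-unit candidate string. The headwords are the compound nouns quoted earlier in this document; the function name, the substring-based counting, and the threshold of 2 are assumptions made for the example.

```python
HEADWORDS = [
    "自己免疫性甲状腺炎",
    "自己免疫性肝炎",
    "自己免疫性溶血性貧血",
    "自己免疫性慢性甲状腺炎",
    "甲状腺炎",
]
UNITS = ["自己", "免疫", "性", "甲状腺", "炎"]


def candidate_frequencies(units, headwords):
    """Count, for each run of two or more units, the headwords containing it."""
    counts = {}
    n = len(units)
    for i in range(n):
        for j in range(i + 2, n + 1):  # at least two units per candidate
            candidate = "".join(units[i:j])
            counts[candidate] = sum(candidate in h for h in headwords)
    return counts


COUNTS = candidate_frequencies(UNITS, HEADWORDS)
# Keep only candidates seen in at least two headwords (threshold is arbitrary).
FREQUENT = {c for c, k in COUNTS.items() if k >= 2}
print(sorted(FREQUENT))
```

Frequency on its own only prunes the candidate list: 「免疫性甲状腺炎」 drops out, but several fragments survive alongside 「自己免疫性」, so the correspondence with foreign-language words is still needed to select among them.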

Furthermore, in one embodiment of the language analysis system of the present invention, the language analysis system further comprises foreign language corpus storage means storing text data in the foreign language. When analyzing the foreign-language word corresponding to a segment word or segment word string generated by morphological analysis of Japanese, the one-to-many word extraction means refers to the foreign-language text in the corpus storage means, identifies word strings that appear frequently in that text, and performs the analysis treating each such word string as a single foreign-language word.
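One simple way to picture this embodiment is to count adjacent word pairs in the foreign-language corpus and treat pairs above a threshold as a single unit. The sketch below does exactly that; the tiny corpus, the restriction to two-word strings, the threshold, and the function names are all illustrative assumptions rather than details given in the text.

```python
from collections import Counter

# Hypothetical foreign-language corpus (lowercase, no punctuation).
CORPUS = [
    "diabetes mellitus is a chronic disease",
    "patients with diabetes mellitus were enrolled",
    "the thyroid gland was examined",
]


def frequent_bigrams(sentences, min_count):
    """Return adjacent word pairs occurring at least `min_count` times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, c in counts.items() if c >= min_count}


def foreign_units(term, collocations):
    """Split a foreign term into words, keeping frequent pairs together."""
    words = term.lower().split()
    units, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in collocations:
            units.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            units.append(words[i])
            i += 1
    return units


COLLOCATIONS = frequent_bigrams(CORPUS, min_count=2)
UNITS_OUT = foreign_units("latent diabetes mellitus", COLLOCATIONS)
print(UNITS_OUT)
```

Treating "diabetes mellitus" as one unit means a Japanese segment word string only has to be matched against two foreign-language units instead of three.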

Furthermore, a second aspect of the present invention is a language analysis method for performing language analysis processing in a language analysis system, the method comprising:
a one-to-many word extraction step in which one-to-many word extraction means refers to a bilingual dictionary of Japanese and a foreign language other than Japanese, extracts from the entries of the bilingual dictionary translation pairs in which Japanese and the foreign language correspond one word to multiple words rather than one word to one word, generates a plurality of segment words by division processing based on morphological analysis of the Japanese contained in each extracted translation pair, and analyzes, based on the entries of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; and
a morphological analysis dictionary generation step in which morphological analysis dictionary generation means registers, as a morpheme in a morphological analysis dictionary, the Japanese corresponding to the segment word or segment word string determined in the one-to-many word extraction step to correspond to the foreign-language word.

Furthermore, in one embodiment of the language analysis method of the present invention, the language analysis method further comprises a morphological analysis step in which morphological analysis means performs morphological analysis using the morphological analysis dictionary generated in the morphological analysis dictionary generation step.

Furthermore, in one embodiment of the language analysis method of the present invention, the one-to-many word extraction step includes a step of outputting, for translation pairs in the bilingual dictionary entries in which Japanese and the foreign language correspond one word to one word, the Japanese word contained in the translation pair to the morphological analysis dictionary generation means, and the morphological analysis dictionary generation step registers the Japanese word received from the one-to-many word extraction step in the morphological analysis dictionary.

Furthermore, in one embodiment of the language analysis method of the present invention, the bilingual dictionary includes a general term bilingual dictionary, which holds Japanese and foreign-language translations of general terms, and a technical term bilingual dictionary, which holds Japanese and foreign-language translations of technical terms. The one-to-many word extraction step extracts one-word-to-multiple-word translation pairs from the data registered in the technical term bilingual dictionary, generates a plurality of segment words by dividing the Japanese contained in each extracted pair through morphological analysis, and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the entries of both the technical term bilingual dictionary and the general term bilingual dictionary.

Furthermore, in one embodiment of the language analysis method of the present invention, the technical term bilingual dictionary is a bilingual dictionary storing technical terms of the medical field, and the one-to-many word extraction step generates a plurality of segment words by morphological analysis of Japanese medical terms and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the entries of both the technical term bilingual dictionary and the general term bilingual dictionary.

Furthermore, in one embodiment of the language analysis method of the present invention, the bilingual dictionary is a bilingual dictionary of Japanese and English, and the one-to-many word extraction step extracts, from the data registered in the bilingual dictionary, translation pairs in which Japanese and English correspond one word to multiple words rather than one word to one word, generates a plurality of segment words by dividing the Japanese contained in each extracted pair through morphological analysis, and analyzes the English word corresponding to each generated segment word or segment word string based on the entries of the bilingual dictionary.

Furthermore, in one embodiment of the language analysis method of the present invention, the one-to-many word extraction step generates a plurality of segment words by dividing Japanese through morphological analysis, merges the generated segment words into segment word strings by applying the part-of-speech information obtained for each segment word in the morphological analysis, and analyzes the foreign-language word corresponding to each segment word string.

Furthermore, in one embodiment of the language analysis method of the present invention, the one-to-many word extraction step generates a plurality of segment words by dividing Japanese through morphological analysis and, when generating segment word strings from the generated segment words, analyzes how frequently each segment word string appears in the bilingual dictionary or in another database, and analyzes the corresponding foreign-language word for segment word strings with a high frequency of appearance.

Furthermore, in one embodiment of the language analysis method of the present invention, when analyzing the foreign-language word corresponding to a segment word or segment word string generated by morphological analysis of Japanese, the one-to-many word extraction step refers to foreign-language text stored in foreign language corpus storage means holding text data in the foreign language, identifies word strings that appear frequently in that text, and performs the analysis treating each such word string as a single foreign-language word.

Furthermore, a third aspect of the present invention is a computer program that causes a language analysis system to perform language analysis processing, the program comprising:
a one-to-many word extraction step of causing one-to-many word extraction means to refer to a bilingual dictionary of Japanese and a foreign language other than Japanese, to extract from the entries of the bilingual dictionary translation pairs in which Japanese and the foreign language correspond one word to multiple words rather than one word to one word, to generate a plurality of segment words by division processing based on morphological analysis of the Japanese contained in each extracted translation pair, and to analyze, based on the entries of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; and
a morphological analysis dictionary generation step of causing morphological analysis dictionary generation means to register, as a morpheme in a morphological analysis dictionary, the Japanese corresponding to the segment word or segment word string determined in the one-to-many word extraction step to correspond to the foreign-language word.

The computer program of the present invention can be provided, for example, to a computer system capable of executing various program codes through a storage medium or communication medium that supplies it in computer-readable form, such as a recording medium like a CD, FD, or MO, or a communication medium like a network. Providing such a program in computer-readable form allows processing corresponding to the program to be carried out on the computer system.

Further objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments described below and the accompanying drawings. In this specification, a system is a logical collection of multiple devices, and the constituent devices are not necessarily housed in the same casing.

According to the configuration of the present invention, translation pairs registered as one word to multiple words, rather than one word to one word, are extracted from the entries of a bilingual dictionary of Japanese and a foreign language; the Japanese contained in each extracted pair is morphologically analyzed to generate a plurality of segment words; the foreign-language word corresponding to each generated segment word or segment word string is obtained based on the entries of the bilingual dictionary; the segment word or segment word string corresponding to the obtained foreign-language word is registered as a morpheme in the morphological analysis dictionary; and morphological analysis based on the registered morpheme information can then be performed. With this configuration, morpheme registration information based on correct segmentation can be generated and used even for terms in specialized fields such as medicine, and accurate morphological analysis is achieved.

The language analysis system, language analysis method, and computer program according to embodiments of the present invention are described in detail below with reference to the drawings.

[1. Basic Embodiment]
The configuration and processing of a language analysis system according to one embodiment of the present invention are described with reference to FIG. 1. As shown in FIG. 1, the language analysis system 100 according to one embodiment of the present invention includes general term bilingual dictionary storage means 101, technical term bilingual dictionary storage means 102, one-to-many word extraction means 103, morphological analysis dictionary generation means 104, and morphological analysis means 105. The one-to-many word extraction means 103 includes morphological analysis means 111. The morphological analysis dictionary generation means 104 generates and updates a morphological analysis dictionary 112.

The configuration and processing of each unit are described below with specific examples. The language analysis system of the present invention analyzes documents that contain technical terms of a specialized field. The following embodiments use the medical field as an example, but the present invention is not limited to medicine and can be applied to the analysis of technical terms in various other specialized fields such as economics, architecture, and technology.

The units constituting the language analysis system 100 shown in FIG. 1 are described below.
[General-term bilingual dictionary storage unit]
The general-term bilingual dictionary storage unit 101 stores a plurality of Japanese-English translation pairs of general words and is a dictionary database of [Japanese - English] pairs. It is configured as a dictionary database storing a plurality of Japanese-English word pairs of general words such as:
"潜在 - latent"
"実験的 - experimental"

[Technical-term bilingual dictionary storage unit]
The technical-term bilingual dictionary storage unit 102 is a dictionary database that stores a plurality of Japanese-English translation pairs of technical terms in the medical field. It is configured as a dictionary database storing a plurality of Japanese-English pairs of medical technical terms such as:
"甲状腺炎 - thyroiditis"
"脳脊髄炎 - encephalomyelitis"
"自己免疫性甲状腺炎 - autoimmune thyroiditis"
"実験的自己免疫性脳脊髄炎 - experimental autoimmune encephalomyelitis"

[One-to-many word extraction unit]
The one-to-many word extraction unit 103 contains the morphological analysis unit 111. It applies morphological analysis to the Japanese technical terms extracted from the Japanese-English translation pairs in the technical-term bilingual dictionary storage unit 102 and divides each Japanese term into a plurality of segment words (morphemes). It then searches the registered data of the bilingual dictionaries stored in the general-term bilingual dictionary storage unit 101 and the technical-term bilingual dictionary storage unit 102 for the foreign-language word corresponding to each generated segment word (morpheme) or segment word string (morpheme string). The Japanese text of any segment word or segment word string determined to correspond to a single foreign-language word is passed to the morphological analysis dictionary generation unit 104 as data to be registered in the morphological analysis dictionary as one morpheme.

However, some pairs extracted from the technical-term bilingual dictionary storage unit 102 are [Japanese - English] pairs of one word to one word, for example:
"甲状腺炎 - thyroiditis"
"脳脊髄炎 - encephalomyelitis"
For a Japanese technical term registered in such a one-word-to-one-word pair, no morphological analysis is performed. The Japanese technical term extracted from the pair is output unchanged to the morphological analysis dictionary generation unit 104, which registers each such Japanese word received from the one-to-many word extraction unit 103 in the morphological analysis dictionary 112.

On the other hand, from the [Japanese - English] pairs in the technical-term bilingual dictionary storage unit 102, the one-to-many word extraction unit 103 extracts the pairs that are one word to multiple words rather than one word to one word. It divides the Japanese term in each extracted pair into a plurality of segment words by morphological analysis and analyzes, based on the registered data of the bilingual dictionaries, which foreign-language word corresponds to each generated segment word or segment word string. In other words, for a pair in which one Japanese word is matched with an English expression consisting of several words, the Japanese technical term is morphologically analyzed by the morphological analysis unit 111. This analysis is performed using an existing morphological analysis dictionary.
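The distinction between one-word-to-one-word pairs and one-word-to-many-words pairs can be sketched as follows. This is only an illustrative sketch: the `TECHNICAL_PAIRS` mapping and the `split_pairs` function are hypothetical names, and counting English words by splitting on whitespace is an assumption, not something the description specifies.

```python
# Hypothetical in-memory stand-in for the technical-term bilingual dictionary (unit 102).
TECHNICAL_PAIRS = {
    "甲状腺炎": "thyroiditis",
    "脳脊髄炎": "encephalomyelitis",
    "自己免疫性甲状腺炎": "autoimmune thyroiditis",
    "実験的自己免疫性脳脊髄炎": "experimental autoimmune encephalomyelitis",
}


def split_pairs(pairs):
    """Separate one-to-one pairs from one-to-many pairs.

    Assumption: the English side is a space-delimited string, so the
    number of English words is the number of whitespace-separated tokens.
    """
    one_to_one, one_to_many = {}, {}
    for japanese, english in pairs.items():
        words = english.split()
        if len(words) == 1:
            # Registered as-is; no morphological analysis is needed.
            one_to_one[japanese] = english
        else:
            # Candidate for morphological analysis and alignment.
            one_to_many[japanese] = words
    return one_to_one, one_to_many


one_to_one, one_to_many = split_pairs(TECHNICAL_PAIRS)
```

Under these assumptions, 甲状腺炎 and 脳脊髄炎 fall into `one_to_one`, and the two longer terms fall into `one_to_many` together with their English word lists.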

A specific example follows. Among the [Japanese - English] pairs extracted from the technical-term bilingual dictionary storage unit 102, the following is a pair in which one Japanese word is matched with an English expression consisting of several words:
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
The processing of the Japanese technical term in this pair,
"自己免疫性甲状腺炎"
is described below.

In the one-to-many word extraction unit 103, the morphological analysis unit 111 applies morphological analysis to
"自己免疫性甲状腺炎"
As described above, morphological analysis applies predetermined rules to divide text into segment words (morphemes), the smallest meaningful units, and assigns a part of speech to each. The analysis yields the following result, shown in the form [morpheme (segment word), part of speech]:
(1) [自己 (self), noun]
(2) [免疫 (immunity), noun]
(3) [性 (nominalizing suffix), suffix]
(4) [甲状腺 (thyroid gland), noun]
(5) [炎 (inflammation, "-itis"), suffix]

From these minimum-unit morphemes, the morphological analysis unit 111 then sets the segment words (morphemes) that serve as useful data units in data processing. For example, under a rule that merges a noun and a suffix that directly follows it into one morpheme, (2) + (3) and (4) + (5) are merged, and "自己免疫性甲状腺炎" is divided into three segment words (morphemes):
(a) "自己"
(b) "免疫性"
(c) "甲状腺炎"
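The noun-plus-suffix merging rule can be illustrated with a short sketch. The token list is written by hand to mirror the analysis result (1) to (5) above, since the description does not name a specific analyzer; `merge_suffixes` and the part-of-speech labels `"noun"` and `"suffix"` are hypothetical.

```python
def merge_suffixes(tokens):
    """Attach each suffix to the noun that immediately precedes it.

    `tokens` is a list of (surface, part_of_speech) tuples in text order.
    Returns the list of merged surface strings.
    """
    merged = []
    prev_pos = None
    for surface, pos in tokens:
        if pos == "suffix" and prev_pos == "noun":
            merged[-1] += surface  # glue the suffix onto the preceding noun
        else:
            merged.append(surface)
        prev_pos = pos
    return merged


# Hand-written stand-in for the minimum-unit analysis of 自己免疫性甲状腺炎.
TOKENS = [
    ("自己", "noun"),
    ("免疫", "noun"),
    ("性", "suffix"),
    ("甲状腺", "noun"),
    ("炎", "suffix"),
]

segments = merge_suffixes(TOKENS)
```

With this input, `segments` holds the three segment words 自己, 免疫性, and 甲状腺炎, matching (a) to (c).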

The merging described above is an example based on part-of-speech analysis. Alternatively, the stored documents of a document database in the specialized field may be analyzed to count how often each sequence of minimum morphemes (for example, morphemes (1) to (5) above) occurs, and any morpheme sequence whose count is at or above a preset threshold may be determined to be one morpheme.
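The frequency-based alternative might look like the following sketch. The description only states that sequences occurring at least a threshold number of times are treated as one morpheme; the maximum sequence length `max_len`, the default threshold, and the toy corpus (including the term 自己免疫性疾患) are assumptions made for illustration.

```python
from collections import Counter


def frequent_sequences(segmented_docs, max_len=3, threshold=2):
    """Return morpheme sequences (joined as strings) seen at least `threshold` times.

    `segmented_docs` is a list of documents, each already split into
    minimum-unit morphemes. Sequences of length 2..max_len are counted.
    """
    counts = Counter()
    for morphemes in segmented_docs:
        for n in range(2, max_len + 1):
            for i in range(len(morphemes) - n + 1):
                counts[tuple(morphemes[i:i + n])] += 1
    return {"".join(seq) for seq, count in counts.items() if count >= threshold}


# Toy corpus (hypothetical): two terms that share the sequence 自己 + 免疫 + 性.
DOCS = [
    ["自己", "免疫", "性", "疾患"],
    ["自己", "免疫", "性", "甲状腺", "炎"],
]

candidates = frequent_sequences(DOCS)
```

In this toy corpus 自己免疫性 occurs twice and therefore qualifies, while 甲状腺炎 occurs once and does not. A real corpus would need a threshold tuned to its size.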

Next, the one-to-many word extraction unit 103 uses the bilingual dictionaries in the general-term bilingual dictionary storage unit 101 and the technical-term bilingual dictionary storage unit 102 to identify the English word corresponding to each of the three segment words (morphemes) (a) to (c):
(a) "自己"
(b) "免疫性"
(c) "甲状腺炎"

As described above, the technical-term bilingual dictionary storage unit 102 contains the following [Japanese - English] pair:
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
The one-to-many word extraction unit 103 first checks, for each English word in this pair, whether that word is registered on its own in a Japanese-English pair recorded in the general-term bilingual dictionary storage unit 101 or the technical-term bilingual dictionary storage unit 102. In this example, it determines for each of "autoimmune" and "thyroiditis" whether the single English word is registered as a translation pair.

Suppose that, of "autoimmune" and "thyroiditis", only "thyroiditis" is found in the technical-term bilingual dictionary storage unit 102, registered as
"甲状腺炎 - thyroiditis"
The one-to-many word extraction unit 103 can then identify that, of the two English words in the pair
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
the word "thyroiditis" corresponds to "甲状腺炎".

Furthermore, from this correspondence,
"甲状腺炎 - thyroiditis"
and the pair registered in the technical-term bilingual dictionary storage unit 102,
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
the one-to-many word extraction unit 103 determines that the two remaining Japanese morphemes "自己" and "免疫性" together correspond to the single English word
"autoimmune"

As a result, for the pair registered in the technical-term bilingual dictionary storage unit 102 in which one Japanese word is matched with several English words,
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
the one-to-many word extraction unit 103 determines the Japanese text corresponding to each English word:
"甲状腺炎 - thyroiditis"
"自己免疫性 - autoimmune"
The one-to-many word extraction unit 103 thus outputs each segment word (morpheme) or segment word string (set of morphemes) that corresponds to one English word to the morphological analysis dictionary generation unit 104 as one morpheme. The morphological analysis dictionary generation unit 104 registers "甲状腺炎" and "自己免疫性" as morphemes in the morphological analysis dictionary 112.
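The lookup-and-elimination procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `find_run` and `align` are hypothetical names, the dictionary is a plain English-to-Japanese mapping, and elimination is applied only in the simple case where exactly one English word is left over and the unmatched morphemes are contiguous.

```python
def find_run(morphemes, target, used):
    """Return (start, end) of an unused contiguous run spelling `target`, else None."""
    for start in range(len(morphemes)):
        text = ""
        for end in range(start, len(morphemes)):
            if used[end]:
                break
            text += morphemes[end]
            if text == target:
                return start, end + 1
            if not target.startswith(text):
                break
    return None


def align(morphemes, english_words, en_to_ja):
    """Map each English word to the Japanese morphemes it corresponds to."""
    used = [False] * len(morphemes)
    alignment, unresolved = {}, []
    # Step 1: direct lookup of each English word in the bilingual dictionaries.
    for word in english_words:
        japanese = en_to_ja.get(word)
        run = find_run(morphemes, japanese, used) if japanese else None
        if run is None:
            unresolved.append(word)
            continue
        start, end = run
        for i in range(start, end):
            used[i] = True
        alignment[word] = morphemes[start:end]
    # Step 2: elimination. Assumption: handled only when one English word
    # remains and the leftover morphemes form a single contiguous run.
    leftover = [i for i, flag in enumerate(used) if not flag]
    if len(unresolved) == 1 and leftover:
        if leftover == list(range(leftover[0], leftover[-1] + 1)):
            alignment[unresolved[0]] = [morphemes[i] for i in leftover]
    return alignment


# Hypothetical merged view of the single-word entries in units 101 and 102.
EN_TO_JA = {
    "latent": "潜在",
    "experimental": "実験的",
    "thyroiditis": "甲状腺炎",
    "encephalomyelitis": "脳脊髄炎",
}

result = align(["自己", "免疫性", "甲状腺炎"], ["autoimmune", "thyroiditis"], EN_TO_JA)
```

Here "thyroiditis" is resolved by direct lookup, and "autoimmune" is assigned to the remaining morphemes 自己 and 免疫性 by elimination. Cases with several unresolved English words or non-contiguous leftovers are deliberately left unaligned in this sketch.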

Another processing example of the one-to-many word extraction unit 103 follows. Among the [Japanese - English] pairs extracted from the technical-term bilingual dictionary storage unit 102, another pair in which one Japanese word is matched with an English expression consisting of several words is:
[実験的自己免疫性脳脊髄炎 - experimental autoimmune encephalomyelitis]
The processing of this pair is described below.

In the one-to-many word extraction unit 103, the morphological analysis unit 111 applies morphological analysis to
"実験的自己免疫性脳脊髄炎"
The analysis yields the following result, shown in the form [morpheme, part of speech]:
(1) [実験 (experiment), noun]
(2) [的 (adjectival suffix), suffix]
(3) [自己 (self), noun]
(4) [免疫 (immunity), noun]
(5) [性 (nominalizing suffix), suffix]
(6) [脳 (brain), noun]
(7) [脊髄 (spinal cord), noun]
(8) [炎 (inflammation, "-itis"), suffix]

From these minimum-unit morphemes, the morphological analysis unit 111 then sets the morphemes that serve as useful data units in data processing. For example, under the rule that merges a noun and a suffix that directly follows it into one morpheme, (1) + (2), (4) + (5), and (7) + (8) are merged, and "実験的自己免疫性脳脊髄炎" is divided into five morphemes:
(a) "実験的"
(b) "自己"
(c) "免疫性"
(d) "脳"
(e) "脊髄炎"

Next, the one-to-many word extraction unit 103 uses the bilingual dictionaries in the general-term bilingual dictionary storage unit 101 and the technical-term bilingual dictionary storage unit 102 to identify the English words corresponding to the five morphemes (a) to (e).

As described above, the technical-term bilingual dictionary storage unit 102 contains the following [Japanese - English] pair:
[実験的自己免疫性脳脊髄炎 - experimental autoimmune encephalomyelitis]
The one-to-many word extraction unit 103 first checks, for each English word in this pair, whether that word is registered on its own in a Japanese-English pair recorded in the general-term bilingual dictionary storage unit 101 or the technical-term bilingual dictionary storage unit 102. In this example, it determines whether each of the following words is registered:
"experimental"
"autoimmune"
"encephalomyelitis"

Suppose that, as a result, "encephalomyelitis" is detected in a pair registered in the technical-term bilingual dictionary storage unit 102; that is, the pair
"脳脊髄炎 - encephalomyelitis"
is extracted from the registered data of the technical-term bilingual dictionary storage unit 102.

Based on the detection of this pair, the one-to-many word extraction unit 103 determines that, among the five morphemes (a) to (e), the word [脳脊髄炎] formed by
(d) "脳"
(e) "脊髄炎"
is one morpheme corresponding to one English word.

Suppose further that "experimental" is detected in a pair registered in the general-term bilingual dictionary storage unit 101; that is, the pair
"実験的 - experimental"
is extracted from the registered data of the general-term bilingual dictionary storage unit 101.

Based on the detection of this pair, the one-to-many word extraction unit 103 determines that, among the five morphemes (a) to (e),
(a) "実験的"
is one morpheme corresponding to one English word.

Furthermore, based on the correspondences
"脳脊髄炎 - encephalomyelitis"
"実験的 - experimental"
the one-to-many word extraction unit 103 determines that the one remaining English word, "autoimmune", corresponds to the two remaining morphemes among (a) to (e), namely
(b) "自己"
(c) "免疫性"
That is, it determines that the following correspondence holds:
"自己免疫性 - autoimmune"

As a result, for the pair registered in the technical-term bilingual dictionary storage unit 102 in which one Japanese word is matched with several English words,
[実験的自己免疫性脳脊髄炎 - experimental autoimmune encephalomyelitis]
the one-to-many word extraction unit 103 determines the Japanese text corresponding to each English word:
"実験的 - experimental"
"自己免疫性 - autoimmune"
"脳脊髄炎 - encephalomyelitis"
The one-to-many word extraction unit 103 thus outputs each morpheme or set of morphemes that corresponds to one English word to the morphological analysis dictionary generation unit 104 as one morpheme. The morphological analysis dictionary generation unit 104 registers "実験的", "自己免疫性", and "脳脊髄炎" as morphemes in the morphological analysis dictionary 112.

In this way, the one-to-many word extraction unit 103 extracts Japanese morpheme strings that consist of two or more morphemes, correspond to a single English word, and are not contained in the general-term bilingual dictionary storage unit 101 or the technical-term bilingual dictionary storage unit 102, and outputs them to the morphological analysis dictionary generation unit 104. The morphological analysis dictionary generation unit 104 registers each item received from the one-to-many word extraction unit 103 in the morphological analysis dictionary 112 as a single morpheme.
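The selection criterion in the preceding paragraph can be expressed as a small filter. The sketch below assumes an alignment has already been computed and is given as a plain mapping from English words to Japanese morpheme lists; `new_dictionary_entries` and `KNOWN_JAPANESE` are hypothetical names.

```python
def new_dictionary_entries(alignment, known_japanese):
    """Pick morpheme strings that should be newly registered as one morpheme.

    An entry qualifies when it spans two or more morphemes, maps to a single
    English word, and is not already a Japanese headword in either dictionary.
    """
    entries = []
    for english, morphemes in alignment.items():
        surface = "".join(morphemes)
        if len(morphemes) >= 2 and surface not in known_japanese:
            entries.append(surface)
    return entries


# Alignment for 実験的自己免疫性脳脊髄炎, written out by hand from the example above.
ALIGNMENT = {
    "experimental": ["実験的"],
    "autoimmune": ["自己", "免疫性"],
    "encephalomyelitis": ["脳", "脊髄炎"],
}

# Japanese headwords assumed to exist already in units 101 and 102.
KNOWN_JAPANESE = {"潜在", "実験的", "甲状腺炎", "脳脊髄炎"}

entries = new_dictionary_entries(ALIGNMENT, KNOWN_JAPANESE)
```

Only 自己免疫性 is selected here: 実験的 is a single morpheme, and 脳脊髄炎 is already a headword in the technical-term dictionary.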

A medical concept that is expressed by a single word in English naturally denotes one fixed concept, precisely because it is expressed by a single word. Many Japanese medical terms are translations of English medical terms. Therefore, a string of several Japanese morphemes that corresponds to one English word is very likely to represent one concept, and it is considered appropriate to treat such a morpheme string as one morpheme.

Various Japanese technical terms that a conventional morphological analysis system splits into several Japanese morphemes, for example "後頭側頭", "イヌネコ回虫症", "回結腸炎", and "腫瘍性タンパク質", are also found to correspond to a single English word when processed by the one-to-many word extraction unit 103 described above:
"後頭側頭 (occipitotemporal)"
"イヌネコ回虫症 (toxocariasis)"
"回結腸炎 (ileocolitis)"
"腫瘍性タンパク質 (oncoprotein)"
Because each of these technical terms is found to correspond to one English word, "後頭側頭", "イヌネコ回虫症", "回結腸炎", and "腫瘍性タンパク質" can each be registered as a morpheme in the morphological analysis dictionary.

[Morphological analysis dictionary generation unit]
The morphological analysis dictionary generation unit 104 takes a general morphological analysis dictionary used by a conventional morphological analysis system as a base and generates or updates the morphological analysis dictionary 112 by additionally registering the morpheme data received from the one-to-many word extraction unit 103 described above.

That is, it generates or updates the morphological analysis dictionary 112 by adding, as morpheme data, the Japanese words in the general-term bilingual dictionary storage unit 101, the Japanese technical terms in the technical-term bilingual dictionary storage unit 102 whose English counterpart is a single English word, and the Japanese morpheme strings extracted by the one-to-many word extraction unit 103 in which several Japanese morphemes correspond to one word of an English technical term (for example, "自己免疫性").
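The three sources that feed the morphological analysis dictionary 112 can be combined as in the sketch below. Representing the dictionary as a set of surface strings is a simplification made for illustration (a real analysis dictionary would also carry part-of-speech and other information), and all names and sample data are hypothetical.

```python
def build_analysis_dictionary(base_entries, general_pairs, technical_pairs, extracted):
    """Combine the base dictionary with the three additional sources.

    `general_pairs` and `technical_pairs` map Japanese headwords to English.
    """
    entries = set(base_entries)
    # (1) Japanese words of the general-term dictionary (iterating a dict yields its keys).
    entries.update(general_pairs)
    # (2) Technical terms whose English counterpart is a single word.
    entries.update(
        japanese
        for japanese, english in technical_pairs.items()
        if len(english.split()) == 1
    )
    # (3) Morpheme strings produced by the one-to-many word extraction.
    entries.update(extracted)
    return entries


dictionary_112 = build_analysis_dictionary(
    base_entries={"自己", "免疫", "性"},
    general_pairs={"潜在": "latent", "実験的": "experimental"},
    technical_pairs={
        "甲状腺炎": "thyroiditis",
        "自己免疫性甲状腺炎": "autoimmune thyroiditis",
    },
    extracted=["自己免疫性"],
)
```

The multi-word term 自己免疫性甲状腺炎 is intentionally not added as a whole; only its parts that correspond to single English words are registered.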

[Morphological analysis unit]
The morphological analysis unit 105 performs morphological analysis using the morphological analysis dictionary 112 generated or updated by the morphological analysis dictionary generation unit 104. For example, it receives medical text as input, performs morphological analysis using the morphological analysis dictionary 112, and outputs the analysis result.

By using the morphological analysis dictionary 112 generated by the morphological analysis dictionary generation unit 104, terms such as "潜在自己免疫性甲状腺炎" (latent autoimmune thyroiditis) and "自己免疫性慢性甲状腺炎" (autoimmune chronic thyroiditis) can be correctly divided into "潜在" / "自己免疫性" / "甲状腺炎" and "自己免疫性" / "慢性" / "甲状腺炎", respectively.
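The effect of the added entries can be illustrated with a toy segmenter. The description does not specify the analysis algorithm used by the morphological analysis unit 105; greedy longest-match is substituted here purely for illustration, and practical analyzers typically use more elaborate models. The dictionary contents, including the assumption that 慢性 is already registered, are hypothetical.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    Characters not covered by any dictionary entry are emitted one by one.
    """
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            # No entry starts at this position; fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical contents of dictionary 112 after the additions described above.
DICTIONARY_112 = {"潜在", "慢性", "実験的", "甲状腺炎", "脳脊髄炎", "自己免疫性"}

first = segment("潜在自己免疫性甲状腺炎", DICTIONARY_112)
second = segment("自己免疫性慢性甲状腺炎", DICTIONARY_112)
```

With 自己免疫性 registered, both terms split into the three units given in the text. Without that entry, this toy segmenter would fall back to single characters for that part of the term.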

With the above configuration, even a technical term that contains a morpheme not found in the medical technical-term dictionary, such as "自己免疫性", can be divided into the correct morphemes with high accuracy.

Next, the sequence of processing executed in the language analysis system of the present invention is described with reference to the flowchart shown in FIG. 2. The processing flow shown in FIG. 2 consists of five steps, S101 to S105. Of these, steps S101 to S103 are performed by the one-to-many word extraction unit 103 shown in FIG. 1, step S104 by the morphological analysis dictionary generation unit 104, and step S105 by the morphological analysis unit 105.

First, in step S101, the one-to-many word extraction unit 103 applies morphological analysis to the Japanese technical terms extracted from the Japanese-English translation pairs in the technical-term bilingual dictionary storage unit 102. Specifically, as described above, from the data registered as Japanese-English pairs in the technical-term bilingual dictionary storage unit 102, it extracts the pairs in which several Japanese morphemes correspond to one word of the English technical term, takes the Japanese term from each extracted pair, and performs morphological analysis in the morphological analysis unit 111 of the one-to-many word extraction unit 103.

In the processing example described above, step S101 corresponds to applying morphological analysis to "自己免疫性甲状腺炎", the Japanese technical term of the following pair in the technical-term bilingual dictionary storage unit 102:
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
First, the following analysis result is obtained:
(1) [自己 (self), noun]
(2) [免疫 (immunity), noun]
(3) [性 (nominalizing suffix), suffix]
(4) [甲状腺 (thyroid gland), noun]
(5) [炎 (inflammation, "-itis"), suffix]
From these minimum-unit morphemes, the morphological analysis unit 111 then determines the morphemes that serve as useful data units in data processing. For example, under the rule that merges a noun and a suffix that directly follows it into one morpheme, (2) + (3) and (4) + (5) are merged, and "自己免疫性甲状腺炎" is divided into three morphemes:
(a) "自己"
(b) "免疫性"
(c) "甲状腺炎"

In step S102, the one-to-many word extraction unit 103 identifies the corresponding English words using the bilingual dictionaries in the general-term bilingual dictionary storage unit 101 and the technical-term bilingual dictionary storage unit 102. In the processing example above, it identifies the English words corresponding to the three morphemes (a) to (c). That is, for the [Japanese - English] pair registered in the technical-term bilingual dictionary storage unit 102,
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
it determines for each of "autoimmune" and "thyroiditis" whether the single English word is registered as a translation pair, and detects that the pair
"甲状腺炎 - thyroiditis"
is registered in the technical-term bilingual dictionary storage unit 102.

In step S103, based on the word identification result of step S102, the correspondence is determined between the remaining English words and the Japanese morpheme strings that cannot be matched directly from the pairs registered in the general-term bilingual dictionary storage unit 101 and the technical-term bilingual dictionary storage unit 102.

In the processing example above, based on the following pairs registered in the technical-term bilingual dictionary storage unit 102 or the general-term bilingual dictionary storage unit 101,
[自己免疫性甲状腺炎 - autoimmune thyroiditis]
"甲状腺炎 - thyroiditis"
the one-to-many word extraction unit 103 identifies that "thyroiditis" corresponds to "甲状腺炎", and accordingly determines that the remaining English word "autoimmune" corresponds to the remaining morpheme string "自己" + "免疫性".

The one-to-many word extraction unit 103 outputs the results of steps S101 to S103 to the morphological analysis dictionary generation unit 104.

Step S104 is performed by the morphological analysis dictionary generation unit 104, which adds to the morphological analysis dictionary the morpheme strings received from the one-to-many word extraction unit 103 and the Japanese expressions that correspond to a single English word in the bilingual dictionaries. Specifically, it generates or updates the morphological analysis dictionary 112 by adding, as morpheme data, the Japanese words in the general-term bilingual dictionary storage unit 101, the Japanese technical terms in the technical-term bilingual dictionary storage unit 102 whose English counterpart is a single English word, and the Japanese morpheme strings extracted by the one-to-many word extraction unit 103 in which several Japanese morphemes correspond to one word of an English technical term (for example, "自己免疫性").

Step S105 is the processing of the morpheme analysis means 105, which executes morpheme analysis using the morpheme analysis dictionary 112 generated or updated by the morpheme analysis dictionary generation means 104 in step S104. For example, it receives medical text as input, performs morpheme analysis using the morpheme analysis dictionary 112 generated by the morpheme analysis dictionary generation means 104, and outputs the analysis result.

なお、上述した実施例では、専門分野を医療分野とした例を説明したが、前述したように、本発明は医療分野に限らず、その他の専門分野、例えば経済、建築、技術などの様々な専門分野の専門用語の解析に適用可能である。各専門分野に対応する処理を行なう場合は、それぞれの専門分野に対応する専門用語対訳辞書を適用した処理を実行すればよい。 In the above-described embodiment, the example in which the specialized field is the medical field has been described. However, as described above, the present invention is not limited to the medical field, and various other specialized fields such as economy, architecture, technology, and the like. Applicable to analysis of technical terms in specialized fields. When processing corresponding to each specialized field is performed, processing applying a technical term bilingual dictionary corresponding to each specialized field may be executed.

なお、図１に示す言語解析システムの構成例では、一般用語対訳辞書格納手段１０１と、専門用語対訳辞書格納手段１０２とをそれぞれ独立した手段として構成した例を示しているが、一般用語対訳辞書格納手段１０１と、専門用語対訳辞書格納手段１０２とを融合した１つの辞書格納手段として構成してもよい。 The configuration example of the language analysis system shown in FIG. 1 shows an example in which the general term bilingual dictionary storage unit 101 and the technical term bilingual dictionary storage unit 102 are configured as independent units. The storage unit 101 and the technical term bilingual dictionary storage unit 102 may be combined into a single dictionary storage unit.

また、上述した実施例では、日本語と英語の対訳辞書を適用した処理例を説明したが、英語以外の言語と日本語との対訳辞書を適用する構成としてもよい。例えば日本語とドイツ語、日本語とフランス語などの対訳辞書を適用しても上記の実施例と同様の処理が実現できる。すなわち、単語がスペース記号によって区分されていることが明確な言語と日本語との対訳辞書を利用することで、上記の日英対訳辞書を適用した処理と同様の処理を行なうことができる。 In the above-described embodiment, a processing example in which a bilingual dictionary of Japanese and English is applied has been described. However, a configuration in which a bilingual dictionary of languages other than English and Japanese is applied may be adopted. For example, even when a bilingual dictionary such as Japanese and German or Japanese and French is applied, the same processing as in the above embodiment can be realized. That is, by using a bilingual dictionary of a language and Japanese in which words are clearly divided by space symbols, the same processing as the processing applying the above Japanese-English bilingual dictionary can be performed.

[2. Other Embodiments]
The following describes several modified embodiments that differ from the basic embodiment described above.

(2-1) Modified Embodiment 1
In the processing example of the language analysis system described with reference to FIGS. 1 and 2, the one-to-many word extraction means 103 extracts, from the bilingual pairs registered in the technical term bilingual dictionary storage means 102, the data in which one Japanese term is paired with a plurality of English words, and the morpheme analysis means 111 performs morpheme analysis on the Japanese. This morpheme analysis was described as being executed with an existing morpheme analysis dictionary.

As shown in FIG. 3, the language analysis system 200 of Modified Embodiment 1 applies the morpheme analysis dictionary 112, generated and updated by the morpheme analysis dictionary generation means 104, to the morpheme analysis executed by the morpheme analysis means 111 of the one-to-many word extraction means 103. That is, the morpheme analysis means 111 executes each new morpheme analysis using the data that was registered as morphemes in earlier analysis runs on the basis of Japanese-English term pairs whose correspondence had been established.

For example, suppose that an earlier analysis run established that "autoimmune" corresponds to "自己" and "免疫性", and that "自己免疫性" was therefore added to the morpheme analysis dictionary 112. In subsequent morpheme analysis by the morpheme analysis means 111 of the one-to-many word extraction means 103, "自己免疫性" can then be identified immediately as a single morpheme, which makes the processing more efficient. Merging morpheme sequences recursively in this way also yields morphemes for technical term analysis with fewer omissions.
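The effect of feeding the updated dictionary back into the analysis can be shown with a toy segmenter. The specification does not state which segmentation algorithm the morpheme analysis means 111 uses; the greedy longest-match strategy below, and the function name `segment`, are stand-ins chosen only to make the feedback loop visible.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (a stand-in for real morpheme analysis)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matched: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


base_dictionary = {"自己", "免疫", "性", "甲状腺", "炎"}
# Entries registered by an earlier analysis run (hypothetical state of dictionary 112).
updated_dictionary = base_dictionary | {"自己免疫性", "甲状腺炎"}

before = segment("自己免疫性甲状腺炎", base_dictionary)
after = segment("自己免疫性甲状腺炎", updated_dictionary)
```

Here `before` contains five minimal morphemes, whereas `after` contains only "自己免疫性" and "甲状腺炎", so no further merging is needed on the second pass.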

(2-2) Modified Embodiment 2
In the basic embodiment described above, when the morpheme analysis means 111 of the one-to-many word extraction means 103 merges morphemes during morpheme analysis, the merging is based on the part-of-speech information obtained by the morpheme analysis. Specifically, a suffix that follows a noun is merged with that noun into a single morpheme. In the example given, the morphemes
(1) [自己, noun]
(2) [免疫, noun]
(3) [性, suffix]
(4) [甲状腺, noun]
(5) [炎, suffix]
were merged into the following three morphemes:
(a) "自己"
(b) "免疫性"
(c) "甲状腺炎"
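The part-of-speech based merging of the basic embodiment, in which a suffix is attached to the noun it follows, can be sketched as below. The function name `merge_noun_suffix` and the tag strings `"noun"` and `"suffix"` are assumptions for the example; an actual morphological analyzer would supply its own tag set.

```python
def merge_noun_suffix(morphemes):
    """Merge each suffix into the noun that immediately precedes it.

    morphemes: list of (surface, part_of_speech) tuples.
    Returns the list of merged surface strings.
    """
    merged = []  # list of [surface, part_of_speech]
    for surface, pos in morphemes:
        if pos == "suffix" and merged and merged[-1][1] == "noun":
            # Attach the suffix; the merged unit is still treated as a noun.
            merged[-1][0] += surface
        else:
            merged.append([surface, pos])
    return [surface for surface, _ in merged]


merged = merge_noun_suffix([
    ("自己", "noun"),
    ("免疫", "noun"),
    ("性", "suffix"),
    ("甲状腺", "noun"),
    ("炎", "suffix"),
])
# merged == ["自己", "免疫性", "甲状腺炎"]
```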

This merging is based on part-of-speech analysis. Alternatively, the bilingual pairs registered in the technical term bilingual dictionary storage means may be analyzed to count how many times each combination of minimum morphemes (for example, morphemes (1) to (5) above) occurs as a merged sequence (for example, "自己免疫性", formed from (1) to (3)), and a morpheme sequence whose count is at or above a preset threshold is then determined to be one morpheme. For example, with a threshold of 2, if "自己免疫性" is found to have been merged twice, as in the example above, it is set as morpheme data to be added to the morpheme analysis dictionary. A morpheme sequence that was merged only once is not added.

なお、専門用語対訳辞書格納手段に登録された対訳ペアの解析ではなく、例えば専門分野の文書データベースの格納文書の解析によって同様に出現回数のカウントを行なって、閾値との比較に基づいて纏め処理を実行してもよい。このような纏め処理を行なうことで、より精度の高い専門用語解析用形態素の抽出を実現することができる。 In addition, instead of analyzing the bilingual pair registered in the technical term bilingual dictionary storage means, for example, the number of appearances is similarly counted by analyzing the stored document in the document database of the specialized field, and the summary processing is performed based on the comparison with the threshold value. May be executed. By performing such summarization processing, it is possible to realize more accurate extraction of technical term analysis morphemes.
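The count-and-threshold rule of this modified embodiment can be sketched with a frequency counter. The function name `frequent_sequences` and the input format (one morpheme list per observed merge) are assumptions for the example; the threshold of 2 follows the value used in the text.

```python
from collections import Counter


def frequent_sequences(merged_sequences, threshold=2):
    """Return merged morpheme sequences seen at least `threshold` times.

    merged_sequences: one morpheme list per observed merge, gathered either
    from the bilingual pairs or from a domain document collection.
    """
    counts = Counter("".join(seq) for seq in merged_sequences)
    return {surface for surface, n in counts.items() if n >= threshold}


observed = [
    ["自己", "免疫性"],   # e.g. from one bilingual pair
    ["自己", "免疫性"],   # e.g. from a second bilingual pair
    ["甲状腺", "機能"],   # merged only once, so it is not registered
]
to_register = frequent_sequences(observed, threshold=2)
```

Only "自己免疫性" reaches the threshold here, so it alone would be added to the morpheme analysis dictionary.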

(2-3) Modified Embodiment 3
FIG. 4 shows a configuration example of the language analysis system 300 according to Modified Embodiment 3. As shown in FIG. 4, the language analysis system 300 has an English corpus storage means 301, which stores a collection of English texts in the medical field.

In this configuration example, the one-to-many word extraction means 103 identifies, among the sequences of two (or more) consecutive English words within the English technical terms in the technical term bilingual dictionary storage means 102, those English word sequences that occur frequently in the English text collection in the English corpus storage means 301. It regards each such sequence as a single English word and extracts the corresponding Japanese morpheme sequence.

すなわち、前述の基本実施例では、１つの英単語に対応する形態素列を１つの形態素として設定する処理を行っていたが、本構成例では、複数の英単語であっても、英語コーパス格納手段３０１中の英語テキスト集合において頻出する英単語列である場合は、この英単語列を一つの英語の単語とみなして、この英単語列に対応する日本語形態素列を抽出し、これを１つの形態素として判断し、形態素解析用辞書生成手段１０４に出力して形態素解析用辞書１１２に登録する。 That is, in the above-described basic embodiment, the process of setting a morpheme sequence corresponding to one English word as one morpheme is performed. However, in this configuration example, even if there are a plurality of English words, the English corpus storage means In the case of an English word sequence frequently appearing in the English text set in 301, the English word sequence is regarded as one English word, and a Japanese morpheme sequence corresponding to the English word sequence is extracted. The morpheme is determined as a morpheme, output to the morpheme analysis dictionary generation means 104 and registered in the morpheme analysis dictionary 112.
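The corpus-based grouping of English words can be sketched as follows. The specification only says that the word sequence must occur frequently; the two-word window, the `min_count` threshold, the lowercasing, and the function names `frequent_bigrams` and `group_term_words` are assumptions introduced for this illustration.

```python
from collections import Counter


def frequent_bigrams(corpus_sentences, min_count=2):
    """Return two-word sequences that occur at least `min_count` times."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return {" ".join(pair) for pair, n in counts.items() if n >= min_count}


def group_term_words(term, frequent):
    """Split an English term into units, treating frequent bigrams as one word."""
    words = term.lower().split()
    units = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and f"{words[i]} {words[i + 1]}" in frequent:
            units.append(f"{words[i]} {words[i + 1]}")
            i += 2
        else:
            units.append(words[i])
            i += 1
    return units


corpus = [
    "Patients with diabetes mellitus were enrolled",
    "Diabetes mellitus is common",
]
frequent = frequent_bigrams(corpus, min_count=2)
units = group_term_words("insulin dependent diabetes mellitus", frequent)
```

In this toy corpus only "diabetes mellitus" occurs twice, so the term is split into "insulin", "dependent", and "diabetes mellitus", and the Japanese morpheme sequence aligned to the last unit would be registered as one morpheme.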

This configuration example yields morphemes for technical term analysis with fewer omissions. Although the figure shows an English corpus storage means 301, languages other than English can also be used, as noted above. A corpus storage means may be provided for any other non-Japanese language in which word boundaries are clearly marked by spaces, such as German or French, with a bilingual dictionary for the same language set in each bilingual dictionary storage means.

最後に、上述した処理を実行する言語解析システムを構成する情報処理装置のハードウェア構成例について、図５を参照して説明する。ＣＰＵ（Central Processing Unit）５０１は、ＯＳ（Operating System)に対応する処理や、上述の実施例において説明した一対多単語抽出処理、形態素解析処理、形態素解析用辞書生成処理などを実行する。これらの処理は、各情報処理装置のＲＯＭ、ハードディスクなどのデータ記憶部に格納されたコンピュータ・プログラムに従って実行される。 Finally, a hardware configuration example of the information processing apparatus that configures the language analysis system that executes the above-described processing will be described with reference to FIG. A CPU (Central Processing Unit) 501 executes processing corresponding to an OS (Operating System), one-to-multiword extraction processing, morpheme analysis processing, morpheme analysis dictionary generation processing, and the like described in the above embodiments. These processes are executed according to a computer program stored in a data storage unit such as a ROM or a hard disk of each information processing apparatus.

ＲＯＭ（Read Only Memory）５０２は、ＣＰＵ５０１が使用するプログラムや演算パラメータ等を格納する。ＲＡＭ（Random Access Memory）５０３は、ＣＰＵ５０１の実行において使用するプログラムや、その実行において適宜変化するパラメータ等を格納する。これらはＣＰＵバスなどから構成されるホストバス５０４により相互に接続されている。 A ROM (Read Only Memory) 502 stores programs used by the CPU 501, calculation parameters, and the like. A RAM (Random Access Memory) 503 stores programs used in the execution of the CPU 501, parameters that change as appropriate during the execution, and the like. These are connected to each other by a host bus 504 including a CPU bus.

ホストバス５０４は、ブリッジ５０５を介して、ＰＣＩ(Peripheral Component Interconnect/Interface)バスなどの外部バス５０６に接続されている。 The host bus 504 is connected to an external bus 506 such as a PCI (Peripheral Component Interconnect / Interface) bus via a bridge 505.

キーボード５０８、ポインティングデバイス５０９は、ユーザにより操作される入力デバイスである。ディスプレイ５１０は、液晶表示装置またはＣＲＴ（Cathode Ray Tube）などから成り、各種情報をテキストやイメージで表示する。 A keyboard 508 and a pointing device 509 are input devices operated by the user. The display 510 includes a liquid crystal display device, a CRT (Cathode Ray Tube), or the like, and displays various types of information as text and images.

The HDD (Hard Disk Drive) 511 contains a hard disk, drives it, and records or reproduces programs executed by the CPU 501 and other information. The hard disk serves, for example, as the storage means for the various dictionaries described in the embodiments above, namely the general term bilingual dictionary, the technical term bilingual dictionary, and the morpheme analysis dictionary, and also stores various computer programs such as data processing programs.

The drive 512 reads data or programs recorded on a mounted removable recording medium 521, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and supplies them to the RAM 503, which is connected via the interface 507, the external bus 506, the bridge 505, and the host bus 504.

接続ポート５１４は、外部接続機器５２２を接続するポートであり、ＵＳＢ，ＩＥＥＥ１３９４等の接続部を持つ。接続ポート５１４は、インタフェース５０７、および外部バス５０６、ブリッジ５０５、ホストバス５０４等を介してＣＰＵ５０１等に接続されている。通信部５１５は、ネットワークに接続され、各種データベースや他の情報処理装置との通信を実行する。 The connection port 514 is a port for connecting the external connection device 522 and has a connection unit such as USB or IEEE1394. The connection port 514 is connected to the CPU 501 and the like via the interface 507, the external bus 506, the bridge 505, the host bus 504, and the like. The communication unit 515 is connected to a network and executes communication with various databases and other information processing apparatuses.

なお、図５に示す言語解析システムとしての情報処理装置のハードウェア構成例は、ＰＣを適用して構成した装置の一例であり、本発明の言語解析システムは、図５に示す構成に限らず、上述した実施例において説明した処理を実行可能な構成であればよい。 Note that the hardware configuration example of the information processing apparatus as the language analysis system shown in FIG. 5 is an example of an apparatus configured by applying a PC, and the language analysis system of the present invention is not limited to the configuration shown in FIG. Any configuration capable of executing the processing described in the above-described embodiments may be used.

以上、特定の実施例を参照しながら、本発明について詳解してきた。しかしながら、本発明の要旨を逸脱しない範囲で当業者が該実施例の修正や代用を成し得ることは自明である。すなわち、例示という形態で本発明を開示してきたのであり、限定的に解釈されるべきではない。本発明の要旨を判断するためには、特許請求の範囲の欄を参酌すべきである。 The present invention has been described in detail above with reference to specific embodiments. However, it is obvious that those skilled in the art can make modifications and substitutions of the embodiments without departing from the gist of the present invention. In other words, the present invention has been disclosed in the form of exemplification, and should not be interpreted in a limited manner. In order to determine the gist of the present invention, the claims should be taken into consideration.

なお、明細書中において説明した一連の処理はハードウェア、またはソフトウェア、あるいは両者の複合構成によって実行することが可能である。ソフトウェアによる処理を実行する場合は、処理シーケンスを記録したプログラムを、専用のハードウェアに組み込まれたコンピュータ内のメモリにインストールして実行させるか、あるいは、各種処理が実行可能な汎用コンピュータにプログラムをインストールして実行させることが可能である。 The series of processes described in the specification can be executed by hardware, software, or a combined configuration of both. When executing processing by software, the program recording the processing sequence is installed in a memory in a computer incorporated in dedicated hardware and executed, or the program is executed on a general-purpose computer capable of executing various processing. It can be installed and run.

例えば、プログラムは記録媒体としてのハードディスクやＲＯＭ（Read Only Memory)に予め記録しておくことができる。あるいは、プログラムはフレキシブルディスク、ＣＤ−ＲＯＭ(Compact Disc Read Only Memory)，ＭＯ(Magneto optical)ディスク，ＤＶＤ(Digital Versatile Disc)、磁気ディスク、半導体メモリなどのリムーバブル記録媒体に、一時的あるいは永続的に格納（記録）しておくことができる。このようなリムーバブル記録媒体は、いわゆるパッケージソフトウエアとして提供することができる。 For example, the program can be recorded in advance on a hard disk or ROM (Read Only Memory) as a recording medium. Alternatively, the program is temporarily or permanently stored on a removable recording medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. It can be stored (recorded). Such a removable recording medium can be provided as so-called package software.

Besides being installed on a computer from a removable recording medium as described above, the program can be transferred to the computer wirelessly from a download site, or by wire via a network such as a LAN (Local Area Network) or the Internet. The computer receives the program transferred in this way and installs it on a recording medium such as a built-in hard disk.

なお、明細書に記載された各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。また、本明細書においてシステムとは、複数の装置の論理的集合構成であり、各構成の装置が同一筐体内にあるものには限らない。 Note that the various processes described in the specification are not only executed in time series according to the description, but may be executed in parallel or individually according to the processing capability of the apparatus that executes the processes or as necessary. Further, in this specification, the system is a logical set configuration of a plurality of devices, and the devices of each configuration are not limited to being in the same casing.

As described above, the configuration of the present invention extracts, from the registered data of a bilingual dictionary of Japanese and a foreign language, the bilingual pairs whose registered data is one word to multiple words rather than one word to one word; performs morpheme analysis on the Japanese in each extracted pair to generate a plurality of segment words; obtains, based on the registered data of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; registers the segment word or segment word string corresponding to the obtained foreign-language word in the morpheme analysis dictionary as a morpheme; and performs morpheme analysis based on the registered morpheme information. With this configuration, morpheme registration information based on correct segmentation can be generated and used even for terms in a specialized field such as medicine, so that accurate morpheme analysis is achieved.

FIG. 1 is a diagram showing a basic configuration example of the language analysis system of the present invention.
FIG. 2 is a flowchart illustrating the processing sequence executed by the language analysis system of the present invention.
FIG. 3 is a diagram showing the configuration of one embodiment of the language analysis system of the present invention.
FIG. 4 is a diagram showing the configuration of one embodiment of the language analysis system of the present invention.
FIG. 5 is a diagram illustrating a hardware configuration example of the language analysis system according to one embodiment of the present invention.

Explanation of symbols

100 Language analysis system
101 General term bilingual dictionary storage means
102 Technical term bilingual dictionary storage means
103 One-to-many word extraction means
104 Morpheme analysis dictionary generation means
105 Morpheme analysis means
111 Morpheme analysis means
112 Morpheme analysis dictionary
301 English corpus storage means
501 CPU (Central Processing Unit)
502 ROM (Read Only Memory)
503 RAM (Random Access Memory)
504 Host bus
505 Bridge
506 External bus
507 Interface
508 Keyboard
509 Pointing device
510 Display
511 HDD (Hard Disk Drive)
512 Drive
514 Connection port
515 Communication unit
521 Removable recording medium
522 External connection device

Claims

A language analysis system comprising:
bilingual dictionary storage means for storing a bilingual dictionary of Japanese and a foreign language other than Japanese;
one-to-many word extraction means for extracting, from the registered data of the bilingual dictionary, one-word-to-multiple-word bilingual pairs, as distinct from one-word-to-one-word pairs, generating a plurality of segment words by dividing the Japanese contained in each extracted pair on the basis of morpheme analysis, and analyzing, based on the registered data of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; and
morpheme analysis dictionary generation means for registering, in a morpheme analysis dictionary, the Japanese corresponding to each segment word or segment word string that the one-to-many word extraction means has determined to correspond to a foreign-language word.

The language analysis system according to claim 1, further comprising morpheme analysis means for executing morpheme analysis using the morpheme analysis dictionary generated by the morpheme analysis dictionary generation means.

The language analysis system according to claim 1, wherein the one-to-many word extraction means outputs to the morpheme analysis dictionary generation means, for each bilingual pair in the registered data of the bilingual dictionary that is a one-word-to-one-word pair, the Japanese word contained in that pair, and
the morpheme analysis dictionary generation means registers the Japanese word input from the one-to-many word extraction means in the morpheme analysis dictionary.

The bilingual dictionary storage means includes:
A general term bilingual dictionary that stores a bilingual dictionary of Japanese and foreign languages for general terms, and a technical term bilingual dictionary that stores bilingual dictionaries of Japanese and foreign languages for technical terms,
The one-to-many word extraction means includes
From the data registered in the technical term bilingual dictionary, a bilingual pair of one word to many words in which a bilingual pair of Japanese and a foreign language is different from one word to one word is extracted, and the Japanese translation included in the extracted bilingual pair A plurality of segment words obtained by dividing Japanese by morphological analysis are generated, and the words in the foreign language corresponding to the generated individual segment words or segment word strings are stored in both the technical term bilingual dictionary and the general term bilingual dictionary. The language analysis system according to claim 1, wherein the language analysis system is configured to execute a process of analyzing based on registered data.

The language analysis system according to claim 4, wherein the technical term bilingual dictionary is a bilingual dictionary storing technical terms in the medical field, and
the one-to-many word extraction means generates a plurality of segment words by morpheme analysis of Japanese technical terms in the medical field and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the registered data of both the technical term bilingual dictionary and the general term bilingual dictionary.

The bilingual dictionary storage means is configured to store a bilingual dictionary of Japanese and English,
One-to-many word extraction means
From the data registered in the bilingual dictionary, a bilingual pair of one-word-multiple words whose Japanese-English bilingual pair is different from one-word-one-word is extracted, and by morphological analysis of Japanese included in the extracted bilingual pair It is configured to generate a plurality of segment words obtained by dividing Japanese, and to perform processing for analyzing English words corresponding to the generated segment words or segment word strings based on registration data of the bilingual dictionary The language analysis system according to claim 1.

The language analysis system according to claim 1, wherein the one-to-many word extraction means generates a plurality of segment words by dividing Japanese through morpheme analysis, generates segment word strings by merging the generated segment words using the part-of-speech information obtained in the morpheme analysis, and analyzes the foreign-language word corresponding to each segment word string.

The language analysis system according to claim 1, wherein the one-to-many word extraction means generates a plurality of segment words by dividing Japanese through morpheme analysis and, when generating segment word strings from the generated segment words, analyzes the frequency with which each segment word string appears in the bilingual dictionary or another database, and analyzes the corresponding foreign-language word for segment word strings with a high frequency of appearance.

The language analysis system according to claim 1, further comprising foreign language corpus storage means for storing text data in the foreign language,
wherein the one-to-many word extraction means, when analyzing the foreign-language word corresponding to a segment word or segment word string generated by the Japanese morpheme analysis, refers to the foreign-language text in the corpus storage means, identifies word strings that appear frequently in that text, and treats each such word string as a single foreign-language word.

A language analysis method for executing language analysis processing in a language analysis system, comprising:
a one-to-many word extraction step in which one-to-many word extraction means refers to a bilingual dictionary of Japanese and a foreign language other than Japanese, extracts from the registered data of the bilingual dictionary one-word-to-multiple-word bilingual pairs, as distinct from one-word-to-one-word pairs, generates a plurality of segment words by dividing the Japanese contained in each extracted pair on the basis of morpheme analysis, and analyzes, based on the registered data of the bilingual dictionary, the foreign-language word corresponding to each generated segment word or segment word string; and
a morpheme analysis dictionary generation step in which morpheme analysis dictionary generation means registers, in a morpheme analysis dictionary, the Japanese corresponding to each segment word or segment word string determined in the one-to-many word extraction step to correspond to a foreign-language word.

The language analysis method further includes:
A morpheme analysis step, wherein the morpheme analysis unit executes a morpheme analysis to which the morpheme analysis dictionary generated in the morpheme analysis dictionary generation step is applied;
The language analysis method according to claim 10, comprising:

The one-to-many word extraction step includes:
For a bilingual pair in which the bilingual pair of Japanese and foreign language included in the registration data of the bilingual dictionary is a one-to-one bilingual pair, the Japanese word included in the bilingual pair is output to the dictionary generating unit for morphological analysis Including steps,
The morphological analysis dictionary generation step includes:
The language analysis method according to claim 10, wherein a process of registering the Japanese word input in the one-to-many word extraction step in a morphological analysis dictionary is executed.

The bilingual dictionary includes a general term bilingual dictionary storing a bilingual dictionary of Japanese and foreign languages for general terms, and a technical term bilingual dictionary storing a bilingual dictionary of Japanese and foreign languages for technical terms. ,
The one-to-many word extraction step includes:
From the data registered in the technical term bilingual dictionary, a bilingual pair of one word to many words in which a bilingual pair of Japanese and a foreign language is different from one word to one word is extracted, and the Japanese translation included in the extracted bilingual pair A plurality of segment words obtained by dividing Japanese by morphological analysis are generated, and the words in the foreign language corresponding to the generated individual segment words or segment word strings are stored in both the technical term bilingual dictionary and the general term bilingual dictionary. The language analysis method according to claim 10, wherein the language analysis method is a step of executing a process of analyzing based on registered data.

The technical term bilingual dictionary is a bilingual dictionary storing technical terms in the medical field,
The one-to-many word extraction step includes:
A plurality of segment words are generated by morphological analysis of Japanese which is a technical term in the medical field, and the foreign word corresponding to the generated segment word or segment word string is converted into the technical term bilingual dictionary and the general term bilingual dictionary. The language analysis method according to claim 13, wherein the language analysis method is a step of executing a process of analyzing based on both of the registered data.

The language analysis method according to claim 10, wherein the bilingual dictionary is a bilingual dictionary of Japanese and English, and
the one-to-many word extraction step is a step of extracting, from the data registered in the bilingual dictionary, one-word-to-multiple-word bilingual pairs, that is, Japanese-English pairs that are not one-word-to-one-word, generating a plurality of segment words by dividing the Japanese included in each extracted bilingual pair through morphological analysis, and analyzing the English word corresponding to each generated segment word or segment word string based on the data registered in the bilingual dictionary.

The language analysis method according to claim 10, wherein the one-to-many word extraction step is a step of generating a plurality of segment words by dividing Japanese through morphological analysis, generating segment word strings by grouping the generated segment words using the part-of-speech information obtained for each segment word in the morphological analysis, and analyzing the foreign-language word corresponding to each segment word string.
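The grouping of segment words by part of speech might look like the sketch below. The claim does not state which part-of-speech rule is used, so the rule here (attach a prefix to the following word and a suffix to the preceding word) is an assumption, as are the function name `merge_by_pos`, the tag names `prefix`, `suffix` and `noun`, and the sample tokens.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); the tag set is hypothetical


def merge_by_pos(tokens: List[Token]) -> List[str]:
    """Group segment words into segment word strings using part-of-speech tags.

    Assumed rule: a prefix joins the word that follows it,
    and a suffix joins the word that precedes it.
    """
    merged: List[str] = []
    pending_prefix = ""
    for surface, pos in tokens:
        if pos == "prefix":
            pending_prefix += surface  # hold until the next word arrives
        elif pos == "suffix" and merged and not pending_prefix:
            merged[-1] += surface  # attach to the previous string
        else:
            merged.append(pending_prefix + surface)
            pending_prefix = ""
    if pending_prefix:  # a trailing prefix with nothing to attach to
        merged.append(pending_prefix)
    return merged


tokens = [("非", "prefix"), ("ステロイド", "noun"), ("性", "suffix"),
          ("抗炎症", "noun"), ("薬", "suffix")]
strings = merge_by_pos(tokens)
```

A real system would take the tags from the morphological analyzer's output rather than from hand-written tuples.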

The language analysis method according to claim 10, wherein the one-to-many word extraction step is a step of generating a plurality of segment words by dividing Japanese through morphological analysis and, when generating segment word strings from the generated segment words, analyzing the appearance frequency of each segment word string in the bilingual dictionary or another database, and analyzing the corresponding foreign-language word for segment word strings having a high appearance frequency.
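One way to picture the frequency check is the sketch below. It counts how many dictionary headwords contain each contiguous segment word string and keeps those that reach a threshold. The claim gives no threshold or counting method, so `min_count`, substring counting, the function names, and the sample headwords are all assumptions.

```python
from typing import Dict, List


def candidate_strings(segments: List[str]) -> List[str]:
    """All contiguous runs of two or more segment words, joined without spaces."""
    n = len(segments)
    return ["".join(segments[i:j]) for i in range(n) for j in range(i + 2, n + 1)]


def frequent_strings(segments: List[str],
                     entries: List[str],
                     min_count: int = 2) -> Dict[str, int]:
    """Keep candidate strings contained in at least min_count entries."""
    counts = {c: sum(c in entry for entry in entries)
              for c in candidate_strings(segments)}
    return {c: k for c, k in counts.items() if k >= min_count}


# Hypothetical Japanese headwords standing in for the bilingual dictionary.
entries = ["急性骨髄性白血病", "慢性骨髄性白血病", "骨髄性白血病"]
frequent = frequent_strings(["急性", "骨髄", "性", "白血病"], entries)
```

Raw substring frequency also keeps strings that are not meaningful units on their own, so in practice it would likely be combined with other evidence such as part-of-speech information.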

The language analysis method according to claim 10, wherein the one-to-many word extraction step, in analyzing the foreign-language word corresponding to a segment word or segment word string generated by Japanese morphological analysis, refers to foreign-language text stored in foreign language corpus storage means that stores foreign-language text data, identifies a word string that appears frequently in the foreign-language text, and performs the analysis treating that word string as one foreign-language word.
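The corpus-based step can be illustrated with a small sketch that counts adjacent word pairs in foreign-language text and then treats frequent pairs as a single word. Restricting the word strings to two-word pairs, the threshold `min_count`, the function names, and the sample sentences are assumptions made for the example.

```python
from collections import Counter
from typing import List, Set, Tuple


def frequent_bigrams(corpus: List[str], min_count: int = 2) -> Set[Tuple[str, str]]:
    """Return adjacent word pairs that occur at least min_count times in the corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, k in counts.items() if k >= min_count}


def group_words(phrase: str, bigrams: Set[Tuple[str, str]]) -> List[str]:
    """Split a phrase into words, keeping frequent pairs together as one unit."""
    words = phrase.lower().split()
    grouped: List[str] = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            grouped.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            grouped.append(words[i])
            i += 1
    return grouped


# Hypothetical corpus sentences.
corpus = ["the patient has diabetes mellitus",
          "diabetes mellitus is chronic",
          "insulin dependent patients"]
units = group_words("insulin dependent diabetes mellitus", frequent_bigrams(corpus))
```

With the word string grouped, a Japanese segment word can be matched against "diabetes mellitus" as one foreign-language word instead of two.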

A computer program that causes a language analysis system to execute language analysis processing, the computer program comprising:
a one-to-many word extraction step in which one-to-many word extraction means refers to a bilingual dictionary storing translation data for Japanese and a foreign language other than Japanese, extracts, from the data registered in the bilingual dictionary, one-word-to-multiple-word bilingual pairs, that is, Japanese and foreign-language pairs that are not one-word-to-one-word, generates a plurality of segment words by a division process based on morphological analysis of the Japanese included in each extracted bilingual pair, and analyzes the foreign-language word corresponding to each generated segment word or segment word string based on the data registered in the bilingual dictionary; and
a morphological analysis dictionary generation step in which morphological analysis dictionary generation means executes a process of registering, in the morphological analysis dictionary, the Japanese corresponding to each segment word or segment word string determined in the one-to-many word extraction step to correspond to a foreign-language word.
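The two steps of this claim can be sketched end to end as follows. This is an illustration under stated assumptions, not the claimed program: the greedy longest-match `segment` function is a toy stand-in for a real morphological analyzer, and `build_morpheme_dictionary`, the seed vocabulary, and the dictionary entries are hypothetical.

```python
from typing import Dict, List, Set


def segment(text: str, vocabulary: Set[str]) -> List[str]:
    """Toy greedy longest-match segmenter standing in for a morphological analyzer."""
    result: List[str] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # Fall back to a single character when no vocabulary entry matches.
            if text[i:j] in vocabulary or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result


def build_morpheme_dictionary(bilingual: Dict[str, List[str]],
                              vocabulary: Set[str]) -> Set[str]:
    """Register segment words or strings whose translation is one word of the foreign phrase."""
    registered: Set[str] = set()
    for japanese, translations in bilingual.items():
        for translation in translations:
            foreign_words = translation.lower().split()
            if len(foreign_words) < 2:
                continue  # keep only one-word-to-multiple-word pairs
            segments = segment(japanese, vocabulary)
            n = len(segments)
            for i in range(n):
                for j in range(i + 1, n + 1):
                    candidate = "".join(segments[i:j])
                    # Register the candidate if the dictionary links it to a foreign word.
                    if any(w in foreign_words for w in bilingual.get(candidate, [])):
                        registered.add(candidate)
    return registered


# Hypothetical bilingual entries and seed vocabulary.
bilingual = {"胃潰瘍": ["gastric ulcer"], "胃": ["stomach", "gastric"],
             "潰瘍": ["ulcer"], "肺炎": ["pneumonia"]}
morphemes = build_morpheme_dictionary(bilingual, {"胃", "潰瘍"})
```

The entry 肺炎 is skipped because its translation is a single word, so only the compound 胃潰瘍 contributes morphemes in this toy data.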