JP6379666B2

JP6379666B2 - Document analysis apparatus, document analysis program, and document analysis method

Info

Publication number: JP6379666B2
Application number: JP2014105221A
Authority: JP
Inventors: 育昌鄭; 友樹長瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-05-21
Filing date: 2014-05-21
Publication date: 2018-08-29
Anticipated expiration: 2034-05-21
Also published as: JP2015219861A

Description

The present invention relates to a document analysis apparatus, a document analysis program, and a document analysis method.

Conventionally, when machine-translating a language such as Chinese, in which words do not change form according to their part of speech, the ambiguity of the part of speech could prevent a noun phrase from being recognized as a single unit, resulting in an incorrect translation. To address this, a technique has recently become known that performs syntax analysis using an analyzed corpus database during machine translation (see, for example, Patent Document 1).

In Patent Document 1, a sentence is segmented into words and decomposed into morphemes based on grammatical rules and statistical methods, and the structure of the sentence is analyzed based on the morphemes. If this analysis yields multiple candidates that could be the correct syntax analysis result, the correct result is selected from among them based on their similarity to the corpora contained in the analyzed corpus database.

Patent Document 1: JP 2007-241764 A

However, because sentences to be translated have a wide variety of structures, syntax analysis based on similarity to a corpus, as described above, does not always yield a correct result.

In one aspect, an object of the present invention is to provide a document analysis apparatus, a document analysis program, and a document analysis method capable of accurately analyzing the range of a noun phrase in a sentence.

In one aspect, a document analysis apparatus includes: an assumption unit that assumes a word group appearing multiple times in a document to be a noun phrase candidate; an extraction unit that extracts, from among a plurality of correct syntaxes included in a syntax list, a syntax similar to each of a plurality of sentences containing the noun phrase candidate; and a determination unit that determines whether the assumption of the noun phrase candidate is correct based on whether the extraction unit was able to extract syntaxes similar to the plurality of sentences containing the noun phrase candidate.

In one aspect, a document analysis program causes a computer to execute processing that: assumes a word group appearing multiple times in a document to be a noun phrase candidate; extracts, from among a plurality of correct syntaxes included in a syntax list, a syntax similar to each of a plurality of sentences containing the noun phrase candidate; and determines whether the assumption of the noun phrase candidate is correct based on whether syntaxes similar to the plurality of sentences containing the noun phrase candidate could be extracted in the extracting.

In one aspect, a document analysis method executes processing in which: an assumption unit assumes a word group appearing multiple times in a document to be a noun phrase candidate; an extraction unit extracts, from among a plurality of correct syntaxes included in a syntax list, a syntax similar to each of a plurality of sentences containing the noun phrase candidate; and a determination unit determines whether the assumption of the noun phrase candidate is correct based on whether the extraction unit was able to extract syntaxes similar to the plurality of sentences containing the noun phrase candidate.

The range of a noun phrase in a sentence can be analyzed with high accuracy.

FIG. 1 is a diagram schematically showing the hardware configuration of a translation terminal according to a first embodiment.
FIG. 2 is a functional block diagram of the translation terminal.
FIG. 3 is a diagram showing an example of a parsed corpus.
FIG. 4 is a flowchart showing the processing of the analysis unit in the first embodiment.
FIG. 5 is a diagram showing an example of a document to be analyzed.
FIG. 6 is a flowchart showing the processing of step S12 in FIG. 4.
FIGS. 7(a) to 7(c) are diagrams showing an example of data temporarily stored in the noun phrase candidate temporary storage unit.
FIGS. 8(a) to 8(d) are diagrams for explaining a processing example for the noun phrase candidate character string of noun phrase candidate ID (CANDI-1).
FIG. 9 is a flowchart showing the processing of step S18 in FIG. 4.
FIGS. 10(a) to 10(c) are diagrams (part 1) for explaining a processing example for the noun phrase candidate character string of noun phrase candidate ID (CANDI-2).
FIGS. 11(a) to 11(c) are diagrams (part 2) for explaining a processing example for the noun phrase candidate character string of noun phrase candidate ID (CANDI-2).
FIG. 12 is a flowchart showing the processing of step S20 in FIG. 4.
FIGS. 13(a) and 13(b) are diagrams showing an example of a noun phrase list.
FIGS. 14(a) and 14(b) are diagrams for explaining a processing example for the noun phrase candidate character string of noun phrase candidate ID (CANDI-3).
FIG. 15 is a diagram showing a state in which the noun phrases of the document have been replaced with semantic symbols.
FIG. 16 is a functional block diagram of a translation terminal according to a second embodiment.
FIG. 17 is a flowchart showing the processing of the analysis unit in the second embodiment.
FIG. 18 is a diagram showing an example of a document to be analyzed in the second embodiment.
FIGS. 19(a) to 19(d) are diagrams for explaining steps S12 and S13' in FIG. 17.
FIG. 20 is a flowchart showing the processing of step S13' in FIG. 17.
FIGS. 21(a) and 21(b) are diagrams (part 1) for explaining step S20' in FIG. 17.
FIGS. 22(a) to 22(d) are diagrams (part 2) for explaining step S20' in FIG. 17.
FIGS. 23(a) to 23(d) are diagrams (part 3) for explaining step S20' in FIG. 17.
FIG. 24 is a diagram showing an example of a noun phrase list corresponding to step S24 in FIG. 17.
FIG. 25 is a diagram showing a modification.

<< First Embodiment >>
Hereinafter, a first embodiment of a translation terminal serving as a document analysis apparatus will be described in detail with reference to FIGS. 1 to 15. The translation terminal 10 of the first embodiment is a terminal that translates Chinese documents into Japanese (Chinese-to-Japanese translation).

FIG. 1 shows the hardware configuration of the translation terminal 10. The translation terminal 10 is, for example, a terminal such as a PC (Personal Computer) and, as shown in FIG. 1, includes a CPU (Central Processing Unit) 90, a ROM (Read Only Memory) 92, a RAM (Random Access Memory) 94, a storage unit (here, an HDD (Hard Disk Drive)) 96, a network interface 97, a display unit 93, an input unit 95, a portable storage medium drive 99, and the like. These components of the translation terminal 10 are connected to a bus 98. The display unit 93 includes a liquid crystal display or the like, and the input unit 95 includes a keyboard, a mouse, and the like. In the translation terminal 10, the functions shown in FIG. 2 are realized when the CPU 90 executes a program (including the document analysis program) stored in the ROM 92 or the HDD 96, or a program (including the document analysis program) read by the portable storage medium drive 99 from a portable storage medium 91. FIG. 2 also shows a parsed corpus 40, which serves as the syntax list and is stored in the HDD 96 or the like of the translation terminal 10.

FIG. 2 shows a functional block diagram of the translation terminal 10. In the translation terminal 10, the CPU 90 executes the program to realize the functions of an analysis unit 50 and a translation unit 52, as shown in FIG. 2. The analysis unit 50 performs syntax analysis on each sentence in the input Chinese document to be translated (for example, a patent document as shown in FIG. 5) and outputs the syntax analysis results to the translation unit 52. The translation unit 52 translates each sentence into Japanese based on its syntax analysis result.

The analysis unit 50 includes an input receiving unit 20, a noun phrase candidate extraction unit 22 serving as the assumption unit, a noun phrase replacement unit 24, a noun phrase candidate temporary storage unit 26, a consistency calculation unit 29, a syntax analysis unit 30, and an output unit 34.

The input receiving unit 20 accepts input of the Chinese document to be translated, that is, the document subject to syntax analysis, and sends it to the noun phrase candidate extraction unit 22. The noun phrase candidate extraction unit 22 assumes that a word group that contains a plurality of words and appears multiple times in the document is a noun phrase candidate character string, and extracts it. The word groups detected as noun phrase candidate character strings are temporarily stored in the noun phrase candidate temporary storage unit 26. As an example, the noun phrase candidate temporary storage unit 26 temporarily stores data such as that shown in FIGS. 7(a) to 7(c).

Based on the data temporarily stored in the noun phrase candidate temporary storage unit 26, the noun phrase replacement unit 24 replaces each noun phrase candidate character string with a provisional noun ([**]) expressed as a provisional semantic symbol. The provisional noun [**] is a noun that can stand for any meaning.

The consistency calculation unit 29 extracts, from among the plurality of correct syntaxes included in the parsed corpus 40, a syntax similar to each of the plurality of sentences containing the noun phrase candidate character string. Based on the extraction result, the consistency calculation unit 29 also determines whether the noun phrase candidate character string is a correct noun phrase. The parsed corpus 40 stores a plurality of examples of correct (already parsed) syntax and has the data structure shown in FIG. 3. Specifically, the parsed corpus 40 has the fields "case ID" and "syntax structure with noun phrases replaced by semantic symbols". The latter field stores the tree structure of the parsed syntax in a notation such as ((A)(B))(C), where A to C are words. Expressions such as [method] and [soil] in the syntax structure of case ID = C1001 are symbols (also called semantic symbols) that collectively represent the words (synonyms and near-synonyms) meaning "method" or "soil".
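As a rough illustration of the two-field corpus record described above, the following Python sketch models one entry and pulls out its semantic symbols. The class name `CorpusCase`, the helper `semantic_symbols`, and the sample structure string are all hypothetical; the actual contents of FIG. 3 are not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class CorpusCase:
    """One entry of the parsed corpus (field names are illustrative)."""
    case_id: str    # "case ID" field, e.g. "C1001"
    structure: str  # bracketed tree with noun phrases replaced by semantic symbols


# A semantic symbol is written in square brackets, e.g. [method].
SYMBOL = re.compile(r"\[([^\[\]]+)\]")


def semantic_symbols(case: CorpusCase) -> list[str]:
    """Return the semantic symbols in a case, in order of appearance."""
    return SYMBOL.findall(case.structure)


# Placeholder entry: A and B stand in for ordinary words.
example = CorpusCase("C1001", "(([method]) (A)) (([soil]) (B))")
print(semantic_symbols(example))
```

Storing the tree as a bracketed string mirrors the ((A)(B))(C) notation in the text; a real system would more likely parse it into a nested structure before comparing trees.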

The syntax analysis unit 30 performs syntax analysis based on the determination result of the consistency calculation unit 29. The syntax analysis result is sent to the translation unit 52 via the output unit 34.

Next, the processing executed in the translation terminal 10 of the first embodiment will be described in detail following the flowcharts of FIGS. 4, 6, 9, and 12, with reference to the other drawings as appropriate.

FIG. 4 is a flowchart outlining the processing executed by the analysis unit 50 of the translation terminal 10.

In the processing of FIG. 4, first, in step S10, the input receiving unit 20 waits until a Chinese document is input. When a Chinese document is input, the process proceeds to step S12, and the noun phrase candidate extraction unit 22 executes the subroutine for noun phrase candidate character string extraction. In the first embodiment, it is assumed that a patent document such as that shown in FIG. 5 has been input to the input receiving unit 20. In FIG. 5, for convenience of explanation, each sentence in the patent document is labeled with an original sentence ID (Q301, Q302, ...).

In step S12, the processing follows the flowchart shown in FIG. 6. In the processing of FIG. 6, in step S40, the noun phrase candidate extraction unit 22 extracts the text of the document sentence by sentence. Here, one sentence is extracted for each original sentence ID shown in FIG. 5.

Next, in step S42, the noun phrase candidate extraction unit 22 extracts a sentence pair. For example, the noun phrase candidate extraction unit 22 extracts the pair of sentences with original sentence IDs Q301 and Q302.

Next, in step S44, the noun phrase candidate extraction unit 22 detects partially matching character strings in the sentence pair. That is, it compares the two sentences extracted in step S42 and detects the character string portions they have in common.

Next, in step S46, when a partially matching character string has been detected, the noun phrase candidate extraction unit 22 determines whether that string has already been detected previously. If the determination is negative, the process proceeds to step S48, and the noun phrase candidate extraction unit 22 stores the string in the noun phrase candidate temporary storage unit 26. In this case, the noun phrase candidate temporary storage unit 26 holds data structured, for example, as shown in FIGS. 7(a) to 7(c). The data in FIGS. 7(a) to 7(c) has the items "noun phrase candidate ID", "noun phrase candidate character string", and "appearing original sentence ID". The "noun phrase candidate ID" item stores identification information for the noun phrase candidate, such as CANDI-1, CANDI-2, and so on. The "noun phrase candidate character string" item stores the partially matching character string detected in step S44. The "appearing original sentence ID" item stores the original sentence IDs of the sentences in which the partially matching character string was detected.

After step S48, the noun phrase candidate extraction unit 22 proceeds to step S50. If, on the other hand, the determination in step S46 is affirmative, then in step S49 the noun phrase candidate extraction unit 22 stores the original sentence IDs of the sentence pair extracted in step S42 in the "appearing original sentence ID" field of the corresponding data in the noun phrase candidate temporary storage unit 26. The process then proceeds to step S50.

In step S50, the noun phrase candidate extraction unit 22 determines whether all sentence pairs have been extracted. If the determination is negative, the noun phrase candidate extraction unit 22 returns to step S42 and repeats the processing and determinations of steps S42 to S50 described above. For example, when the pair of sentences with original sentence IDs Q302 and Q304 is extracted, the partially matching character strings shown as noun phrase candidate character strings in FIGS. 7(a) to 7(c) are detected and stored in the noun phrase candidate temporary storage unit 26. Likewise, when the pair Q304 and Q305 is extracted, the partially matching character strings shown in FIGS. 7(b) and 7(c) are detected, and when the pair Q306 and Q307 is extracted, the partially matching character string shown in FIG. 7(c) is detected.

Once all sentence pairs have been extracted and the determination in step S50 is affirmative, the process proceeds to step S52. In step S52, the noun phrase candidate extraction unit 22 sorts the data stored in the noun phrase candidate temporary storage unit 26 in descending order of the number of characters in the noun phrase candidate character string. When the processing of FIG. 6 ends in this way, the process proceeds to step S14 of FIG. 4.
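The candidate extraction of steps S40 to S52 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the patent does not specify how partially matching character strings are found, so the use of `difflib.SequenceMatcher`, the `min_len` threshold, and the function name `extract_candidates` are illustrative choices, and the sample strings are toy data rather than the sentences of FIG. 5.

```python
from difflib import SequenceMatcher
from itertools import combinations


def extract_candidates(sentences, min_len=2):
    """sentences: dict mapping original sentence ID -> sentence text (S40)."""
    found = {}  # candidate string -> set of sentence IDs in which it appears
    # S42 / S50: visit every sentence pair once.
    for (id_a, text_a), (id_b, text_b) in combinations(sentences.items(), 2):
        matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
        # S44: detect the character strings the two sentences share.
        for block in matcher.get_matching_blocks():
            if block.size < min_len:  # assumed threshold, not from the patent
                continue
            shared = text_a[block.a:block.a + block.size]
            # S46-S49: create a new record, or add IDs to an existing one.
            found.setdefault(shared, set()).update((id_a, id_b))
    # S52: sort by number of characters, longest first.
    ordered = sorted(found.items(), key=lambda kv: len(kv[0]), reverse=True)
    return [
        {"id": f"CANDI-{i}", "string": s, "sentence_ids": sorted(ids)}
        for i, (s, ids) in enumerate(ordered, start=1)
    ]


toy = {"Q1": "abcdXY", "Q2": "ZZabcd", "Q3": "XYq"}
for record in extract_candidates(toy):
    print(record)
```

Sorting longest-first means the later steps test the most specific candidate before its shorter substrings, which matches the order in which step S14 picks candidates.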

In step S14 of FIG. 4, the noun phrase replacement unit 24 selects one noun phrase candidate character string. Specifically, from the data temporarily stored in the noun phrase candidate temporary storage unit 26, it selects the candidate character string with the largest number of characters. Here, it is assumed that the data shown in FIG. 8(a) (the noun phrase candidate character string of noun phrase candidate ID = CANDI-1) has been selected.

Next, in step S16, the noun phrase replacement unit 24 replaces the noun phrase candidate character string in the document with the provisional noun [**]. Specifically, as shown in FIG. 8(b), the noun phrase replacement unit 24 uses the appearing original sentence IDs in the data of FIG. 8(a) (Q302 and Q304) to extract from the document the sentences that contain the noun phrase candidate character string selected in step S14. Then, as shown in FIG. 8(c), it replaces the noun phrase candidate character string in each extracted sentence with the provisional noun (provisional semantic symbol) [**].
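The substitution in step S16 amounts to a lookup by original sentence ID followed by a string replacement. The sketch below assumes the candidate record layout (`string`, `sentence_ids`) and the function name `replace_candidate`, neither of which comes from the patent, and uses placeholder English text in place of the Chinese sentences of FIG. 8.

```python
PROVISIONAL_NOUN = "[**]"


def replace_candidate(sentences, candidate):
    """Return the sentences that contain the candidate, with the candidate
    string replaced by the provisional noun (step S16).

    sentences: dict mapping original sentence ID -> text
    candidate: dict with "string" and "sentence_ids" keys (assumed layout)
    """
    return {
        sid: sentences[sid].replace(candidate["string"], PROVISIONAL_NOUN)
        for sid in candidate["sentence_ids"]
    }


doc = {"Q302": "use abc here", "Q303": "nothing", "Q304": "abc again"}
cand = {"id": "CANDI-1", "string": "abc", "sentence_ids": ["Q302", "Q304"]}
print(replace_candidate(doc, cand))
```

Only the sentences listed under the candidate's appearing original sentence IDs are returned; this set corresponds to the sentence set Z that the following steps parse.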

Next, in step S18, the consistency calculation unit 29 executes the subroutine that extracts semantic symbol candidates for the noun phrase candidate character string. The processing of step S18 follows the flowchart of FIG. 9.

In the processing of FIG. 9, first, in step S60, the consistency calculation unit 29 acquires the sentence set Z (number of sentences kmax) in which the noun phrase candidate character string has been replaced with the provisional noun. Here, it is assumed that the consistency calculation unit 29 has acquired the two sentences shown in FIG. 8(c) (kmax = 2).

Next, in step S62, the consistency calculation unit 29 sets the parameter k, which indicates the number of sentences processed, to 1. Then, in step S64, the consistency calculation unit 29 performs syntax analysis on the sentence zk (= z1) in the sentence set Z and obtains a syntax analysis result. Here, for example, it is assumed that the sentence with original sentence ID = Q302 in FIG. 8(c) is parsed and the syntax analysis result shown in FIG. 8(d) is obtained.

Next, in step S66, the consistency calculation unit 29 searches the parsed corpus 40 for parsed cases similar to the syntax analysis result. It is assumed that no similar parsed case exists for the sentence with original sentence ID = Q302 in FIG. 8(d).

Next, in step S68, the consistency calculation unit 29 determines whether the search in step S66 found a similar parsed case. If the determination is negative, the process proceeds to step S74, and the consistency calculation unit 29 determines whether k equals kmax (= 2). If that determination is negative, the process proceeds to step S76, where the consistency calculation unit 29 increments k by 1 (k ← k + 1) and returns to step S64.

Returning to step S64, the consistency calculation unit 29 parses the next sentence z2, the sentence with original sentence ID = Q304 in FIG. 8(c), and obtains a syntax analysis result. Here, it is assumed that the syntax analysis result for original sentence ID = Q304 shown in FIG. 8(d) is obtained. Next, in step S66, the consistency calculation unit 29 searches the parsed corpus 40 for parsed cases similar to the syntax analysis result. It is assumed that no similar parsed case exists. The determination in the next step S68 is therefore negative, and the process proceeds to step S74. In step S74, the consistency calculation unit 29 determines whether k equals kmax (= 2). Because this determination is affirmative, the entire processing of FIG. 9 ends and the process proceeds to step S19 of FIG. 4.

In step S19 of FIG. 4, the consistency calculation unit 29 determines whether any semantic symbol candidate was extracted in the processing of step S18. In other words, it determines whether step S72 was executed during the immediately preceding step S18. If the determination in step S19 is negative, the process proceeds to step S26. A negative determination in step S19 means that the portion extracted as a noun phrase candidate character string was not a correct noun phrase.

In step S26, the consistency calculation unit 29 determines whether all noun phrase candidates have been selected. If the determination in step S26 is negative, the process returns to step S14, and the noun phrase replacement unit 24 selects the noun phrase candidate character string with the second largest number of characters. Here, as shown in FIG. 10(a), it is assumed that the noun phrase candidate character string of noun phrase candidate ID = CANDI-2 has been selected.

Next, in step S16, the noun phrase replacement unit 24 replaces the noun phrase candidate character string in the document with the provisional noun [**]. Specifically, as shown in FIG. 10(b), the noun phrase replacement unit 24 uses the appearing original sentence IDs in the data of FIG. 10(a) (Q302, Q304, Q305, and Q306) to extract from the document the sentences that contain the noun phrase candidate character string. It then replaces the noun phrase candidate character string in each extracted sentence with the provisional noun [**].

Next, in step S18, the consistency calculation unit 29 executes the processing that extracts semantic symbol candidates for the noun phrase candidate character string (FIG. 9).

In the processing of FIG. 9, in step S60, the consistency calculation unit 29 acquires the sentence set Z (kmax = 4) in which the noun phrase candidate character string has been replaced with the provisional noun.

Next, in step S62, the consistency calculation unit 29 sets the parameter k, which identifies the sentence being processed, to 1. Then, in step S64, the consistency calculation unit 29 performs syntax analysis on the sentence zk (= z1) in the sentence set Z and obtains a syntax analysis result. Here, for example, it is assumed that the sentence with original sentence ID = Q302 in FIG. 10(b), with its underlined portion replaced by the provisional noun, is parsed and the syntax analysis result shown in FIG. 10(c) is obtained.

Next, in step S66, the consistency calculation unit 29 searches the parsed corpus 40 for parsed cases similar to the syntax analysis result. For original sentence ID = Q302 in FIG. 10(c), it is assumed that two similar parsed cases are found: case IDs C1001 and C1002 shown in FIG. 3.

Next, in step S68, the consistency calculation unit 29 determines whether the search in step S66 found a similar parsed case. If the determination is affirmative, the process proceeds to step S72, where the consistency calculation unit 29 identifies semantic symbol candidates for the noun phrase candidate character string from the retrieved parsed cases and stores them in a semantic candidate list. In this case, as shown in FIG. 11(a), the semantic symbol corresponding to the sentence's provisional noun [**] is [core] in the retrieved case C1001 and [bacteria] in the retrieved case C1002. The consistency calculation unit 29 therefore identifies these two semantic symbols as semantic symbol candidates and stores them in the semantic candidate list.

Next, in step S74, the consistency calculation unit 29 determines whether k equals kmax (= 4). If the determination is negative, the process proceeds to step S76, where the consistency calculation unit 29 increments k by 1 (k ← k + 1) and returns to step S64.

Returning to step S64, the consistency calculation unit 29 parses the next sentence z2, the sentence with original sentence ID = Q304 in FIG. 10(b) after replacement with the provisional noun [**], and obtains a syntax analysis result. Here, it is assumed that the syntax analysis result for original sentence ID = Q304 shown in FIG. 10(c) is obtained. Next, in step S66, the consistency calculation unit 29 searches the parsed corpus 40 for parsed cases similar to the syntax analysis result. Here, it is assumed that the case with case ID = C1004 is found as a similar parsed case. The determination in step S68 is then affirmative, and in step S72 the consistency calculation unit 29 identifies the semantic symbol [bacteria] as a semantic symbol candidate and stores it in the semantic candidate list. After that, the determination in step S74 is negative, so the process proceeds to step S76, where the consistency calculation unit 29 increments k by 1 (k ← k + 1) and returns to step S64.

Thereafter, the processing and determinations of steps S64 to S76 are repeated, and the consistency calculation unit 29 also parses the remaining two sentences in FIG. 10(b). In the first embodiment, it is assumed that the semantic symbol [bacteria] is identified as a semantic symbol candidate for the sentence with original sentence ID = Q305, and that no semantic symbol candidate is identified for the sentence with original sentence ID = Q306.

Thereafter, when the determination in step S74 is affirmative, the process proceeds to step S19 in FIG. 4.
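The loop of steps S64 to S76 can be sketched as follows. This is only a minimal illustration, not the apparatus itself: the dictionary `SIMILAR_CASE_SYMBOLS` is a hypothetical stand-in for parsing a sentence and searching the parsed corpus 40, and the function name `extract_candidates` is likewise an assumption. The symbols per sentence follow the worked example above.

```python
# Hypothetical stand-in for steps S64-S66 (parsing plus corpus search).
# The symbols per sentence follow the worked example in the text.
SIMILAR_CASE_SYMBOLS = {
    "Q302": ["core", "bacteria"],  # cases C1001 and C1002
    "Q304": ["bacteria"],          # case C1004
    "Q305": ["bacteria"],
    "Q306": [],                    # no similar case was found
}


def extract_candidates(sentence_ids, similar_case_symbols):
    """Build one semantic candidate list per sentence (steps S64-S76)."""
    candidate_lists = {}
    for sentence_id in sentence_ids:  # k = 1 .. kmax
        symbols = similar_case_symbols.get(sentence_id, [])
        # Step S68: candidates are stored (step S72) only if a case was found.
        candidate_lists[sentence_id] = list(symbols) if symbols else []
    return candidate_lists


candidate_lists = extract_candidates(
    ["Q302", "Q304", "Q305", "Q306"], SIMILAR_CASE_SYMBOLS
)
# Step S19: was at least one semantic symbol candidate extracted?
any_extracted = any(candidate_lists.values())
```

Keeping one list per sentence, rather than one merged list, is what later allows each candidate to be counted sentence by sentence.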

In step S19, the consistency calculation unit 29 determines whether a semantic symbol candidate was extracted in the processing of step S18. That is, the consistency calculation unit 29 determines whether the processing of step S72 was executed during the immediately preceding step S18. If the determination in step S19 is affirmative, the consistency calculation unit 29 proceeds to step S20.

In step S20, the consistency calculation unit 29 executes a subroutine that evaluates consistency for the case where the noun phrase candidate character string is taken to be each semantic symbol candidate. Specifically, step S20 executes the processing shown in the flowchart of FIG. 12.

In the processing of FIG. 12, first, in step S80, the consistency calculation unit 29 sets the parameter k, which counts the sentences processed, to 1; the parameter c, which counts the semantic symbol candidates identified, to 1; and the parameter n, which is used for the consistency evaluation, to 0. Here, the maximum value kmax of the parameter k is 4, and the maximum value cmax of the parameter c is 2.

Next, in step S82, the consistency calculation unit 29 identifies a semantic symbol candidate Tc (= T1). Here, it is assumed that [core] is identified as the semantic symbol candidate Tc. Next, in step S84, the consistency calculation unit 29 determines whether the semantic symbol candidate Tc (= T1) exists in the semantic candidate list of the sentence zk (= z1). As shown in FIG. 11(b), the sentence with original sentence ID = Q302 has a matching case (C1001) when the semantic symbol candidate Tc is [core], so the determination in step S84 is affirmative and the process proceeds to step S86.

In step S86, the consistency calculation unit 29 increments n by 1 (n ← n + 1) and proceeds to step S88. In step S88, the consistency calculation unit 29 determines whether k equals the maximum value (kmax). If the determination is negative, k is incremented by 1 (k ← k + 1) in step S90, and the process returns to step S82. Thereafter, the consistency calculation unit 29 determines, for every sentence, whether the semantic symbol candidate T1 exists in its semantic candidate list, and proceeds to step S92 once k reaches the maximum value (kmax).

In step S92, the consistency calculation unit 29 calculates the consistency evaluation value for the case where the provisional noun [**] is the semantic symbol candidate Tc (T1). Specifically, the consistency evaluation value is calculated by the following equation (1), where n is the number of sentences whose semantic candidate list contains Tc and kmax is the number of sentences processed.
Consistency evaluation value = n / kmax ... (1)

In the case of FIG. 11(b), n = 1, so the consistency evaluation value is 1/4 = 0.25.

Next, in step S94, the consistency calculation unit 29 determines whether c equals its maximum value (cmax = 2). If the determination is negative, the process proceeds to step S96, where the consistency calculation unit 29 resets k to 1 and increments c by 1 (c ← c + 1), and then returns to step S82.

After returning to step S82, the processing and determinations of steps S82 to S92 are executed as described above. If [bacteria] is extracted as the semantic symbol candidate T2, then, as shown in FIG. 11(c), the number n of sentences whose semantic candidate list contains the semantic symbol candidate T2 is 3. Therefore, in step S92, the consistency calculation unit 29 calculates 3/4 = 0.75 as the consistency evaluation value.
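The arithmetic of equation (1) for the two candidates can be reproduced with a short sketch. The function name `consistency_value` and the dictionary layout are assumptions made for illustration; the per-sentence candidate lists are those of the worked example (Q302, Q304, Q305, Q306).

```python
def consistency_value(symbol, candidate_lists):
    """Equation (1): n / kmax for one semantic symbol candidate."""
    kmax = len(candidate_lists)  # number of sentences processed
    # n counts the sentences whose semantic candidate list contains the symbol.
    n = sum(1 for symbols in candidate_lists.values() if symbol in symbols)
    return n / kmax


# Semantic candidate lists from the worked example (assumed data layout).
candidate_lists = {
    "Q302": ["core", "bacteria"],
    "Q304": ["bacteria"],
    "Q305": ["bacteria"],
    "Q306": [],
}

scores = {
    symbol: consistency_value(symbol, candidate_lists)
    for symbol in ("core", "bacteria")
}
print(scores)  # {'core': 0.25, 'bacteria': 0.75}
```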

Thereafter, when the determination in step S94 is affirmative, the consistency calculation unit 29 ends the processing of FIG. 12 and proceeds to step S22 in FIG. 4. In step S22, the consistency calculation unit 29 determines whether there is a semantic symbol candidate whose consistency evaluation value is the maximum and exceeds a threshold. For example, if the threshold is 0.5, then in the example of FIGS. 11(b) and 11(c) the maximum consistency evaluation value, 0.75, is larger than the threshold, so the determination in step S22 is affirmative and the process proceeds to step S24. In step S24, the consistency calculation unit 29 registers the detected noun phrase and its semantic symbol candidate in the noun phrase list. FIG. 13(a) shows an example of the noun phrase list. In the noun phrase list, the noun phrase candidate character string identified in step S14 is stored in the "extracted noun phrase" column, and the semantic symbol candidate identified in step S22 is stored in the "semantic symbol" column. The process then proceeds to step S26. An affirmative determination in step S22 means that the portion extracted as a noun phrase candidate character string was a correct noun phrase. If, on the other hand, the determination in step S22 is negative, the process proceeds to step S26 without passing through step S24. A negative determination in step S22 means that the portion extracted on the assumption that it was a noun phrase candidate character string was not a correct noun phrase.
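The decision of steps S22 and S24 can be sketched as below. The function name `select_symbol` is hypothetical, the threshold 0.5 is the example value from the text, and the handling of ties (the first maximum wins) is an assumption, since the description does not say how ties are resolved.

```python
def select_symbol(scores, threshold=0.5):
    """Step S22: return the best candidate if it exceeds the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)  # ties: first maximum (assumption)
    return best if scores[best] > threshold else None


noun_phrase_list = {}
chosen = select_symbol({"core": 0.25, "bacteria": 0.75})
if chosen is not None:
    # Step S24: register the noun phrase with its semantic symbol.
    # The key is a placeholder for the actual candidate character string.
    noun_phrase_list["<noun phrase candidate>"] = chosen
```

Returning `None` corresponds to the negative branch, in which the candidate character string is not treated as a correct noun phrase and nothing is registered.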

In step S26, the consistency calculation unit 29 determines whether all noun phrase candidate character strings have been identified. If the determination in step S26 is negative, the process returns to step S14, and the processing and determinations of steps S14 to S26 are executed for the next noun phrase candidate character string. For example, as shown in FIG. 14(a), assume that the noun phrase candidate character string with noun phrase candidate ID = CANDI-3 is identified (S14). In this case, as shown in FIG. 14(b), among the original sentence IDs in which CANDI-3 appears, the sentences whose IDs are not among those in which CANDI-2 appears (original sentence IDs = Q307, Q308) have their noun phrase candidate character string converted into the provisional noun [**] and are processed as before, and [bacteria] is identified as the semantic symbol of that noun phrase candidate character string. As a result, the second entry shown in FIG. 13(b) is added to the noun phrase list.

Thereafter, when the determination in step S26 is affirmative, the process proceeds to step S28, where the syntax analysis unit 30 replaces noun phrases with semantic symbols based on the noun phrase list and analyzes the document using the parsed corpus 40. Because the syntax analysis unit 30 can thus analyze sentences in which noun phrases are properly delimited and replaced with appropriate semantic symbols, a highly accurate syntax analysis result can be obtained. The syntax analysis unit 30 then proceeds to step S30 and outputs the analysis result to the translation unit 52 via the output unit 34.
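The substitution performed at the start of step S28 can be pictured with the following sketch. It assumes, purely for illustration, that the noun phrase list is a mapping from noun phrase to semantic symbol kept in registration order; the function name and the placeholder phrases `NP-LONG-A` and `NP-LONG` are hypothetical, and the subsequent parsing against the parsed corpus 40 is not shown.

```python
def apply_noun_phrase_list(sentence, noun_phrase_list):
    """Replace each registered noun phrase with its bracketed semantic symbol."""
    # Entries are visited in registration order (dicts preserve insertion order).
    for phrase, symbol in noun_phrase_list.items():
        sentence = sentence.replace(phrase, "[" + symbol + "]")
    return sentence


# Placeholder noun phrases; real entries would be Chinese character strings.
noun_phrase_list = {"NP-LONG-A": "bacteria", "NP-LONG": "core"}
rewritten = apply_noun_phrase_list("x NP-LONG-A y NP-LONG z", noun_phrase_list)
print(rewritten)  # x [bacteria] y [core] z
```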

The translation unit 52 can translate the document using the highly accurate syntax analysis result, which makes it possible to obtain a highly accurate translation.

As can be seen from the above description, in the processing of steps S14 to S22 in FIG. 4, syntax cases similar to each of a plurality of sentences containing the noun phrase candidate character string are extracted from the syntax cases included in the parsed corpus 40 (S66; see FIG. 11(a)), and in steps S18, S20, and S22 it is determined, based on the extraction results, whether the noun phrase candidate character string was extracted correctly, and the meaning of the noun phrase candidate character string is identified. That is, the consistency calculation unit 29 of this embodiment implements the functions of an extraction unit that extracts, from among a plurality of correct syntaxes included in the parsed corpus 40, syntaxes similar to each of a plurality of sentences containing the noun phrase candidate character string, and of a determination unit that determines, based on the extraction results of the extraction unit, whether the noun phrase candidate character string is a correct noun phrase. The consistency calculation unit 29 of this embodiment also implements the function of a specifying unit that identifies the meaning of the noun phrase candidate character string based on the syntaxes extracted by the extraction unit.

As described in detail above, according to the first embodiment, in the analysis unit 50 the noun phrase candidate extraction unit 22 extracts word groups that appear multiple times in the document as noun phrase candidate character strings (S12). The consistency calculation unit 29 then extracts, from the syntax cases included in the parsed corpus 40, syntax cases similar to each of a plurality of sentences containing a noun phrase candidate character string, and determines, based on the syntax cases extracted for each of the sentences, whether the noun phrase candidate character string was extracted correctly (S14 to S22). Because a noun phrase candidate character string is hypothesized using a plurality of sentences and the hypothesis is then tested, it is possible to determine accurately, across the sentences as a whole, whether the noun phrase candidate character string is a correct noun phrase. This is more accurate than, for example, hypothesizing a noun phrase candidate character string in a single sentence and comparing it with the syntax cases of the parsed corpus 40. As a result, a highly accurate translation can be obtained for Chinese, which has no inflection by part of speech and is therefore prone to mistranslation caused by part-of-speech ambiguity.

In the first embodiment, noun phrase candidate character strings with more characters are processed first (S14). This avoids the situation that arises when a shorter noun phrase candidate character string is processed first, namely that a longer noun phrase is split apart, and thus enables highly accurate syntax analysis.
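Why the longest candidate should come first can be seen in a small sketch. It simplifies the embodiment considerably: all candidates are replaced in one pass without the consistency check, and the strings `ABCD` and `BC` are hypothetical placeholders for noun phrase candidate character strings, one of which contains the other.

```python
def processing_order(candidates):
    """Return candidate strings longest first (ties keep their input order)."""
    return sorted(candidates, key=len, reverse=True)


def replace_in_order(sentence, ordered_candidates):
    """Replace each candidate with the provisional noun, in the given order."""
    for phrase in ordered_candidates:
        sentence = sentence.replace(phrase, "[**]")
    return sentence


candidates = ["BC", "ABCD"]  # hypothetical: "BC" is contained in "ABCD"
short_first = replace_in_order("xABCDy", candidates)
long_first = replace_in_order("xABCDy", processing_order(candidates))
print(short_first)  # xA[**]Dy  (the longer phrase has been split)
print(long_first)   # x[**]y
```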

In the first embodiment, the consistency calculation unit 29 also identifies the meaning of the noun phrase candidate character string based on the syntaxes extracted for each of the plurality of sentences (it determines the semantic symbol and registers it in the noun phrase list). Because syntax analysis can then take the meaning of the noun phrase into account, highly accurate syntax analysis becomes possible.

In the first embodiment, the translation unit 52 translates the document based on the analysis result of the analysis unit 50, so a highly accurate translation can be obtained through translation based on highly accurate syntax analysis.

<<Second Embodiment>>
The second embodiment is described in detail below with reference to FIGS. 16 to 24. The second embodiment covers the case where a single sentence contains a plurality of noun phrase candidate character strings. The device configuration of the translation terminal 10 is the same as in the first embodiment, except that, as shown in FIG. 16, the analysis unit 50 also functions as a noun phrase set temporary storage unit 25.

FIG. 17 is a flowchart outlining the processing of the analysis unit 50 in the second embodiment. In FIG. 17, steps that differ from the first embodiment are marked with a prime (') after the step number.

In the processing of FIG. 17, when a document to be translated (analyzed) is input, the noun phrase candidate extraction unit 22 executes the noun phrase candidate extraction processing in step S12 in the same manner as in the first embodiment (see FIG. 6). In the second embodiment, it is assumed that a patent document such as that shown in FIG. 18 is input and that, in step S12, the underlined portions in FIG. 19(a) are extracted as noun phrase candidate character strings. In this case, it is assumed that the data of FIGS. 19(b) and 19(c) are temporarily stored in the noun phrase candidate temporary storage unit 26.

Next, in step S13' of FIG. 17, the noun phrase candidate extraction unit 22 executes a subroutine for noun phrase set candidate extraction. Specifically, the processing shown in FIG. 20 is executed in step S13'.

In the processing of FIG. 20, in step S102, the noun phrase candidate extraction unit 22 sets the noun phrase candidate parameter i to 1 and the sentence parameter k to 1. Next, in step S104, the noun phrase candidate extraction unit 22 identifies the noun phrase candidate Ni (= N1). Here, it is assumed that the noun phrase candidate character string with noun phrase candidate ID = CANDI-1 is identified. The noun phrase candidate extraction unit 22 then checks whether the identified noun phrase candidate exists in the sentence zk (= z1). As an example, the sentence z1 is assumed to be the sentence with original sentence ID = Q302 in FIG. 19(a).

Next, in step S106, the noun phrase candidate extraction unit 22 determines whether i equals its maximum value (imax = 2). If the determination is negative, the noun phrase candidate extraction unit 22 increments i by 1 (i ← i + 1) in step S108 and returns to step S104.

On returning to step S104, the noun phrase candidate extraction unit 22 identifies the noun phrase candidate character string with noun phrase candidate ID = CANDI-2 as the noun phrase candidate N2 and checks whether that character string exists in the sentence z1. When the determination in step S106 then becomes affirmative, the process proceeds to step S110, where the noun phrase candidate extraction unit 22 determines whether at least one noun phrase candidate character string was present in the sentence zk (= z1).

If the determination in step S110 is affirmative, the process proceeds to step S112, where the noun phrase candidate extraction unit 22 stores information on the noun phrase candidate character strings present in the sentence zk in the noun phrase set temporary storage unit 25 as noun phrase set information. FIG. 19(d) shows an example of the data temporarily stored by the noun phrase set temporary storage unit 25. As shown in FIG. 19(d), the data temporarily stored in the noun phrase set temporary storage unit 25 includes the items "noun phrase set ID" and "noun phrase set". The sentence zk (ID = Q302) contains both noun phrase candidates, CANDI-1 and CANDI-2, so in step S112 the information shown for noun phrase set ID = 1 is stored as a noun phrase set. The noun phrase candidate extraction unit 22 does not store information that is already held in the noun phrase set temporary storage unit 25 a second time.

After step S112, or when the determination in step S110 is negative, the process proceeds to step S114. In step S114, the noun phrase candidate extraction unit 22 determines whether k equals its maximum value (kmax). If the determination is negative, the noun phrase candidate extraction unit 22 resets i to 1 and increments k by 1 in step S116, and then returns to step S102. Thereafter, the processing from step S104 onward is executed until the determination in step S114 becomes affirmative. Once data such as that shown in FIG. 19(d) has been stored in the noun phrase set temporary storage unit 25 and the determination in step S114 is affirmative, the processing of FIG. 20 ends. The process then proceeds to step S14' in FIG. 17.
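The double loop of FIG. 20 (over sentences k and candidates i) can be sketched as below. The function name, the placeholder candidate strings `X1` and `Y2`, and the example sentences are all hypothetical; only the control flow (check every candidate against every sentence, store each distinct set once) follows the description.

```python
def extract_noun_phrase_sets(sentences, candidates):
    """Collect the distinct sets of candidates that occur together in a sentence."""
    stored_sets = []
    for sentence in sentences:  # loop over k (steps S114/S116)
        # Loop over i (steps S104-S108): which candidates occur in this sentence?
        present = tuple(
            cand_id for cand_id, text in candidates.items() if text in sentence
        )
        # Step S110: at least one candidate; step S112: do not store duplicates.
        if present and present not in stored_sets:
            stored_sets.append(present)
    return stored_sets


# Hypothetical placeholder strings standing in for the candidate character strings.
candidates = {"CANDI-1": "X1", "CANDI-2": "Y2"}
sentences = ["aa X1 bb Y2", "X1 cc", "dd Y2", "Y2 ee X1", "ff"]
noun_phrase_sets = extract_noun_phrase_sets(sentences, candidates)
print(noun_phrase_sets)
# [('CANDI-1', 'CANDI-2'), ('CANDI-1',), ('CANDI-2',)]
```

Building each set in the fixed order of the candidate table means that two sentences containing the same candidates in a different order yield the same set, so the duplicate check works.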

In step S14' of FIG. 17, the noun phrase replacement unit 24 identifies one noun phrase set candidate from FIG. 19(d). For example, it is assumed that the noun phrase replacement unit 24 identifies the noun phrase set candidate with noun phrase set ID = 1.

Next, in step S16', the noun phrase replacement unit 24 replaces the noun phrase candidate character strings in the document with provisional nouns. In this case, the noun phrase candidate character strings included in the noun phrase set candidate identified in step S14' are replaced with the provisional nouns [**1] and [**2], as shown in FIG. 21(a).

Next, in step S18', the consistency calculation unit 29 executes the processing that extracts semantic symbol candidates for the noun phrase candidate character strings. In step S18', each sentence in FIG. 21(a) is parsed as shown in FIG. 21(b), and the parsed corpus 40 is searched for syntax structure cases similar to each resulting syntax structure. For example, assume that the sentence (Q302) shown in FIG. 22(a) is similar to the syntax structure cases (C1001, C1002) shown in FIG. 22(b). In this case, as shown in FIG. 22(c), the semantic symbol candidates [core] and [bacteria] are stored in the semantic candidate list of [**1], and the semantic symbol candidates [soil] and [metal] are stored in the semantic candidate list of [**2]. The other sentences are parsed in the same way, but it is assumed here that the semantic candidate lists remain as shown in FIG. 22(c).

Next, in step S19, the consistency calculation unit 29 determines whether a semantic symbol candidate was extracted. If the determination is negative, the process proceeds to step S26'; if it is affirmative, the process proceeds to step S20'.

In step S20', the consistency calculation unit 29 executes processing that evaluates consistency for the case where the noun phrase set candidate is taken to be each combination of semantic symbol candidates. Here, as shown in FIG. 22(d), the consistency calculation unit 29 forms hypotheses for the provisional nouns (hypotheses 1 to 4) based on the semantic candidate lists of FIG. 22(c) and calculates a consistency evaluation value for each in the same manner as in the first embodiment. For example, assume that, as a result of syntax analysis under hypothesis 1, only the sentence with original sentence ID = Q308 has a matching case in the parsed corpus 40, as shown in FIG. 23(a). In this case, the consistency evaluation value of hypothesis 1 is 1/4 = 0.25. Similarly, assume that, as a result of syntax analysis under hypothesis 2, only the sentence with original sentence ID = Q302 has a matching case in the parsed corpus 40, as shown in FIG. 23(b). In this case, the consistency evaluation value of hypothesis 2 is 1/4 = 0.25. Assume further that, as a result of syntax analysis under hypothesis 3, the sentences with original sentence IDs = Q302, Q304, and Q308 have matching cases, as shown in FIG. 23(c). In this case, the consistency evaluation value of hypothesis 3 is 3/4 = 0.75. Finally, assume that, as a result of syntax analysis under hypothesis 4, only the sentence with original sentence ID = Q304 has a matching case, as shown in FIG. 23(d). In this case, the consistency evaluation value of hypothesis 4 is 1/4 = 0.25.
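The hypothesis evaluation can be sketched as follows. Two points are assumptions: the hypotheses are generated here as the Cartesian product of the two semantic candidate lists, which yields four combinations but not necessarily in the numbering of FIG. 22(d), and the matching sentences are therefore keyed by hypothesis number exactly as stated in the text rather than by symbol combination. The function names are hypothetical.

```python
from itertools import product


def build_hypotheses(candidate_lists):
    """Every combination of one semantic symbol per provisional noun."""
    names = list(candidate_lists)
    combos = product(*(candidate_lists[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def evaluate(matches_per_hypothesis, kmax):
    """Consistency evaluation value n / kmax for each hypothesis."""
    return {h: len(ids) / kmax for h, ids in matches_per_hypothesis.items()}


candidate_lists = {"[**1]": ["core", "bacteria"], "[**2]": ["soil", "metal"]}
hypotheses = build_hypotheses(candidate_lists)  # four combinations

# Sentences with a matching case under each hypothesis, as given in the text.
matches = {
    1: {"Q308"},
    2: {"Q302"},
    3: {"Q302", "Q304", "Q308"},
    4: {"Q304"},
}
scores = evaluate(matches, kmax=4)
best = max(scores, key=scores.get)
print(scores)  # {1: 0.25, 2: 0.25, 3: 0.75, 4: 0.25}
print(best)    # 3
```

Scoring whole combinations rather than each provisional noun separately is what lets a sentence count as a match only when both substitutions fit together.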

Once the consistency evaluation values of FIGS. 23(a) to 23(d) have been obtained in step S20' as described above, the consistency calculation unit 29 proceeds to the next step, S22'.

In step S22', the consistency calculation unit 29 determines whether there is a hypothesis whose consistency evaluation value is the maximum and exceeds a threshold (for example, 0.5). If the determination is negative, the process proceeds to step S26'; if it is affirmative, the process proceeds to step S24. In step S24, the consistency calculation unit 29 registers the semantic symbols in the noun phrase list as shown in FIG. 24 and then proceeds to step S26'.

In step S26', the consistency calculation unit 29 determines whether all noun phrase set candidates have been identified. If the determination in step S26' is negative, the process returns to step S14' and the processing described above is repeated. For a set that contains only noun phrase candidate character strings whose semantic symbols were already determined in earlier processing (the processing of noun phrase set ID = 1), such as the noun phrase sets with noun phrase set ID = 2 and 3 in FIG. 19(d), the processing from step S14' onward need not be executed. If, on the other hand, the determination in step S26' is affirmative, the process proceeds to step S28.
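The optional skip described above amounts to a simple membership test. In the sketch below the contents of noun phrase sets 2 and 3 are an assumption (FIG. 19(d) is not reproduced here), the noun phrase list is keyed by candidate ID for brevity, and the symbol values are placeholders.

```python
def needs_processing(noun_phrase_set, noun_phrase_list):
    """True if the set contains a candidate whose symbol is not yet determined."""
    return any(cand_id not in noun_phrase_list for cand_id in noun_phrase_set)


# Assumed contents of the noun phrase sets, keyed by noun phrase set ID.
noun_phrase_sets = {
    1: ("CANDI-1", "CANDI-2"),
    2: ("CANDI-1",),
    3: ("CANDI-2",),
}

# State after set 1 has been processed; the symbol values are placeholders.
noun_phrase_list = {"CANDI-1": "<symbol 1>", "CANDI-2": "<symbol 2>"}

remaining = [
    set_id
    for set_id, members in noun_phrase_sets.items()
    if needs_processing(members, noun_phrase_list)
]
print(remaining)  # []
```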

In step S28, as in the first embodiment, the syntax analysis unit 30 replaces noun phrases with semantic symbols based on the noun phrase list (FIG. 24) and analyzes the document. Then, in step S30, the syntax analysis unit 30 outputs the analysis result of step S28 to the translation unit 52 via the output unit 34.

As described in detail above, the second embodiment provides the same effects as the first embodiment. In addition, when there are sentences that contain a plurality of types of noun phrase candidate character strings, the noun phrase candidate extraction unit 22 identifies, from the document, the sentences that contain at least one of those noun phrase candidate character strings, and the consistency calculation unit 29 extracts from the parsed corpus 40 the syntax cases similar to each of the identified sentences. The consistency calculation unit 29 then determines, based on the extracted syntax cases, whether the plurality of noun phrase candidate character strings are correct noun phrases. Because this determines at once whether a plurality of types of noun phrase candidate character strings are correct noun phrases, an overall judgment that uses all of them together becomes possible. The determination is therefore more accurate than when the noun phrase candidate character strings are judged one at a time.

The first and second embodiments describe the case where the translation terminal 10 includes the analysis unit 50 and the translation unit 52, but the configuration is not limited to this. For example, as shown in FIG. 25, a server 110 connected to a network 180 may include the analysis unit 50 and the translation unit 52. In this case, when a document to be translated is input from a client 120, the document is translated with high accuracy on the server 110, and the translation is output from the server 110 to the client 120. In the configuration of FIG. 25, the client 120 may instead include either the analysis unit 50 or the translation unit 52.

The first and second embodiments are described using translation from Chinese into Japanese as an example, but they are not limited to this. The devices and methods of the first and second embodiments may also be used to translate languages other than Chinese that have no inflection by part of speech, and to translate from Chinese or similar languages into languages other than Japanese.

The processing functions described above can be implemented by a computer. In that case, a program is provided that describes the processing content of the functions the processing apparatus should have. The processing functions are implemented on the computer by executing the program on the computer. The program describing the processing content can be recorded on a computer-readable recording medium (excluding carrier waves).

When the program is distributed, it is sold, for example, in the form of a portable recording medium such as a DVD (Digital Versatile Disc) or CD-ROM (Compact Disc Read Only Memory) on which the program is recorded. The program can also be stored in a storage device of a server computer and transferred from the server computer to another computer via a network.

A computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing in accordance with the program. The computer can also read the program directly from the portable recording medium and execute processing in accordance with it. Alternatively, each time a program is transferred from the server computer, the computer can execute processing in accordance with the received program.

The embodiments described above are preferred examples of the present invention. The invention is not limited to them, however, and various modifications can be made without departing from the gist of the present invention.

Regarding the above description of the first and second embodiments, the following supplementary notes are further disclosed.
(Supplementary note 1) A document analysis apparatus comprising:
an assumption unit that assumes a word group appearing multiple times in a document to be a noun phrase candidate;
an extraction unit that extracts, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
a determination unit that determines, based on an extraction result of the extraction unit, whether the assumption of the noun phrase candidate is correct.
(Supplementary note 2) The document analysis apparatus according to supplementary note 1, wherein, when there are a plurality of noun phrase candidates,
the extraction unit preferentially selects a noun phrase candidate having a larger number of characters and extracts syntaxes similar to each of a plurality of sentences including the selected noun phrase candidate.
(Supplementary note 3) The document analysis apparatus according to supplementary note 1 or 2, further comprising a specifying unit that specifies the meaning of the noun phrase candidate based on the syntax extracted by the extraction unit.
(Supplementary note 4) The document analysis apparatus according to any one of supplementary notes 1 to 3, wherein, when the document contains a sentence including a plurality of types of noun phrase candidates,
the extraction unit identifies, from the document, a plurality of sentences each including at least one of the plurality of types of noun phrase candidates and extracts, from the syntax list, syntaxes similar to each of the identified sentences, and
the determination unit determines, based on an extraction result of the extraction unit, whether the assumptions of the plurality of types of noun phrase candidates are correct.
(Supplementary note 5) The document analysis apparatus according to any one of supplementary notes 1 to 4, further comprising a translation unit that translates the document based on a determination result of the determination unit.
(Supplementary note 6) A document analysis program that causes a computer to execute processing comprising:
assuming a word group appearing multiple times in a document to be a noun phrase candidate;
extracting, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
determining, based on a result of the extracting, whether the assumption of the noun phrase candidate is correct.
(Supplementary note 7) The document analysis program according to supplementary note 6, wherein, when there are a plurality of noun phrase candidates,
the extracting preferentially selects a noun phrase candidate having a larger number of characters and extracts syntaxes similar to each of a plurality of sentences including the selected noun phrase candidate.
(Supplementary note 8) The document analysis program according to supplementary note 6 or 7, further causing the computer to execute a process of specifying the meaning of the noun phrase candidate based on the syntax extracted in the extracting.
(Supplementary note 9) The document analysis program according to any one of supplementary notes 6 to 8, wherein, when the document contains a sentence including a plurality of types of noun phrase candidates,
the extracting identifies, from the document, a plurality of sentences each including at least one of the plurality of types of noun phrase candidates and extracts, from the syntax list, syntaxes similar to each of the identified sentences, and
the determining determines, based on a result of the extracting, whether the assumptions of the plurality of types of noun phrase candidates are correct.
(Supplementary note 10) The document analysis program according to any one of supplementary notes 6 to 9, which causes the computer to execute a process of translating the document based on a result of the determining.
(Supplementary note 11) A document analysis method in which a computer executes processing comprising:
assuming a word group appearing multiple times in a document to be a noun phrase candidate;
extracting, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
determining, based on a result of the extracting, whether the assumption of the noun phrase candidate is correct.
(Supplementary note 12) The document analysis method according to supplementary note 11, wherein, when there are a plurality of noun phrase candidates,
the extracting preferentially selects a noun phrase candidate having a larger number of characters and extracts syntaxes similar to each of a plurality of sentences including the selected noun phrase candidate.
(Supplementary note 13) The document analysis method according to supplementary note 11 or 12, wherein the computer further executes a process of specifying the meaning of the noun phrase candidate based on the syntax extracted in the extracting.
(Supplementary note 14) The document analysis method according to any one of supplementary notes 11 to 13, wherein, when the document contains a sentence including a plurality of types of noun phrase candidates,
the extracting identifies, from the document, a plurality of sentences each including at least one of the plurality of types of noun phrase candidates and extracts, from the syntax list, syntaxes similar to each of the identified sentences, and
the determining determines, based on a result of the extracting, whether the assumptions of the plurality of types of noun phrase candidates are correct.
(Supplementary note 15) The document analysis method according to any one of supplementary notes 11 to 14, wherein the computer further executes a process of translating the document based on a result of the determining.
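
As a rough illustration of the assumption step in supplementary notes 1 and 2, the following Python sketch treats every run of consecutive words that occurs at least twice in a word-segmented document as a noun phrase candidate, and orders the candidates so that those with more characters are tried first. The function name `assume_noun_phrase_candidates`, the word-count limits, and the sample sentences are assumptions made for this example; the notes themselves do not specify how word groups are enumerated.

```python
from collections import Counter


def assume_noun_phrase_candidates(sentences, min_words=2, max_words=4, min_count=2):
    """Return word groups occurring at least min_count times, longest first.

    sentences: list of word lists (the document is assumed to be segmented already).
    """
    counts = Counter()
    for words in sentences:
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1

    repeated = [group for group, c in counts.items() if c >= min_count]
    # Candidates with more characters get priority (supplementary note 2);
    # ties are broken by the words themselves to keep the order deterministic.
    repeated.sort(key=lambda g: (-sum(len(w) for w in g), g))
    return repeated


# Hypothetical segmented document used only for illustration.
sentences = [
    ["我们", "使用", "机器", "翻译", "系统"],
    ["机器", "翻译", "系统", "很", "快"],
    ["他", "喜欢", "机器", "翻译"],
]
candidates = assume_noun_phrase_candidates(sentences)
for group in candidates:
    print("".join(group))
```

Here the three-word group is listed before the two-word groups it contains, so a later step would test the longest hypothesis first.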

10 Translation terminal (document analysis apparatus)
22 Noun phrase candidate extraction unit (assumption unit)
29 Consistency calculation unit (extraction unit, determination unit, specifying unit)
40 Parsed corpus (syntax list)
52 Translation unit

Claims

A document analysis apparatus comprising:
an assumption unit that assumes a word group appearing multiple times in a document to be a noun phrase candidate;
an extraction unit that extracts, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
a determination unit that determines whether the assumption of the noun phrase candidate is correct based on whether the extraction unit has extracted syntaxes similar to the plurality of sentences including the noun phrase candidate.

The document analysis apparatus according to claim 1, wherein, when there are a plurality of noun phrase candidates,
the extraction unit preferentially selects a noun phrase candidate having a larger number of characters and extracts syntaxes similar to each of a plurality of sentences including the selected noun phrase candidate.

The document analysis apparatus according to claim 1, further comprising: a specifying unit that specifies the meaning of the noun phrase candidate based on the syntax extracted by the extraction unit.

The document analysis apparatus according to claim 1, wherein, when the document contains a sentence including a plurality of types of noun phrase candidates,
the extraction unit identifies, from the document, a plurality of sentences each including at least one of the plurality of types of noun phrase candidates and extracts, from the syntax list, syntaxes similar to each of the identified sentences, and
the determination unit determines whether the assumptions of the plurality of types of noun phrase candidates are correct based on whether the extraction unit has extracted syntaxes similar to the plurality of sentences each including at least one of the plurality of types of noun phrase candidates.

The document analysis apparatus according to claim 1, further comprising a translation unit that performs translation of the document based on a determination result of the determination unit.

A document analysis program that causes a computer to execute processing comprising:
assuming a word group appearing multiple times in a document to be a noun phrase candidate;
extracting, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
determining whether the assumption of the noun phrase candidate is correct based on whether syntaxes similar to the plurality of sentences including the noun phrase candidate could be extracted in the extracting.

A document analysis method that performs processing in which:
an assumption unit assumes a word group appearing multiple times in a document to be a noun phrase candidate;
an extraction unit extracts, from among a plurality of correct syntaxes included in a syntax list, syntaxes similar to each of a plurality of sentences including the noun phrase candidate; and
a determination unit determines whether the assumption of the noun phrase candidate is correct based on whether the extraction unit has extracted syntaxes similar to the plurality of sentences including the noun phrase candidate.
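
The extraction and determination steps recited in the claims can be sketched as follows. This is only a minimal illustration under several assumptions that the claims do not state: sentences are represented as (word, part-of-speech) pairs, the syntax list is a small set of part-of-speech patterns, similarity is measured with Python's `difflib.SequenceMatcher` against a threshold of 0.9, and the assumption is judged correct only when a similar syntax is extracted for every sentence containing the candidate. The names `SYNTAX_LIST`, `collapse`, `similar_syntax`, and `assumption_is_correct` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical syntax list: tag patterns of sentences known to be parsed correctly.
SYNTAX_LIST = [
    ("PRON", "VERB", "NP"),
    ("NP", "ADV", "ADJ"),
]


def contains(tagged, candidate):
    """True if the candidate word group occurs in the tagged sentence."""
    words = [w for w, _ in tagged]
    n = len(candidate)
    return any(tuple(words[i:i + n]) == candidate for i in range(len(words) - n + 1))


def collapse(tagged, candidate):
    """Tag sequence of the sentence with each candidate occurrence treated as one NP."""
    words = [w for w, _ in tagged]
    tags, i, n = [], 0, len(candidate)
    while i < len(words):
        if tuple(words[i:i + n]) == candidate:
            tags.append("NP")
            i += n
        else:
            tags.append(tagged[i][1])
            i += 1
    return tuple(tags)


def similar_syntax(pattern, threshold=0.9):
    """Extract the most similar syntax from the list, or None if none is close enough."""
    best = max(SYNTAX_LIST, key=lambda s: SequenceMatcher(None, pattern, s).ratio())
    return best if SequenceMatcher(None, pattern, best).ratio() >= threshold else None


def assumption_is_correct(tagged_sentences, candidate):
    """Judge the assumption from whether similar syntaxes were extracted."""
    containing = [s for s in tagged_sentences if contains(s, candidate)]
    extracted = [similar_syntax(collapse(s, candidate)) for s in containing]
    return len(containing) >= 2 and all(e is not None for e in extracted)


# Hypothetical tagged document; "翻译" is tagged as a verb to mimic part-of-speech ambiguity.
document = [
    [("我们", "PRON"), ("使用", "VERB"), ("机器", "NOUN"), ("翻译", "VERB"), ("系统", "NOUN")],
    [("机器", "NOUN"), ("翻译", "VERB"), ("系统", "NOUN"), ("很", "ADV"), ("快", "ADJ")],
]
print(assumption_is_correct(document, ("机器", "翻译", "系统")))
print(assumption_is_correct(document, ("翻译", "系统")))
```

With this toy data, treating the three-word group as a single noun phrase makes both sentences match an entry in the syntax list, whereas the shorter two-word group leaves a stray noun and no sufficiently similar syntax is found.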