JP2014067179A

JP2014067179A - Document processor and document processing program

Info

Publication number: JP2014067179A
Application number: JP2012211368A
Authority: JP
Inventors: Guowei Zu; 国威祖; Toshiyuki Kano; 敏行加納
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2012-09-25
Filing date: 2012-09-25
Publication date: 2014-04-17
Also published as: CN103678476A

Abstract

PROBLEM TO BE SOLVED: To determine portions of document data that are easily mistranslated.
SOLUTION: A document processor includes: input means for inputting document data; analysis means for analyzing sentences of the document data input by the input means; affix extraction means for extracting a predetermined affix from the analysis result of the analysis means; rule storage means for storing an affix check rule, in which a criterion for determining whether a word containing the predetermined affix is easily mistranslated is associated with the predetermined affix, and a compound word extraction rule for extracting a compound word containing an affix that is easily mistranslated; determination means for determining, when a word containing the affix extracted from the analysis result satisfies the criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated; and compound word extraction means for extracting, when a word containing the affix extracted from the analysis result satisfies the compound word extraction rule, the compound word that contains the extracted affix and conforms to that rule as a compound word that is easily mistranslated.

Description

Embodiments of the present invention relate to a document processing apparatus and a document processing program used for machine translation of document data.

Conventionally, when document data is machine-translated, a compound word that is not registered in the translation dictionary and that contains an affix that is hard to translate literally, such as "可" (possible), "未" (not yet), or "無" (without), is difficult for the machine translation engine to interpret, and mistranslation is likely to occur.

A compound word is a word formed by combining two or more originally independent words so that the combination acquires a meaning and function as a single new word. Compound-word terms are especially common in technical writing. Because compound words are so varied, it is generally difficult to register them exhaustively in a dictionary by hand.

As a first example of a compound word that is easily mistranslated, Japanese-to-English machine translation of "文書管理システム未導入部門" (a department that has not yet introduced a document management system) may produce the mistranslation "Department introduced a document management system not". The cause is that the machine translation engine fails to correctly interpret the affix "未" (not yet) in "未導入" (not yet introduced).

Likewise, Japanese-to-Chinese machine translation of the same compound word "文書管理システム未導入部門" may produce the mistranslation "文件管理系統綿羊引進部門". Here the translation engine has rendered the affix "未" of "未導入" as "綿羊" (sheep).

As a second example of a compound word that is easily mistranslated, Japanese-to-English machine translation of "変換元パターン" (conversion source pattern) yields different results, such as Translation A and Translation B below, depending on the machine translation engine used.
Translation A: the former pattern of conversion.
Translation B: the pattern of a changing agency.
The translations differ because each engine interprets the affix "元" (source, origin) in "変換元" differently. In Translation A, the engine reads "変換元パターン" as "変換の元パターン" (the original pattern of the conversion). In Translation B, the engine reads it as "変換元のパターン" (the pattern of the conversion source). Such fluctuation in the interpretation of the source text occurs not only in machine translation but also in human translation.

To address this mistranslation problem, if compound words that are easily mistranslated could be found automatically in the source text before machine translation processes them, registering those compound words in the translation dictionary would improve translation accuracy.

Regarding the diagnosis of compound words that are easily mistranslated, a technique has been disclosed that, for example, extracts terms not registered in a dictionary (including compound words) by using information on frequently occurring character strings, based on sequences of parts of speech, character types, and the like.

A technique has also been disclosed that uses word co-occurrence information in the target language to output translation candidates for untranslated character strings that remain in the source language in the translated text.
Furthermore, a technique has been disclosed that detects, in the source text, compound words joined by special characters such as hyphens, looks up each constituent of an unregistered compound word in a dictionary, and outputs translation knowledge using the compound-word structure information obtained from the lookup.

JP 2002-342321 A
JP 2000-259630 A
JP H06-295311 A

The above technique for extracting terms not registered in a dictionary does not take affixes into account. It therefore treats every unregistered word as an extraction target and also outputs compound words that can be translated literally and so need not be registered, such as "変換パターン" (conversion pattern). As a result, words that need not be registered must be removed by hand when registering compound words in the dictionary, which is laborious.

The above technique for outputting translation candidates for untranslated character strings diagnoses only untranslated words that are output in the source language in the translated text. It therefore cannot find compound words whose translation result contains no untranslated word, as in the first and second examples above.

Furthermore, the above technique for outputting translation knowledge from compound-word structure information relies on special characters such as hyphens as diagnostic cues, so it cannot find Japanese compound words written as an unbroken run of kanji and kana.

The problem to be solved by the present invention is to provide a document processing apparatus and a document processing program that make it possible to determine portions of document data that are easily mistranslated.

According to an embodiment, a document processing apparatus includes: input means for inputting document data; analysis means for analyzing sentences of the document data input by the input means; affix extraction means for extracting a predetermined affix from the analysis result of the analysis means; rule storage means for storing an affix check rule, in which a criterion for determining whether a word containing the predetermined affix is easily mistranslated is associated with the predetermined affix, and a compound word extraction rule for extracting a compound word containing an affix that is easily mistranslated; determination means for determining, when a word containing the affix extracted from the analysis result satisfies the criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated; and compound word extraction means for extracting, when a word containing the affix extracted from the analysis result satisfies the compound word extraction rule, the compound word that contains the extracted affix and conforms to that rule as a compound word that is easily mistranslated.

FIG. 1 is a block diagram showing an example of the hardware configuration of the document processing apparatus according to the first embodiment.
FIG. 2 is a block diagram showing an example of the functional configuration of the document processing apparatus according to the first embodiment.
FIG. 3 is a diagram showing, in table form, an example of the affix dictionary stored in the affix dictionary storage unit of the document processing apparatus according to the first embodiment.
FIG. 4 is a diagram showing, in table form, an example of the affix check rules stored in the diagnostic rule storage unit of the document processing apparatus according to the first embodiment.
FIG. 5 is a diagram showing, in table form, an example of the compound word extraction rule stored in the diagnostic rule storage unit of the document processing apparatus according to the first embodiment.
FIG. 6 is a flowchart showing an example of the processing procedure of the document processing apparatus according to the first embodiment.
FIG. 7 is a diagram showing an example of the syntax analysis result of an input sentence produced by the document processing apparatus according to the first embodiment.
FIG. 8 is a diagram showing an example of the diagnosis result for an easily mistranslated compound word produced by the document processing apparatus according to the first embodiment.
FIG. 9 is a block diagram showing an example of the functional configuration of the document processing apparatus according to the second embodiment.
FIG. 10 is a diagram showing, in table form, an example of the affix check rules stored in the diagnostic rule storage unit of the document processing apparatus according to the second embodiment.
FIG. 11 is a diagram showing, in table form, an example of the compound word extraction rule stored in the diagnostic rule storage unit of the document processing apparatus according to the second embodiment.
FIG. 12 is a flowchart showing an example of the processing operation of the document processing apparatus according to the second embodiment.
FIG. 13 is a diagram showing an example of the morphological analysis result of an input sentence produced by the document processing apparatus according to the second embodiment.

Embodiments will now be described with reference to the drawings.
(First embodiment)
First, the first embodiment will be described.
FIG. 1 is a block diagram showing an example of the hardware configuration of the document processing apparatus according to the first embodiment.
As shown in FIG. 1, the document processing apparatus 30 of the first embodiment includes a computer 10 and an external storage device 20. The computer 10 is connected to the external storage device 20, which stores a program (document processing program) 21 executed by the computer 10. The external storage device 20 is, for example, a hard disk drive or a non-volatile memory.
The document processing apparatus 30 has functions of, for example, presenting a sentence specified by the user, accepting an instruction to diagnose compound words that are easily mistranslated, and outputting the diagnosis result.

FIG. 2 is a block diagram showing an example of the functional configuration of the document processing apparatus according to the first embodiment. As shown in FIG. 2, the computer 10 includes an input unit 31, a syntax analysis unit 32, an affix extraction unit 33, an affix check unit 34, a compound word extraction unit 35, an output unit 36, an affix dictionary storage unit 37, and a diagnostic rule storage unit 38. In this embodiment, these units are realized by the computer 10 executing the program 21 stored in the external storage device 20 shown in FIG. 1.

The program 21 can be stored in advance on a computer-readable storage medium and distributed. The program 21 may also be downloaded by the computer 10 over a network, for example. In this embodiment, the affix dictionary storage unit 37 and the diagnostic rule storage unit 38 are held in, for example, the external storage device 20 shown in FIG. 1.

The affix dictionary storage unit 37 stores in advance an affix dictionary in which affixes and their types are registered. FIG. 3 shows, in table form, an example of the affix dictionary stored in the affix dictionary storage unit 37 of the document processing apparatus according to the first embodiment.
As shown in FIG. 3, the affix dictionary lists multiple headwords and the affix type corresponding to each headword. Each headword is an affix that is easily mistranslated when it stands in a particular dependency relation with another word. The affix type indicates the kind of affix and is used to look up the affix check rules described later. In the example of FIG. 3, there are three affix types, "A", "B", and "C": the headwords "当" and "非" belong to type "A"; the headwords "未", "無", "時", and "前" belong to type "B"; and the headwords "可" and "元" belong to type "C".
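The affix dictionary of FIG. 3 can be pictured as a simple lookup table. The following is a minimal Python sketch, not the embodiment's actual data format: the names AFFIX_DICTIONARY and find_registered_affixes are hypothetical, the node labels in the usage example are illustrative, and only the headword-to-type assignments are taken from the example of FIG. 3.

```python
# Hypothetical in-memory form of the affix dictionary in FIG. 3.
# Only the headword/type pairs come from the description; the
# variable and function names are assumptions made for illustration.
AFFIX_DICTIONARY = {
    "当": "A", "非": "A",
    "未": "B", "無": "B", "時": "B", "前": "B",
    "可": "C", "元": "C",
}


def find_registered_affixes(node_labels):
    """Return (label, affix_type) pairs for labels registered as headwords."""
    return [
        (label, AFFIX_DICTIONARY[label])
        for label in node_labels
        if label in AFFIX_DICTIONARY
    ]


if __name__ == "__main__":
    # Illustrative node labels of a parsed sentence.
    labels = ["登録", "前", "変換", "元", "パターン", "出力する"]
    for label, affix_type in find_registered_affixes(labels):
        print(label, affix_type)
```

Keeping the type in the dictionary, rather than a rule per affix, means a new affix can be added with one entry as long as an existing type already describes its risky dependency pattern.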

The diagnostic rule storage unit 38 stores in advance affix check rules and compound word extraction rules.
An affix check rule is a rule for determining whether an affix under inspection in the input sentence is easily mistranslated.
FIG. 4 shows, in table form, an example of the affix check rules stored in the diagnostic rule storage unit 38 of the document processing apparatus according to the first embodiment.
As shown in FIG. 4, the affix check rules provide, for each affix type listed in the affix dictionary, a criterion for determining whether an affix of that type under inspection is easily mistranslated. In the example of FIG. 4, the criterion for affix type "A" is: "If the affix node and its parent node are in a 連語 (collocation) dependency relation, the affix is determined to be easily mistranslated."

Alternatively, the affix dictionary of FIG. 3 and the affix check rules of FIG. 4 may be merged into a single rule set that defines no affix types and instead provides a criterion for each individual affix. That said, defining affix types and keeping the affix dictionary separate from the affix check rules, as described above, naturally keeps the affix check rules simpler.

FIG. 5 shows, in table form, an example of the compound word extraction rule stored in the diagnostic rule storage unit 38 of the document processing apparatus according to the first embodiment.
The compound word extraction rule shown in FIG. 5 determines the boundaries of a compound word containing an easily mistranslated affix and extracts that compound word.

The input unit 31 accepts input of document data in response to user operations on, for example, a keyboard or a mouse.
The syntax analysis unit 32 parses the input sentences in the document data received by the input unit 31 and outputs the analysis result.

Based on the affix dictionary stored in the affix dictionary storage unit 37, the affix extraction unit 33 extracts any affix in the input sentence that should be inspected. The extracted affix becomes the target of the affix check.

The affix check unit 34 takes the result of the syntax analysis by the syntax analysis unit 32 and, using the affix type of each affix extracted by the affix extraction unit 33 for inspection, looks up the affix check rules stored in the diagnostic rule storage unit 38. If a word containing the affix under inspection, as represented in the syntax analysis result, meets the criterion that the affix check rules give for that affix's type, the affix check unit 34 judges the affix in that word to be easily mistranslated.

Based on the compound word extraction rules stored in the diagnostic rule storage unit 38, the compound word extraction unit 35 determines the boundaries of a compound word containing an easily mistranslated affix and extracts that compound word.
The output unit 36 outputs to the user the result of the compound word diagnosis for the input sentence, namely the easily mistranslated affixes found by the affix check unit 34 and the easily mistranslated compound words extracted by the compound word extraction unit 35.

Next, with reference to the flowchart of FIG. 6, the processing procedure of the document processing apparatus according to this embodiment is described for the case where a compound word diagnosis is performed on the syntax analysis result of a sentence entered by the user and the diagnosis result is generated and output to the user.

(1) Acquiring the input sentence
When the user performs an operation on the input unit 31 to enter a sentence to be inspected, the input unit 31 acquires the entered sentence (step S1). The sentence may be typed directly by the user from a keyboard or the like, or read from an existing file.

(2) Syntax analysis
The syntax analysis unit 32 parses the input sentence (step S2).
FIG. 7 shows an example of the syntax analysis result produced by the document processing apparatus according to the first embodiment when the input sentence is "登録前の変換元パターンを出力する。" ("Output the pre-registration conversion source pattern.").
Each ellipse in the syntax analysis result of FIG. 7 contains the stem of one phrase (bunsetsu) of the input sentence. Each ellipse is called a node. In FIG. 7, two nodes in a dependency relation are connected by an arrow, which is called an arc. The node at the head of the arrow is called the parent node, and the node at the tail of the arrow is called the child node. The word in angle brackets (<>) inside an ellipse is the part of speech of the corresponding node.

Each arrow between nodes is labeled with the dependency relation between the two nodes it connects. For example, as shown in FIG. 7, the label "ヲ格" (the case marked by the particle を) on the arrow between the child node "パターン" (pattern) and the parent node "出力する" (output) means that the dependency relation between them is the ヲ case.

Similarly, as shown in FIG. 7, the label "連語" (collocation) on the arrow between the child node "登録" (registration, a noun) and the parent node "前" (before) means that these two nodes are in a collocation relation. Likewise, the child node "変換" (conversion) and the parent node "元" (source), whose arrow is labeled "連語", are in a collocation relation, as are the child node "元" and the parent node "パターン". Here, a collocation relation refers to a linguistic expression that consists of multiple morphemes but is used as a unit in the same way as a single word.
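The node-and-arc structure of FIG. 7 can be written down as a small data structure. The sketch below is one possible Python representation under stated assumptions: the class name Node, its fields, and the helper children_of are hypothetical, parts of speech are omitted because the text does not list them, and the label "ノ格" on the arc from 前 to 変換 is assumed (the description only states that this arc is not 連語).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of the dependency tree (one ellipse in FIG. 7)."""
    index: int                      # position of the node in the sentence
    label: str                      # stem written inside the ellipse
    parent: Optional[int] = None    # index of the parent node; None for the root
    relation: Optional[str] = None  # label on the arc leading to the parent


# Tree for the input sentence 登録前の変換元パターンを出力する。
TREE: List[Node] = [
    Node(0, "登録", parent=1, relation="連語"),
    Node(1, "前", parent=2, relation="ノ格"),   # arc label assumed, see lead-in
    Node(2, "変換", parent=3, relation="連語"),
    Node(3, "元", parent=4, relation="連語"),
    Node(4, "パターン", parent=5, relation="ヲ格"),
    Node(5, "出力する"),                         # root: no parent, no relation
]


def children_of(tree: List[Node], index: int) -> List[Node]:
    """Return the nodes whose arc points at the node with the given index."""
    return [node for node in tree if node.parent == index]


if __name__ == "__main__":
    for node in TREE:
        if node.parent is not None:
            print(f"{node.label} -> {TREE[node.parent].label} ({node.relation})")
```

Storing the arc label on the child keeps each node to a single parent link, which matches the one-arrow-per-child shape described for FIG. 7.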

(3) Affix extraction
Referring to the headwords of the affix dictionary stored in the affix dictionary storage unit 37, the affix extraction unit 33 determines whether the syntax analysis result of the input sentence contains an affix registered in the affix dictionary (step S3). If it does, the affix extraction unit 33 makes that affix a target of inspection.

For example, the syntax analysis result of the input sentence "登録前の変換元パターンを出力する。" contains the affixes "元" and "前". Referring to the affix dictionary of FIG. 3, the affix extraction unit 33 finds that both affixes are registered in the dictionary, so it makes both of them targets of the affix check.

(4) Affix check
When the input sentence contains an affix under inspection, the affix check unit 34 applies the affix check rules stored in the diagnostic rule storage unit 38 to the syntax analysis result of the input sentence and checks whether the result meets the criterion that the rules give for the affix type of that affix (step S4). If the syntax analysis result meets the criterion, the affix check unit 34 judges the affix under inspection to be easily mistranslated.

For example, the affix "元" under inspection in the input sentence "登録前の変換元パターンを出力する。" has affix type "C" in the affix dictionary of FIG. 3, so the affix check unit 34 compares the syntax analysis result against the criterion for affix type "C" in the affix check rules of FIG. 4. In the example of FIG. 4, the criterion for affix type "C" is: "If the affix node and its child node are in a 連語 (collocation) dependency relation, the affix is determined to be easily mistranslated."

As shown in FIG. 7, the node of the affix "元" under inspection, taken as a parent node, has a "連語" dependency relation with its child node "変換". The affix check unit 34 therefore judges the affix "元" to be easily mistranslated.

In contrast, the affix "前" under inspection in the same input sentence has affix type "B" in the affix dictionary of FIG. 3, so the affix check unit 34 compares the syntax analysis result against the criterion for affix type "B" in the affix check rules of FIG. 4. In the example of FIG. 4, the criterion for affix type "B" is: "If the affix node is in a 連語 (collocation) dependency relation with both its parent node and its child node, the affix is determined to be easily mistranslated." That is, the affix is judged easily mistranslated only when the arc between the affix node and its parent node is "連語" and the arc between the affix node and its child node is also "連語".

As shown in FIG. 7, the node of the affix "前" under inspection, taken as a parent node, has a "連語" dependency relation with its child node "登録". However, taken as a child node, its dependency relation with its parent node "変換" is not "連語". The affix check unit 34 therefore does not judge the affix "前" to be easily mistranslated.
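The three criteria of FIG. 4 and the two worked checks above can be traced in code. This is a minimal Python sketch, not the apparatus's implementation: all names are hypothetical, arcs are keyed by node label on the assumption that labels are unique within the sentence, and the label "ノ格" on the arc from 前 to 変換 is assumed because the description only says that arc is not 連語.

```python
COLLOCATION = "連語"

# Affix types from the example of FIG. 3.
AFFIX_TYPES = {
    "当": "A", "非": "A",
    "未": "B", "無": "B", "時": "B", "前": "B",
    "可": "C", "元": "C",
}

# Arcs of FIG. 7 as child -> (parent, relation).
ARCS = {
    "登録": ("前", "連語"),
    "前": ("変換", "ノ格"),      # relation label assumed, see lead-in
    "変換": ("元", "連語"),
    "元": ("パターン", "連語"),
    "パターン": ("出力する", "ヲ格"),
}


def has_collocation_parent(node):
    """True if the arc from this node to its parent is labeled 連語."""
    return node in ARCS and ARCS[node][1] == COLLOCATION


def has_collocation_child(node):
    """True if some child is attached to this node by a 連語 arc."""
    return any(
        parent == node and relation == COLLOCATION
        for parent, relation in ARCS.values()
    )


def is_mistranslation_prone(affix):
    """Apply the criterion of FIG. 4 that matches the affix's type."""
    affix_type = AFFIX_TYPES[affix]
    if affix_type == "A":
        return has_collocation_parent(affix)
    if affix_type == "B":
        return has_collocation_parent(affix) and has_collocation_child(affix)
    if affix_type == "C":
        return has_collocation_child(affix)
    raise ValueError(f"unknown affix type: {affix_type}")


if __name__ == "__main__":
    for affix in ("元", "前"):
        print(affix, is_mistranslation_prone(affix))
```

With this tree, 元 (type C) is flagged because 変換 hangs off it by a 連語 arc, while 前 (type B) is not flagged because only its child arc, not its parent arc, is 連語.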

(5) Compound word extraction
Next, if the processing by the affix check unit 34 shows that the input sentence contains an easily mistranslated affix, the compound word extraction unit 35 uses the compound word extraction rule stored in the diagnostic rule storage unit 38, together with the syntax analysis result of the input sentence, to determine the boundaries of the compound word containing that affix and extracts the compound word from the input sentence (step S5).

For example, the compound word extraction unit 35 applies the compound word extraction rule of FIG. 5 to the syntax analysis result of FIG. 7 to determine the boundaries of the compound word containing the affix "元", which was judged easily mistranslated as described above. The compound word extraction rule of FIG. 5 is: "Starting from the affix, merge the nodes in a 連語 (collocation) relation, on both the parent side and the child side, into one compound word." In the syntax analysis result of FIG. 7, the child node in a collocation relation with the affix "元" is "変換", and the parent node in a collocation relation with the affix "元" is "パターン". Following the rule, the compound word extraction unit 35 therefore takes "変換元パターン" in the input sentence as the compound word containing the affix "元". Because this compound word contains an easily mistranslated affix, it is itself an easily mistranslated compound word.
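The boundary decision in this example can be sketched as a walk over 連語 arcs starting from the flagged affix. The Python below is an illustration under stated assumptions rather than the embodiment's procedure: the names are hypothetical, the walk is applied transitively (the rule text does not say whether merging continues beyond the affix's direct neighbours; both readings give the same result for this sentence), the compound is rebuilt by concatenating node labels in sentence order, and the "ノ格" label is assumed.

```python
COLLOCATION = "連語"

# Nodes of FIG. 7 in sentence order: (label, parent index, relation to parent).
NODES = [
    ("登録", 1, "連語"),
    ("前", 2, "ノ格"),           # relation label assumed; the text says "not 連語"
    ("変換", 3, "連語"),
    ("元", 4, "連語"),
    ("パターン", 5, "ヲ格"),
    ("出力する", None, None),
]


def extract_compound(nodes, affix_index):
    """Merge the nodes linked to the affix by 連語 arcs into one compound word."""
    members = {affix_index}
    frontier = [affix_index]
    while frontier:
        current = frontier.pop()
        neighbours = []
        # Parent side: follow the arc to the parent if it is a collocation arc.
        _, parent, relation = nodes[current]
        if parent is not None and relation == COLLOCATION:
            neighbours.append(parent)
        # Child side: collect children attached by a collocation arc.
        for index, (_, node_parent, node_relation) in enumerate(nodes):
            if node_parent == current and node_relation == COLLOCATION:
                neighbours.append(index)
        for index in neighbours:
            if index not in members:
                members.add(index)
                frontier.append(index)
    # Rebuild the surface string in sentence order.
    return "".join(nodes[index][0] for index in sorted(members))


if __name__ == "__main__":
    print(extract_compound(NODES, 3))
```

The walk stops at 変換 on the child side and at パターン on the parent side because the next arcs (前 to 変換, and パターン to 出力する) are not 連語, which is what sets the compound's boundaries.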

(6) Output
The output unit 36 outputs a diagnosis result containing the easily mistranslated affix found by the affix checking unit 34 and the compound word extracted by the compound word extraction unit 35 (step S6). The diagnosis result is output, for example, by displaying it on a liquid crystal display, by writing a list of diagnosis results to a CSV file, or by writing a warning message into the document file as a comment.
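Of the output forms mentioned, the CSV list is the easiest to sketch. The following Python fragment is a minimal illustration only: the column names, the `write_diagnosis_csv` helper, and the wording of the message are assumptions, since the text does not specify the layout of the CSV file.

```python
import csv
import io

# Hypothetical column layout; the source does not define one.
COLUMNS = ["sentence", "affix", "compound_word", "message"]


def write_diagnosis_csv(results, stream):
    """Write one row per diagnosed compound word to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(COLUMNS)
    for result in results:
        writer.writerow([result[name] for name in COLUMNS])


results = [{
    "sentence": "登録前の変換元パターンを出力する。",
    "affix": "元",
    "compound_word": "変換元パターン",
    # Illustrative message text, not the wording of FIG. 8.
    "message": "Unregistered compound word containing an easily mistranslated affix",
}]

buffer = io.StringIO()
write_diagnosis_csv(results, buffer)
print(buffer.getvalue())
```

Writing to a stream rather than a fixed path keeps the sketch independent of any file location; passing an opened file object (opened with `newline=""`) instead of `io.StringIO` would produce an actual CSV file.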

FIG. 8 is a diagram illustrating an example of the result of diagnosing an easily mistranslated compound word by the document processing apparatus according to the first embodiment.
As shown in FIG. 8, the diagnosis result presents a message stating that the input sentence contains a compound word not registered in the translation dictionary and that this compound word contains an affix that is easily mistranslated, thereby indicating that the compound word is easily mistranslated.

As described above, with the document processing apparatus according to the first embodiment, the user can find easily mistranslated affixes and compound words based on the syntax analysis result of the input sentence. By registering such compound words in the translation dictionary based on the diagnosis result, the user can improve the accuracy of subsequent machine translation. Because the apparatus automatically extracts terms that should be registered in the translation dictionary, it reduces the user's burden of extending the dictionary. Furthermore, because the user can find hard-to-understand compound words in the input sentence, the apparatus also supports improving document quality.

(Second Embodiment)
Next, a second embodiment will be described. In the configuration of the document processing apparatus according to this embodiment, description of the parts that are the same as in the first embodiment is omitted.
FIG. 9 is a block diagram illustrating an example of the functional configuration of the document processing apparatus according to the second embodiment. As shown in FIG. 9, the computer 10 of the document processing apparatus 30 according to the second embodiment includes an input unit 311, a morphological analysis unit 312, an affix extraction unit 313, an affix checking unit 314, a compound word extraction unit 315, an output unit 316, an affix dictionary storage unit 317, and a diagnostic rule storage unit 318.

Compared with the first embodiment, the document processing apparatus according to the second embodiment is characterized by including the morphological analysis unit 312 in place of the syntax analysis unit 32. In this embodiment, the affix dictionary storage unit 317 and the diagnostic rule storage unit 318 are stored in, for example, the external storage device 20 shown in FIG. 1. As in the first embodiment, the affix dictionary storage unit 317 stores an affix dictionary such as the one shown in FIG. 3.

In this embodiment, the affix check rule and the compound word extraction rule stored in the diagnostic rule storage unit 318 are based on the result of morphological analysis, and therefore differ from the affix check rule and compound word extraction rule based on the syntax analysis result that were described in the first embodiment.

FIG. 10 is a diagram showing, in table form, an example of the affix check rule stored in the diagnostic rule storage unit of the document processing apparatus according to the second embodiment.
The affix check rule shown in FIG. 10 provides, for each affix type described in the affix dictionary, a determination criterion for deciding whether an affix of that type under inspection is easily mistranslated. In the example shown in FIG. 10, the criterion for affix type "A" is: "If the part of speech of the morpheme immediately after the affix is 'noun', the affix is judged to be easily mistranslated."

FIG. 11 is a diagram showing, in table form, an example of the compound word extraction rule stored in the diagnostic rule storage unit of the document processing apparatus according to the second embodiment.
The compound word extraction rule shown in FIG. 11 is a rule for determining the boundary of a compound word containing an easily mistranslated affix and extracting that compound word.

Like the input unit 31, the input unit 311 accepts instructions from the user in response to user operations on, for example, a keyboard or a mouse.
The morphological analysis unit 312 performs morphological analysis on the input sentence and outputs the analysis result.
When the input sentence contains an affix that should be inspected, the affix extraction unit 313 extracts it based on the affix dictionary stored in the affix dictionary storage unit 317. The extracted affix becomes the target of the affix check.

The affix checking unit 314 looks up the affix check rule stored in the diagnostic rule storage unit 318, using the affix type of the affix under inspection that the affix extraction unit 313 extracted from the input sentence, and applies it to the result of the morphological analysis by the morphological analysis unit 312. When a word containing the affix under inspection, as given in the morphological analysis result, satisfies the determination criterion that the affix check rule associates with that affix type, the affix checking unit 314 judges the affix in that word to be an affix that is easily mistranslated.
Based on the compound word extraction rule stored in the diagnostic rule storage unit 318, the compound word extraction unit 315 determines the boundary of the compound word containing the easily mistranslated affix and extracts the easily mistranslated compound word.

The output unit 316 outputs to the user the result of the compound word diagnosis for the input sentence, namely the easily mistranslated affix found by the affix checking unit 314 and the easily mistranslated compound word extracted by the compound word extraction unit 315.

FIG. 12 is a flowchart illustrating an example of the processing operation of the document processing apparatus according to the second embodiment.
(1) Acquisition of input sentence
First, when the user inputs a sentence to be inspected, the input unit 311 acquires the sentence (step S1). The user may enter the sentence directly from a keyboard or the like, or it may be read from an existing file.

(2) Morphological analysis
Next, the morphological analysis unit 312 performs morphological analysis on the input sentence (step S12). Here, the input sentence is assumed to be "登録前の変換元パターンを出力する。" ("Output the conversion source pattern before registration."), the same sentence used in the first embodiment.

FIG. 13 is a diagram illustrating an example of the morphological analysis result produced by the document processing apparatus according to the second embodiment when the input sentence is "登録前の変換元パターンを出力する。".
As shown in FIG. 13, the morphological analysis result separates the input sentence into morphemes with slashes ("/"). Each word enclosed in angle brackets (<>) is the part of speech of a morpheme.
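To make the notation concrete, the sketch below parses a string in the slash-and-angle-bracket form into (surface, part of speech) pairs. FIG. 13 itself is not reproduced here, so the sample string is an assumed reconstruction: the English tag names, the placement of each tag directly after its morpheme, the ASCII "/" delimiter, and the tags of every morpheme other than "登録", "変換" (nouns) and "の" (not a noun) are illustrative guesses. The function name `parse_morphemes` is likewise hypothetical.

```python
import re

# One token looks like surface<pos>, e.g. 登録<noun>.
TOKEN = re.compile(r"(?P<surface>[^<>/]+)<(?P<pos>[^<>]+)>")


def parse_morphemes(text):
    """Split a slash-delimited analysis result into (surface, pos) pairs."""
    morphemes = []
    for chunk in text.split("/"):
        match = TOKEN.fullmatch(chunk.strip())
        if match is None:
            raise ValueError(f"unexpected token: {chunk!r}")
        morphemes.append((match["surface"], match["pos"]))
    return morphemes


# Assumed reconstruction of the analysis of the example sentence.
SAMPLE = ("登録<noun>/前<affix>/の<particle>/変換<noun>/元<affix>/"
          "パターン<noun>/を<particle>/出力<noun>/する<verb>/。<symbol>")

for surface, pos in parse_morphemes(SAMPLE):
    print(surface, pos)
```

A list of pairs like this is enough for the later steps, which only need each morpheme's surface form, its part of speech, and its position in the sentence.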

(3) Affix extraction
As in the first embodiment, the affix extraction unit 313 refers to the headings of the affix dictionary stored in the affix dictionary storage unit 317 and determines whether the morphological analysis result of the input sentence contains an affix registered in the affix dictionary. If it does, the affix extraction unit 313 extracts that affix as an affix to be inspected (step S13).

For example, the morphological analysis result of the input sentence "登録前の変換元パターンを出力する。" contains the affixes "元" ("source") and "前" ("before"). Referring to the affix dictionary shown in FIG. 3, the affix extraction unit 313 finds that both affixes are registered in the dictionary, and therefore treats both as affixes to be inspected.
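The lookup described above amounts to matching each morpheme against the dictionary headings. A minimal Python sketch follows; the dictionary holds only the two entries this passage relies on ("元" as type "C" and "前" as type "B", as stated in the affix check walk-through), and the morpheme list, its English part-of-speech tags, and the name `extract_affixes` are assumptions made for illustration.

```python
# Excerpt of an affix dictionary: heading -> affix type.
# Only the two entries used in the running example are included.
AFFIX_DICTIONARY = {"元": "C", "前": "B"}

# Assumed morphological analysis of the example sentence as
# (surface, part of speech) pairs; the tag names are illustrative.
MORPHEMES = [
    ("登録", "noun"), ("前", "affix"), ("の", "particle"),
    ("変換", "noun"), ("元", "affix"), ("パターン", "noun"),
    ("を", "particle"), ("出力", "noun"), ("する", "verb"), ("。", "symbol"),
]


def extract_affixes(morphemes, dictionary):
    """Return (position, surface, affix type) for every registered affix."""
    return [
        (position, surface, dictionary[surface])
        for position, (surface, _pos) in enumerate(morphemes)
        if surface in dictionary
    ]


print(extract_affixes(MORPHEMES, AFFIX_DICTIONARY))
# [(1, '前', 'B'), (4, '元', 'C')]
```

Keeping the position of each hit matters because the later check looks at the morphemes immediately before and after the affix.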

(4) Affix check
When the input sentence contains an affix to be inspected, the affix checking unit 314 applies the affix check rule stored in the diagnostic rule storage unit 318 to the morphological analysis result of the input sentence, and checks whether the result satisfies the determination criterion that the rule associates with the affix type of that affix (step S14). When the morphological analysis result satisfies the criterion, the affix checking unit 314 judges the affix to be an affix that is easily mistranslated.

For example, the affix type of the affix "元", which is under inspection in the input sentence "登録前の変換元パターンを出力する。", is "C" in the affix dictionary shown in FIG. 3. The affix checking unit 314 therefore compares the morphological analysis result against the determination criterion for affix type "C" in the affix check rule shown in FIG. 10.

In the example shown in FIG. 10, the determination criterion for affix type "C" is: "If the part of speech of the morpheme immediately before the affix is 'noun', the affix is judged to be easily mistranslated." In the morphological analysis result shown in FIG. 13, the part of speech of the morpheme "変換" ("conversion") immediately before the affix "元" is "noun", so the affix checking unit 314 judges the affix "元" to be an affix that is easily mistranslated.

On the other hand, the affix type of the affix "前", which is also under inspection in the input sentence "登録前の変換元パターンを出力する。", is "B" in the affix dictionary shown in FIG. 3. The affix checking unit 314 therefore compares the morphological analysis result against the determination criterion for affix type "B" in the affix check rule shown in FIG. 10.

In the example shown in FIG. 10, the determination criterion for affix type "B" is: "If the parts of speech of the morphemes immediately before and immediately after the affix are both 'noun', the affix is judged to be easily mistranslated." As shown in FIG. 13, the part of speech of the morpheme "登録" ("registration") immediately before the affix "前" is "noun", but the part of speech of the morpheme "の" immediately after it is not "noun". Accordingly, the affix checking unit 314 does not judge the affix "前" to be an affix that is easily mistranslated.
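The three criteria quoted from FIG. 10 (type "A": the following morpheme is a noun; type "B": both neighbours are nouns; type "C": the preceding morpheme is a noun) can be written as a small rule table. The Python sketch below is illustrative only: the names `RULES`, `is_noun` and `is_easily_mistranslated`, the English part-of-speech tags, and the morpheme list are assumptions, and only the tags of "登録", "変換" and "の" are taken from the text.

```python
# Assumed morphological analysis of the example sentence as
# (surface, part of speech) pairs; the tag names are illustrative.
MORPHEMES = [
    ("登録", "noun"), ("前", "affix"), ("の", "particle"),
    ("変換", "noun"), ("元", "affix"), ("パターン", "noun"),
    ("を", "particle"), ("出力", "noun"), ("する", "verb"), ("。", "symbol"),
]


def is_noun(morphemes, position):
    """True if a morpheme exists at this position and is tagged as a noun."""
    return 0 <= position < len(morphemes) and morphemes[position][1] == "noun"


# Determination criteria per affix type, as quoted from FIG. 10.
RULES = {
    "A": lambda m, i: is_noun(m, i + 1),
    "B": lambda m, i: is_noun(m, i - 1) and is_noun(m, i + 1),
    "C": lambda m, i: is_noun(m, i - 1),
}


def is_easily_mistranslated(morphemes, position, affix_type):
    """Apply the criterion registered for the given affix type."""
    return RULES[affix_type](morphemes, position)


print(is_easily_mistranslated(MORPHEMES, 4, "C"))  # 元: True
print(is_easily_mistranslated(MORPHEMES, 1, "B"))  # 前: False
```

For "元" the preceding morpheme "変換" is a noun, so the type "C" criterion holds; for "前" the following morpheme "の" is not a noun, so the type "B" criterion fails, in line with the walk-through above.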

(5) Compound word extraction
When the processing by the affix checking unit 314 shows that the input sentence contains an affix that is easily mistranslated, the compound word extraction unit 315 uses the compound word extraction rule stored in the diagnostic rule storage unit 318 to determine, based on the morphological analysis result of the input sentence, the boundary of the compound word containing that affix, and extracts the compound word (step S15).

For example, the compound word extraction unit 315 applies the compound word extraction rule shown in FIG. 11 to the morphological analysis result shown in FIG. 13 to determine the boundary of the compound word containing the affix "元". The rule shown in FIG. 11 is: "Group the morphemes before and after the affix whose part of speech is 'noun' into one compound word." Following this rule, the compound word extraction unit 315 extracts "変換元パターン" ("conversion source pattern") from the input sentence as the compound word containing the affix "元", which was judged to be easily mistranslated. Because this compound word contains an affix that is easily mistranslated, it is itself a compound word that is easily mistranslated.
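One possible reading of the FIG. 11 rule is sketched below in Python. It assumes that the compound word is the affix plus the unbroken run of noun morphemes on each side of it; the rule text does not say whether more than one noun on a side should be absorbed, so that extension is an assumption, as are the function name `extract_compound_from_morphemes`, the English part-of-speech tags, and the morpheme list.

```python
# Assumed morphological analysis of the example sentence as
# (surface, part of speech) pairs; the tag names are illustrative.
MORPHEMES = [
    ("登録", "noun"), ("前", "affix"), ("の", "particle"),
    ("変換", "noun"), ("元", "affix"), ("パターン", "noun"),
    ("を", "particle"), ("出力", "noun"), ("する", "verb"), ("。", "symbol"),
]


def extract_compound_from_morphemes(morphemes, affix_position):
    """Join the affix with the adjacent runs of noun morphemes."""
    start = affix_position
    # Extend left while the preceding morpheme is a noun.
    while start > 0 and morphemes[start - 1][1] == "noun":
        start -= 1
    end = affix_position
    # Extend right while the following morpheme is a noun.
    while end + 1 < len(morphemes) and morphemes[end + 1][1] == "noun":
        end += 1
    return "".join(surface for surface, _pos in morphemes[start:end + 1])


print(extract_compound_from_morphemes(MORPHEMES, 4))  # 変換元パターン
```

The left boundary stops at the particle "の" and the right boundary at the particle "を", which yields the same compound word as in the text.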

(6) Output
As in the first embodiment, the output unit 316 outputs a diagnosis result containing the easily mistranslated affix found by the affix checking unit 314 and the compound word extracted by the compound word extraction unit 315, as shown in FIG. 8 (step S16).

As described above, the document processing apparatus according to the second embodiment can find easily mistranslated compound words in the input sentence using only the morphological analysis result, without the syntax analysis used in the first embodiment. Because morphological analysis costs less to implement than syntax analysis, the second embodiment has the advantage of a lower implementation cost than the first embodiment.

Although the first and second embodiments use syntax analysis and morphological analysis, respectively, the analysis method is not limited to these. Any analysis suffices as long as it yields the information needed to extract compound words that are not registered in the dictionary and that contain an affix requiring attention.

According to each of these embodiments, it is possible to provide a document processing apparatus and a document processing program capable of determining portions of document data that are easily mistranslated.
The storage medium used to distribute the program may take any storage format, provided that it can store the program and is readable by a computer.

In addition, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing the above embodiments, based on instructions from the program installed on the computer from the storage medium.

Furthermore, the storage medium is not limited to a medium independent of the computer; it also includes a storage medium that stores, or temporarily stores, a program downloaded after transmission over a LAN, the Internet, or the like.

The number of storage media is not limited to one. A case in which the processing in the above embodiments is executed from a plurality of media is also covered by the storage medium of the present invention, and the media may have any configuration.

The computer executes each process in the above embodiments based on the program stored in the storage medium, and may have any configuration, such as a single apparatus like a personal computer or a system in which a plurality of apparatuses are connected over a network.

The term "computer" is not limited to a personal computer; it also covers arithmetic processing units, microcomputers, and the like included in information processing equipment, and collectively refers to equipment and apparatuses capable of realizing the functions of the present invention by means of a program.

Although several embodiments of the invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS: 10 ... computer, 20 ... external storage device, 21 ... program, 30 ... document processing apparatus, 31, 311 ... input unit, 32 ... syntax analysis unit, 33, 313 ... affix extraction unit, 34, 314 ... affix checking unit, 35, 315 ... compound word extraction unit, 36, 316 ... output unit, 37, 317 ... affix dictionary storage unit, 38, 318 ... diagnostic rule storage unit, 312 ... morphological analysis unit.

Claims

A document processing apparatus comprising:
input means for inputting document data;
analysis means for analyzing a sentence of the document data input by the input means;
affix extraction means for extracting a predetermined affix from the analysis result of the analysis means;
rule storage means for storing an affix check rule, in which a determination criterion for determining whether a word including the predetermined affix is easily mistranslated is associated with the predetermined affix, and a compound word extraction rule for extracting a compound word including an affix that is easily mistranslated;
determination means for determining, when a word including the affix extracted from the analysis result satisfies the determination criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated; and
compound word extraction means for extracting, when a word including the affix extracted from the analysis result satisfies the compound word inspection rule, a compound word including the extracted affix in accordance with the rule as a compound word that is easily mistranslated.

The document processing apparatus according to claim 1, further comprising affix dictionary storage means for storing an affix dictionary in which a heading of a predetermined affix is associated with the affix type of that heading,
wherein, when a word included in the analysis result of the analysis means matches the heading of a predetermined affix in the affix dictionary, the affix extraction means extracts the matching word as the predetermined affix.

The document processing apparatus according to claim 1, wherein
the analysis means performs syntax analysis on the sentence of the document data input by the input means,
the determination means determines, when a word including the affix extracted from the result of the syntax analysis satisfies the determination criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated, and
the compound word extraction means extracts, when a word including the affix extracted from the result of the syntax analysis satisfies the compound word inspection rule, a compound word including the extracted affix in accordance with the rule as a compound word that is easily mistranslated.

The document processing apparatus according to claim 1, wherein
the analysis means performs morphological analysis on the sentence of the document data input by the input means,
the determination means determines, when a word including the affix extracted from the result of the morphological analysis satisfies the determination criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated, and
the compound word extraction means extracts, when a word including the affix extracted from the result of the morphological analysis satisfies the compound word inspection rule, a compound word including the extracted affix in accordance with the rule as a compound word that is easily mistranslated.

A document processing program for causing a computer to function as:
input means for inputting document data;
analysis means for analyzing a sentence of the document data input by the input means;
affix extraction means for extracting a predetermined affix from the analysis result of the analysis means;
rule storage means for storing an affix check rule, in which a determination criterion for determining whether a word including the predetermined affix is easily mistranslated is associated with the predetermined affix, and a compound word extraction rule for extracting a compound word including an affix that is easily mistranslated;
determination means for determining, when a word including the affix extracted from the analysis result satisfies the determination criterion in the affix check rule, that the extracted affix is an affix that is easily mistranslated; and
compound word extraction means for extracting, when a word including the affix extracted from the analysis result satisfies the compound word inspection rule, a compound word including the extracted affix in accordance with the rule as a compound word that is easily mistranslated.